6  Measurement and Construct Validation

Measurement is one of the hardest parts of science and, when done poorly, has implications for the validity of our inferences (Flake et al., 2020; Gelman, 2015). The more you learn about a discipline or an applied problem, the deeper you’ll travel down the measurement rabbit hole. Debates about definitions and data are unavoidable. Even something as seemingly black and white as death is hard to measure completely, consistently, and correctly. Let’s take the COVID-19 pandemic as an example.

Gelman, A. (2015). What’s the most important thing in statistics that’s not in the textbooks?
Whittaker, C. et al. (2021). Under-reporting of deaths limits our understanding of true burden of covid-19. BMJ, 375.
World Health Organization. (2021). WHO civil registration and vital statistics strategic implementation plan 2021-2025.
Our World in Data. (2023a). Coronavirus (COVID-19) Deaths.

There are several challenges to counting the number of COVID-19 deaths globally since the start of the pandemic (Figure 6.1). First, some countries do not have robust civil registration and vital statistics systems that register all births, deaths, and causes of deaths (Whittaker et al., 2021; World Health Organization, 2021). Many people die without any record of having lived. Second, an unknown number of deaths that should be attributed to COVID-19 are not, which means we can only count “confirmed deaths” (Our World in Data, 2023a). In part this is due to limited testing capacity in many settings. It might also stem from deliberate misreporting for political or economic reasons.

Figure 6.1: Cumulative confirmed COVID-19 deaths.

Watch this CDC explainer about how to certify deaths due to COVID-19.

There is also the challenge of defining what counts as a COVID-19 death. Did someone die “with” COVID-19 or “from” COVID-19? All cause-of-death determinations are judgment calls, and COVID-19 deaths are no different (CDC, 2020). When someone dies in the U.S., for instance, a physician, medical examiner, or coroner completes a death certificate and reports the death to a national registry.1 States are encouraged to use the Standardized Certificate of Death form, which asks the certifier to list an immediate cause of death and any underlying causes that contributed to a person’s death.2 COVID-19 is often an underlying cause of death (e.g., COVID-19 ➝ pneumonia), and it’s up to the certifier to make this determination. This is not easy in many cases because COVID-19 can lead to a cascade of health problems in the short and long term (Boyle, 2021).

CDC. (2020). Reporting and Coding Deaths Due to COVID-19.
Boyle, P. (2021). How are COVID-19 deaths counted? It’s complicated.

The list of challenges is long. Two more examples from the pandemic that you might observe in your own work: (1) Definitions change over time, making time-series data complicated; (2) Reporting is typically delayed, leading to changing totals that can fuel mistrust.

Our World in Data. (2023b). Excess mortality during the Coronavirus pandemic (COVID-19).
Hay, S. I. et al. (2023). Conflicting COVID-19 excess mortality estimates – Authors’ reply. The Lancet, 401(10375), 433–434.

Given the challenges in counting all COVID-19 deaths, some scholars and policymakers prefer the metric of excess mortality, which represents the number of deaths from any cause that are above and beyond historical mortality trends for the period. Excess mortality counts confirmed COVID-19 deaths, ‘missing’ COVID-19 deaths, and deaths from any other causes that might be higher because of the pandemic. But the quantification of excess mortality has its own challenges (Our World in Data, 2023b). Among them is that excess mortality is estimated with statistical models, so differences in input data and modeling approaches can lead to different estimates (Hay et al., 2023). Figure 6.2 shows separate analyses by the World Health Organization and The Economist that both conclude excess mortality is far above the number of confirmed COVID-19 deaths globally, but differ by about 1 million in the central estimates.3

Figure 6.2: Estimated cumulative excess deaths, from The Economist and the WHO.

My goal here is not to depress you or make you question if we can ever truly measure anything. There is a time and place for a good existential crisis about science, but this chapter is not it (see Chapter X instead). Rather, my goal is to convince you of the importance of thinking hard about measurement, and to give you some frameworks for doing so.

6.1 Using Conceptual Models to Plan Study Measurement

It’s frustrating to get to the analysis phase and realize you are missing key variables. Ask me how I know. The best protection against this outcome is to plan your analysis upfront, including data simulation and mocking up tables and figures. It’s more effort in the study design phase, but not more effort overall.

The first step in planning study measurement is to decide what to measure. Consider starting this process by creating a conceptual model, such as a DAG or a theory of change. Conceptual models can help you identify what data you must collect or obtain to answer your research question.

DAG EXAMPLE

For instance, Barnard-Mayers et al. (2022) created the DAG shown in Figure 6.3 to plan a causal analysis of the effect of the human papillomavirus (HPV) vaccine on abnormal cervical cytology among girls with perinatal HIV infection. The process of creating the DAG helped the authors to (i) identify a sufficient adjustment set that closes all backdoor paths, and (ii) understand the limitations of the available dataset. With this DAG in hand, they realized that they lacked good data on sexual history, structural racism, and maternal history, leaving their proposed analysis susceptible to confounding. It was back to the drawing board.

Figure 6.3: A proposed causal DAG by Barnard-Mayers et al. (2022), at ghr.link/hpv.

LOGIC MODEL EXAMPLE

For intervention studies, a theory of change or logic model can aid in measurement planning. To see this in practice, let’s return to the paper by Patel et al. (2017), introduced in Chapter X, that reports on the results of a randomized controlled trial in India to test the efficacy of a lay counsellor-delivered, brief psychological treatment for severe depression called the Healthy Activity Program, or HAP. I encourage you to pause here, read the article, and create your own logic model for the HAP intervention. Figure 6.4 displays my understanding of the HAP logic model.

For a refresher on conceptual models, see Chapter X.

Figure 6.4: HAP logic model.

Inputs

Inputs are the resources needed to implement a program. HAP inputs included money and the intervention curriculum (Anand et al., 2013). A key part of any cost-effectiveness analysis is an accounting of expenditures (see Chapter X). On average, HAP cost $66 per person to deliver.

Anand, A. et al. (2013). Healthy Activity Program.

Activities

Watch Dr. Vikram Patel talk about task sharing to scale up the delivery of mental health services in low-income settings.

The main HAP activities were psychotherapy for patients and supervision of lay counselors. HAP was designed to be delivered in an individual, face-to-face format (telephone when necessary) over 6 to 8 weekly sessions each lasting 30 to 40 minutes (Chowdhary et al., 2016). Supervision consisted of weekly peer-led group supervision and twice monthly individual supervision.

Chowdhary, N. et al. (2016). The Healthy Activity Program lay counsellor delivered treatment for severe depression in India: Systematic development and randomised evaluation. The British Journal of Psychiatry, 208(4), 381–388.

In intervention studies like this, it’s important to determine if the intervention was delivered as intended. This is called treatment fidelity, and it’s a measure of how closely the actual implementation of a treatment or program reflects the intended design. The study authors measured fidelity in several ways, including external ratings of a randomly selected 10% of all intervention sessions. An expert not involved in the program listened to recorded sessions and compared session content against the HAP manual. They also had counselors document the duration of each session.

Implementation failure vs theory failure

Low treatment fidelity usually results in an attenuation (i.e., shrinking) of treatment effects, which is a threat to internal validity. If the study shows no effect but treatment fidelity is low, the null result may not be valid. Implementation failure rather than theory or program failure could be to blame. Low fidelity is also a threat to external validity because it becomes unclear what intervention was actually delivered, making it impossible to truly replicate the study.

Outputs

Outputs are counts of activities. Patel et al. counted the number of sessions delivered to patients in the treatment arm, as well as the number of these patients who completed the program (69% had a planned discharge). Presumably they also tracked the number of counselors trained and supervision sessions conducted.4

Non-Compliance with study assignment

Outputs are key reporting metrics for staff working on the “Monitoring” side of M&E units (see Chapter 1), but intervention researchers also need access to monitoring data to understand and describe how the program was delivered. This is related to treatment fidelity, but also to treatment compliance. Treatment compliance is a measure of the extent to which people were treated or not treated according to their study assignment. Sometimes people assigned to the treatment group do not take up the treatment, or they complete only part of the planned intervention.

Noncompliance to randomization on the treatment side is called one-sided noncompliance. When members of the control group5 are also noncompliant with randomization, meaning they are treated despite being assigned to the no treatment condition, this is called two-sided noncompliance. Analysis strategies for one- and two-sided noncompliance are discussed in a later chapter.

Treatment compliance is one of those terms like “research subjects” that can make you cringe a bit. Research is voluntary, and participants have the right to decline a treatment offer or stop treatment at any point. We call this behavior “non-compliance”, which sounds to some like participants are misbehaving. Non-compliance can make your analysis and interpretation more complicated, but participants are not to blame. Look instead to the root causes in the research design, study procedures, or the intervention itself to find ways to limit non-compliance.

Outcomes and impacts

The hypothesized outcome in the HAP study was a reduction in depression:

The two primary outcomes were depression severity assessed by the modified Beck Depression Inventory version II (BDI-II) and remission from depression as defined by a PHQ-9 score of less than 10, both assessed 3 months after enrollment.

The authors assumed that the long-term impact of reducing depression at scale would be improvements in quality of life for patients and their families, increased workforce productivity, and a reduction in costs to society.

FROM CONCEPTUAL MODEL TO STUDY MEASUREMENT

Both of these examples provide a solid foundation for measurement planning. If you or I were designing the HAP study, for instance, the example logic model would tell us that we need to collect or obtain data about the following:

  • Expenditures
  • Measures of treatment fidelity
  • Counts of therapy sessions completed, supervision sessions held
  • Measures of depression and several secondary outcomes

Generating this list is the first step. The next step involves a deep dive on the specifics of measurement. For instance, what is depression, and how can it be quantified?

Long-term impacts often not measured

Missing from this list are proposed impacts: measures of quality of life, productivity, and societal costs. This is because long-term impacts are often assumed but not measured. In most cases, study timelines are too short to detect long-term (or distal) effects. Often, the best we can do is measure proximal effects in the near term. For instance, if the proposed mechanism of change is intervention ➝ more days in school ➝ greater earnings as adults, a study would need to collect data after a few decades! There are examples of long running cohort studies that follow people over big spans of time, but these studies are exceptions.

6.2 Measurement Terminology

Table 6.1 lists four measurement terms to know.

Table 6.1: Common measurement terms, adapted from Glennerster and Takavarasha (2013).
Construct: A characteristic, behavior, or phenomenon to be assessed and studied. Often cannot be measured directly (latent). Example: depression.
Outcome (or Endpoint): The hypothesized result of an intervention, policy, program, or exposure. Example: decreased depression severity.
Indicator: An observable measure of an outcome or other study construct. Example: depression severity score.
Instrument: The tool used to measure an indicator. Example: a depression scale (questionnaire) made up of questions (items) about symptoms of depression that is used to calculate a severity score.

CONSTRUCTS

At the top of the list are study constructs. Constructs are the high-level characteristics, behaviors, or phenomena you investigate in a study. Constructs are the answer to the question, “What is your study about?”.

In the Patel et al. (2017) example, the key construct of interest was depression. Constructs like depression have no direct measure; there is not (yet) a blood test that indicates whether someone is depressed or “has depression”. Therefore, depression is an example of a latent construct. Many constructs in the social and behavioral sciences are latent constructs—such as empowerment, corruption, democracy.

OUTCOMES AND ENDPOINTS

Outcomes (or endpoints) are the specific aspects of a construct that you are investigating. In intervention research, program evaluation, and causal inference more generally, outcomes are often framed as the hypothesized result of an intervention, policy, program, or exposure. In a theory of change or logic model, outcomes take on the language of change: increases and decreases.

In the clinical trial literature, study targets are also called endpoints. Death. Survival. Time to disease onset. Blood pressure. Tumor shrinkage. These are all endpoints you might seek to measure after offering some treatment (e.g., an experimental drug).

You’ll also see outcomes and endpoints referred to as dependent variables, response variables, 𝑌, or left-hand side variables (referring to an equation).

Most studies are designed to generate evidence about one or two primary outcomes linked directly to the main study objective. In the HAP study, Patel et al. (2017) hypothesized two primary outcomes: a reduction in severe depression and a reduction in the prevalence of depression.

Secondary outcomes may be registered, investigated, and reported as well, but these analyses will often be framed as exploratory in nature if the study design is not ideal for measuring these additional outcomes. Patel et al. (2017) specified several secondary outcomes, including increases in behavioural activation and reductions in disability, total days unable to work, suicidal thoughts or attempts, intimate partner violence, and resource use and costs of illness (Patel et al., 2014).

INDICATORS AND INSTRUMENTS

Indicators are observable metrics of outcomes, endpoints, or other study constructs.

The language of qualitative studies is a bit different. These studies emphasize study constructs, but not indicators or measures. Quantification is not typically the goal.

Indicator categories

In intervention and evaluation research—which is only a subset of quantitative research, remember—there are three main categories of indicators:

  1. Input indicators: Measures of what is needed to implement the program.
  2. Process indicators: Measures of program implementation (e.g., fidelity).
  3. Outcome indicators: Measures of the program or study endpoints (hypothesized outcomes or impacts).

Indicators and instruments go together. Instruments are the tools used to measure indicators. Instruments can take many forms, including surveys, questionnaires, environmental sensors, anthropometric measures, blood tests, imaging, satellite imagery, and the list goes on.

Returning to the HAP study, Patel et al. (2017) hypothesized that the intervention would reduce severe depression and the prevalence of depression (the primary outcomes). Table 6.2 summarizes how each was operationalized and measured.

Table 6.2: HAP outcomes, indicators, and instruments.
Depression severity: a continuous measure of severity where higher scores indicate that someone is experiencing more severe symptoms of depression. Instrument: the Beck Depression Inventory, version II (BDI-II).
Depression prevalence: a binary indicator of the presence of depression, defined as a score of 10 or higher on the Patient Health Questionnaire-9. Instrument: the Patient Health Questionnaire-9 (PHQ-9).

The HAP study instruments are reproduced in Figure 6.5. The authors measured depression with two instruments: (i) the 21-item Beck Depression Inventory version II (Beck et al., 1996); and (ii) the 9-item Patient Health Questionnaire-9 (Kroenke et al., 2002).

Beck, A. T. et al. (1996). Manual for the beck depression inventory-II. Psychological Corporation.
Kroenke, K. et al. (2002). The PHQ-9: A new depression diagnostic and severity measure. Psychiatric Annals, 32(9), 509–515.

Responses to each BDI-II item are scored on a scale of 0 to 3 and summed to create an overall depression severity score that can range from 0 to 63, where higher scores indicate more severe depression. PHQ-9 responses are also scored on a scale of 0 to 3 based on the frequency of symptoms and summed to create a total score with a possible range of 0 to 27. Based on prior clinical research, the HAP authors defined the cutoff for depression as a score of at least 10 on the PHQ-9.
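
To make the scoring concrete, here is a minimal sketch in R of how a PHQ-9 total score and the binary depression indicator might be constructed. The item responses are hypothetical, and the cutoff of 10 or higher follows the HAP definition described above.

# hypothetical PHQ-9 responses (each item scored 0 to 3) for three people
  phq <- data.frame(
    phq1 = c(0, 2, 3), phq2 = c(1, 2, 3), phq3 = c(0, 1, 3),
    phq4 = c(0, 2, 2), phq5 = c(0, 1, 3), phq6 = c(0, 1, 2),
    phq7 = c(0, 2, 3), phq8 = c(0, 1, 2), phq9 = c(0, 0, 1)
  )

# total score (possible range 0 to 27)
  phq$phq_total <- rowSums(phq[, paste0("phq", 1:9)])

# binary indicator of depression using the cutoff of 10 or higher
  phq$depressed <- as.integer(phq$phq_total >= 10)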

Figure 6.5: Instruments used in Patel et al. (2017). Source: ghr.link/ins.

6.3 What Makes a Good Indicator?

When you select and define indicators of outcomes and other key variables, this is called operationalizing your constructs. Operationalization is a critical part of measurement planning. When you finish the study and present your findings, one of the first things colleagues will ask is, “How did you define and measure your outcome?” Hopefully you can say that your indicators are DREAMY™.

SMART is another acronym worth knowing. SMART indicators are Specific, Measurable, Achievable, Relevant, and Time-Bound.

Defined: clearly specified
Relevant: related to the construct
Expedient: feasible to obtain
Accurate: valid measure of the construct
Measurable: able to be quantified
customarY: recognized standard

DEFINED

It’s important to clearly specify and define all study variables, especially the indicators of primary outcomes. This is a basic requirement that enables a reader to critically appraise the work, and it serves as a building block for future replication attempts.

Patel et al. (2017) defined two indicators of depression:

  1. Depression severity: BDI-II total score measured at 3 months after the treatment arm completed the intervention
  2. Depression prevalence: the proportion of participants scoring 10 or higher on the PHQ-9 total score measured at 3 months post intervention

RELEVANT

Indicators should be relevant to the construct of interest. For instance, scores on the BDI-II and PHQ-9 are clearly measures of depression. An example of an irrelevant indicator would be a total score on the Beck Anxiety Inventory, a separate measure of anxiety. While anxiety and depression are often comorbid, anxiety is a distinct construct.

EXPEDIENT

It should be feasible to collect data on the indicator given a specific set of resource constraints. Asking participants to complete a 21-item questionnaire and a 9-item questionnaire, as in the HAP study, does not represent a large burden on study staff or participants. However, collecting and analyzing biological samples (e.g., hair, saliva, or blood) might be too difficult in some settings.

ACCURATE

Accurate is another word for “valid”. Indicators must be valid measures of study constructs (a topic discussed extensively later in this chapter). When deciding on indicators and instruments, the HAP authors had to ask themselves whether scores on the BDI-II and PHQ-9 distinguish between depressed and non-depressed people in their target population. The authors cited their own previous work to support the decision to use these instruments (Patel et al., 2008).

Patel, V. et al. (2008). Detecting common mental disorders in primary care in India: A comparison of five screening questionnaires. Psychological Medicine, 38(2), 221–228.

MEASURABLE

Indicators must be quantifiable. Psychological constructs like depression are often measured using questionnaires like the BDI-II and the PHQ-9. Other constructs require more creativity. For instance, how would you measure government corruption? Asking officials to tell on themselves isn’t likely to yield a helpful answer. Olken (2005) took a different approach in Indonesia: his team dug core samples of newly built roads to estimate the true construction costs, then compared those estimates to the government’s reported construction expenditures to construct a measure of corruption (reported expenditures > estimated costs).

Olken, B. A. (2005). Monitoring corruption: Evidence from a field experiment in Indonesia. National Bureau of Economic Research.

CUSTOMARY

In general, it’s good advice to use standard indicators, follow existing approaches, and adopt instruments that have already been established in a research field. There are several ways to do this.

Sometimes the status quo stinks and you’ll want to conduct a study to overcome the limitations of the standard methods.

One way is to read the literature and find articles that measure your target constructs. For example, if you’re planning an impact evaluation of a microfinance program on poverty reduction and wish to publish the results in an economics journal, start by reading highly cited work by other economists to understand current best practices. How do these scholars define and measure outcomes like income, consumption, and wealth?

Systematic reviews and methods papers are also good resources for learning about measurement. For instance, Karyotaki et al. (2022) critically appraised several task sharing mental health interventions, including HAP. Their review is a resource for understanding how depression is operationalized and measured across studies. Larsen et al. (2021) evaluated nine commonly used depression screening tools in sub-Saharan Africa on the basis of their diagnostic performance, cultural adaptation, and ease of implementation. If you wanted to measure depression in a target population in sub-Saharan Africa, their paper should be high on your reading list.

Karyotaki, E. et al. (2022). Association of task-shared psychological interventions with depression outcomes in low- and middle-income countries: A systematic review and individual patient data meta-analysis. JAMA Psychiatry.
Larsen, A. et al. (2021). Is there an optimal screening tool for identifying perinatal depression within clinical settings of sub-Saharan Africa? SSM-Mental Health, 1, 100015.
WHO. (2018). Global reference list of 100 core health indicators. World Health Organization.
United Nations. (2023). SDG indicators.

A third approach is to search for nationally or internationally recognized standards. If studying population health, for instance, a good source of customary indicators is the World Health Organization’s Global Reference List of the 100 core health indicators (WHO, 2018). Another good source of customary indicators for population health is the United Nations Sustainable Development Goals (SDG) metadata repository, which includes 231 unique indicators to measure 169 targets for 17 goals (United Nations, 2023).

6.4 Constructing Indicators

Some indicators are based on a single measurement and require only a definition. For instance, a hemoglobin level of less than 7.0 g/dl is an indicator of severe anemia. If you were evaluating the impact of a new diet on severe anemia, you would need only to record the result of a blood test (instrument). Lucky you. Most indicators are more complex and must be constructed.
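
Before moving on to constructed indicators, here is what the simple, definition-only case looks like in code: a minimal sketch with hypothetical hemoglobin values.

# hypothetical hemoglobin results in g/dl
  hemoglobin <- c(6.2, 7.5, 11.0, 6.9)

# indicator of severe anemia: hemoglobin below 7.0 g/dl
  severe_anemia <- as.integer(hemoglobin < 7.0)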

NUMERATORS AND DENOMINATORS

Population-level global health indicators often involve numerators and denominators. For instance, the WHO defines the maternal mortality ratio as (World Health Organization):

World Health Organization. Maternal mortality ratio (per 100 000 live births).

The denominator for the maternal mortality rate is the number of women of reproductive age.

the number of maternal deaths during a given time period per 100,000 live births during the same time period

Constructing this indicator requires data on the counts of maternal deaths and live births. Each has a precise definition.

Maternal deaths: The annual number of female deaths from any cause related to or aggravated by pregnancy or its management (excluding accidental or incidental causes) during pregnancy and childbirth or within 42 days of termination of pregnancy, irrespective of the duration and site of the pregnancy.
Live births: The complete expulsion or extraction from its mother of a product of conception, irrespective of the duration of the pregnancy, which, after such separation, breathes or shows any other evidence of life such as beating of the heart, pulsation of the umbilical cord, or definite movement of voluntary muscles, whether or not the umbilical cord has been cut or the placenta is attached.
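
Once both counts are in hand, the arithmetic is straightforward. A minimal sketch with made-up numbers:

# hypothetical counts for one country and year
  maternal_deaths <- 1200
  live_births     <- 450000

# maternal mortality ratio per 100,000 live births
  mmr <- maternal_deaths / live_births * 100000  # 266.7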

COMPOSITE INDICATORS

Latent (unobservable) constructs like empowerment, quality of life, and depression, and some manifest (observable) constructs like wealth, are often measured with multiple items on surveys or questionnaires and then combined into indexes or scales.

The terms index and scale are often used interchangeably, but they are not quite synonyms. While they share in common the fact that multiple items or observations go into their construction, making them composite measures or composite indicators, the method for and purpose of combining these items or observations are distinct (see Figure 6.6).

In an index, indicators give rise to the construct that is being measured. For example, a household’s wealth is determined by the assets it owns (e.g., livestock, floor quality). Conversely, in a scale, the indicators exist because of the construct. Depression manifests in symptoms such as a loss of appetite.

Figure 6.6: Scale vs index.

Indexes

Watch this introduction to the Equity Tool, a wealth index alternative.

Indexes combine items into an overall composite, often without concern for how the individual items relate to each other. For instance, the Dow Jones Industrial Average is a stock-market index that represents a scaled average of stock prices of 30 major U.S. companies such as Disney and McDonald’s. Companies with larger share prices have more influence on the index. The Dow Jones is a popular indicator of market strength and is constantly monitored during trading hours.

An index popular in the global health field is the wealth index. The wealth index uses household survey data on assets as a measure of household economic status (Rutstein et al., 2004). The data come from national surveys conducted by the Demographic and Health Surveys (DHS) Program (DHS, n.d.). Asset variables include individual and household assets (e.g., phone, television, car), land ownership, and dwelling characteristics, such as water and sanitation facilities, housing materials (i.e., wall, floor, roof), persons sleeping per room, and cooking facilities. A household gets an overall score that is the sum of the weights for having (or not having) each asset.

Rutstein, S. O. et al. (2004). The DHS wealth index (DHS Comparative Reports No. 6). ORC Macro.
DHS. (n.d.). Demographic and Health Surveys (DHS) Program.

A common way to present wealth index scores is to divide the sample distribution into quintiles. Each household in the sample falls into 1 of 5 wealth quintiles reflecting their economic status relative to the sample, from poorest (1st quintile) to richest (5th quintile).
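
Here is a minimal sketch of how quintiles might be assigned from a vector of wealth index scores; the scores are simulated for illustration.

# simulated wealth index scores, one per household
  set.seed(1)
  wealth_score <- rnorm(1000)

# assign each household to a quintile (1 = poorest, 5 = richest)
  wealth_quintile <- cut(
    wealth_score,
    breaks = quantile(wealth_score, probs = seq(0, 1, by = 0.2)),
    include.lowest = TRUE,
    labels = 1:5
  )
  table(wealth_quintile)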

Watch Dr. Selim Jahan, former lead author of the Human Development Report, discuss the history of the Human Development Index.

Another widely known index among global health practitioners is the Human Development Index, or HDI (UNDP, n.d.). It combines country-level data on three dimensions: (i) life expectancy at birth; (ii) expected years of schooling for kids entering school and mean years of schooling completed by adults; and (iii) Gross National Income per capita. The HDI is produced by the United Nations Development Program.

UNDP. (n.d.). Human Development Index.

Figure 6.7: Human Development Index. Source: UNDP.

Scales

Scales also combine items into an overall composite, but unlike the components of most indexes, scale items should be intercorrelated because they stem from a common, latent cause. For instance, the BDI-II used in the HAP trial was designed to measure the construct of depression with 21 related items that ask people to report on their experience of common problems. Figure 6.8 shows how these items were correlated in the Patel et al. (2017) data.6

Figure 6.8: BDI-II correlations in the HAP trial.

This heatmap visualizes data from participants in the control arm, collected three months after the treatment arm completed the HAP program. These correlations can range from -1 to 1, and most item pairs show moderately sized, positive correlations in the 0.2 to 0.3 range. An exception on the low side is the pair bdi02 (tiredness) and bdi04 (changes in appetite): with a correlation coefficient of 0, there appears to be no association between tiredness and changes in appetite. bdi04 also shows small correlations with several other items, which could suggest that, in this population, changes in appetite often do not manifest alongside other symptoms of depression.

Does this mean that bdi04 (changes in appetite) should have been excluded from the outcome measure? Possibly. I’ll return to this question later in this chapter.

Index and Scale Construction

A key decision in creating composite indicators like the wealth index and the BDI-II scale score is how to weight the individual components. In the case of the wealth index, should owning a car be given the same weight as owning a phone in the construction of wealth? When it comes to measuring depression severity, should feeling sad be given the same weight as feeling suicidal?

The answer is ‘no’ for the wealth index. The wealth index is constructed using relative weights derived from a data reduction technique called principal component analysis, or PCA, in which indicators are standardized (i.e., transformed into z-scores) so that they each have a mean of 0 and a standard deviation of 1. A principal component is a linear combination of the original indicators; thus, every indicator (e.g., yes/no response to owning a phone) has a loading factor that represents the correlation between the individual indicator and the principal component. The first principal component always explains the most variance, so the factor loadings on the first principal component are assigned as the weights for each asset in the index. A household gets an overall score that is the sum of the weights for having (or not having) each asset.
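
Here is a rough sketch of the general idea in R, using prcomp() and a few made-up binary asset indicators; the actual DHS procedure includes additional steps (e.g., separate urban and rural indices) not shown here.

# hypothetical 0/1 asset indicators, one row per household
  set.seed(1)
  assets <- data.frame(
    phone          = rbinom(500, 1, 0.6),
    television     = rbinom(500, 1, 0.4),
    car            = rbinom(500, 1, 0.1),
    improved_floor = rbinom(500, 1, 0.5)
  )

# standardize the indicators and run a principal component analysis
  pca <- prcomp(assets, center = TRUE, scale. = TRUE)

# loadings on the first principal component serve as the asset weights
  weights <- pca$rotation[, 1]

# a household's score is the weighted sum of its standardized indicators,
# which is the first principal component score
  wealth_score <- pca$x[, 1]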

Visit the DHS Program’s website for country files and detailed instructions to construct the wealth index.

Let’s look at an example. Figure 6.9 shows how one item—a household’s water source—contributes to the construction of the overall index. On the left is the wording of the survey item asked of each household in the India 2015-16 DHS Survey. On the right are the PCA results provided by the DHS Program. If a respondent said their household has water piped into its dwelling, that household receives the “has” value for piped water plus the “does not have” values for every other water source category; this is only one item’s contribution to the total score. Scan the values and you’ll see that safer sources of water such as piped water into the dwelling get higher scores (0.102) compared to possibly contaminated sources such as surface water from a lake or stream (-0.055). Summing these relative weights across all items yields the overall wealth index score for each household.

Figure 6.9: Wealth index construction example.

In contrast, the BDI-II—and many scales like it—use equal weighting (or unit weighting). Responses to each BDI-II item are scored on a scale of 0 to 3 and summed to create an overall depression severity score that can range from 0 to 63. Figure 6.10 shows data from four people in the HAP study. Each item contributes equally to the sum score (bdi_total).

The possible range is 0 to 60 in the case of Patel et al. (2017) because they purposively omitted one item.

Figure 6.10: BDI-II data example.
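
With equal weighting, constructing the score is just a row sum. A minimal sketch, assuming the item responses sit in a data frame df with columns bdi01 through bdi20, as in the HAP data:

# equal-weight (unit-weight) sum score across the 20 BDI-II items
  bdi_items <- paste0("bdi", sprintf("%02d", 1:20))
  df$bdi_total <- rowSums(df[, bdi_items])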

An alternative construction method for scales like the BDI-II is optimal weighting with a congeneric model (McNeish et al., 2020). In this method, items more closely related to the construct are weighted more heavily in the construction of the scale score. We can do this in R using a package for confirmatory factor analysis such as the {lavaan} package (Rosseel, 2012).

Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36.
# specify the model
  model <- '
    total_factor =~ bdi01 + bdi02 + bdi03 + bdi04 + bdi05 +
                    bdi06 + bdi07 + bdi08 + bdi09 + bdi10 +
                    bdi11 + bdi12 + bdi13 + bdi14 + bdi15 +
                    bdi16 + bdi17 + bdi18 + bdi19 + bdi20
  '

# fit the model
  fit <- lavaan::cfa(
    model, data = df, ordered = TRUE, std.lv = TRUE
  )

# plot a path diagram with model coefficients
  lavaanPlot::lavaanPlot(
    model = fit,
    node_options = list(shape = "box"),
    edge_options = list(color = "grey"),
    labels = list(total_factor = "Depression Severity"),
    coefs = TRUE
  )

Figure 6.11 shows a path diagram of a congeneric factor model for the BDI-II data from the HAP trial control arm at the 3-month endline. The loadings from the latent depression severity score are uniquely estimated for each item. The items bdi14 (feeling like a failure) and bdi17 (indecisiveness) have the highest loading of 0.76, meaning these items are most closely related to the construct of depression severity. Thus, these items contribute the most to the overall factor score. bdi04 (changes in appetite) has the weakest relationship to the construct and contributes the least to the factor score. (Remember, in the commonly used equal weighting approach, these items would contribute equally.)

Figure 6.11: Path diagram of a congeneric factor model for the BDI-II fit to data from the HAP trial control arm at the 3-month endline.

Does it matter which construction method you choose, equal weighting sum scores or optimal weighting factor scores? McNeish et al. (2020) argue that it can. Figure 6.12 demonstrates that two people with the same sum score can have substantially different factor scores.

Higher sum scores and higher factor scores represent more severe depression. The factor score is on a standardized Z-scale, so negative numbers indicate below average severity.

Figure 6.12: Sum scores vs factor scores. This plot highlights two individuals with equal sum scores (11) but different factor scores (-0.63 vs -1.35).

Two people with identical sum scores but different factor scores are highlighted in orange in Figure 6.12. In Figure 6.13 I reproduce their responses to each BDI item. Person A endorsed several items with the highest loadings, which helps to explain their more severe factor score. Person B endorsed fewer items overall, but endorsed them strongly.

Figure 6.13: Different pattern of item endorsement from two people with identical BDI-II sum scores. 0=no endorsement. 3=highest endorsement.

So do you think Person A and B have the same level of depression severity, as suggested by equivalent sum scores of 11? Or are they experiencing differing levels of severity as suggested by the factor scores? If you said different, you might prefer to construct optimally weighted factor scores, especially if you are using the score as an outcome indicator in a trial.
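
If you do opt for factor scores, they can be extracted from a fitted {lavaan} model. A minimal sketch, assuming df and the congeneric model object fit from the earlier code block:

# equal-weight sum scores
  bdi_items <- paste0("bdi", sprintf("%02d", 1:20))
  sum_scores <- rowSums(df[, bdi_items])

# optimally weighted factor scores from the congeneric model
  factor_scores <- lavaan::lavPredict(fit)[, "total_factor"]

# compare the two scoring approaches
  plot(sum_scores, factor_scores,
       xlab = "Sum score", ylab = "Factor score")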

In a clinical setting the ease of calculating sum scores might outweigh other considerations.

McNeish, D. et al. (2020). Thinking twice about sum scores. Behavior Research Methods, 52, 2287–2305.

McNeish et al. (2020) make some helpful suggestions that you can revisit when you are faced with a decision about how to construct scale scores. For now, I think a good takeaway is that measurement is complicated and we should think hard about what numbers mean and how scores are constructed. The next two sections will help you do just that.

6.5 Construct Validation

This Open Science Framework project website hosts a comprehensive reading list on study measurement.

Construct validation is the process of establishing that the numbers we generate with a method of measurement actually represent the idea—the construct—that we wish to measure (Cronbach et al., 1955). Using the BDI-II example, where sum scale scores can range from 0 to 63, validating this construct means establishing that higher numbers correspond with greater depression severity, or that scores above a certain threshold, such as 29 (out of 63), correctly classify someone as having “severe depression”. If our numbers don’t mean what we think they mean, our analyses don’t either.

Cronbach, L. J. et al. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281.
Flake, J. K. et al. (2022). Construct validity and the validity of replication studies: A systematic review. American Psychologist, 77(4), 576.

You might be thinking that construct validation is not a top concern in your work because you, a principled scientist, are using “validated” scales. If so, you’d be wrong. Validity is not a property of an instrument. Flake et al. (2022) make this point clearly:

Thus, validity is not a binary property of an instrument, but instead a judgment made about the score interpretation based on a body of accumulating evidence that should continue to amass whenever the instrument is in use. Accordingly, ongoing validation is necessary because the same instrument can be used in different contexts or for different purposes and evidence that the interpretation of scores generalizes to those new contexts is needed.

I won’t argue that you must always start from scratch to validate the instruments you select, but it’s important to think critically about why you believe an instrument will produce valid results in your context. For instance, if you are using an instrument originally validated with a sample of 200 white women in one small city in America, what gives you confidence that the numbers produced carry the same meaning in rural India?

PHASES OF CONSTRUCT VALIDATION

Loevinger (1957) outlined three phases of construct validation: substantive, structural, and external. The goal of the substantive phase is to ensure that the content of the instrument is comprehensive and presented in a manner that makes sense to people. Once this phase is satisfied, you move on to the structural phase, where you gather data and analyze the psychometric properties of the instrument and its items. If these are acceptable, you proceed to the external phase, where scores generated by the instrument are compared to other instruments or criteria. Table 6.3 provides examples of validity evidence for each phase (inspired by Flake et al. (2017)).

Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3(3), 635–694.
Flake, J. K. et al. (2017). Construct validation in social and personality research: Current practice and recommendations. Social Psychological and Personality Science, 8(4), 370–378.
Table 6.3: Stages of construct validation

Substantive phase
  Content validity. What topics should be included in the instrument based on theory and prior work? Examples: complete a literature review, talk with experts, conduct focus groups to explore local idioms.
  Item development. How should the construct be assessed? Example: conduct cognitive interviewing to ensure local understanding of item wording and response options.
Structural phase
  Item analysis. Are the instrument items working as intended? Examples: analyze patterns of responding, select items that discriminate between cases and non-cases.
  Factor analysis. How can the observed variables be summarized or represented by a smaller number of unobserved variables (factors)? Example: conduct exploratory and/or confirmatory factor analysis.
  Reliability. Are responses to a set of supposedly related items consistent within people and over time? Examples: examine internal consistency and test-retest correlations for evidence of stability.
  Measurement invariance. Does the instrument function equivalently across different groups or conditions? Examples: conduct multiple group factor analysis, item response theory analysis.
External phase
  Criterion validity. How well does the instrument predict or correlate with an external criterion or outcome of interest? Examples: establish concurrent validity, predictive/diagnostic validity.
  Construct validity. How well does the instrument measure the intended construct? Examples: establish convergent, discriminant, and known-groups validity.

Construct Validation Phase 1: Substantive

If adopting or adapting an existing instrument, you can still evaluate whether the instrument has evidence of content validity for your study setting and target population.

The first phase of developing a new instrument is to identify all of the relevant domains (the content) needed to fully assess the construct of interest. The process often starts with a review of the literature, conversations with experts, and potentially focus groups with members of the target population.

For instance, my colleagues and I conducted a study in rural Kenya where we examined how people understood depression in the context of pregnancy and childbirth (Green et al., 2018). We started by reviewing the scholarly literature on how depression was assessed in Kenya and elsewhere. Then we convened groups of women in our target population and asked them to describe what observable features characterize depression (sadness, or huzuni, in Swahili) during the perinatal period. Working with each group, we co-examined the overlap (and lack thereof) of their ideas and existing depression screening tools (see Figure 6.14). A group of Kenyan mental health professionals then gave feedback on the results based on their local clinical expertise.

Figure 6.14: Illustrative sorting of depression cover terms by focus groups, from Green et al. (2018).

Once you know the domains to include, you can proceed to create items that assess these domains. Your instrument will have evidence of content validity if you can demonstrate that it assesses all of the conceptually or theoretically relevant domains and excludes unrelated content.

Check out this helpful “how to” guide for cognitive interviewing.

As part of this process, it’s important to ensure that members of your target population understand the meaning of each item and the response scale. In the Kenya study, we used a technique called cognitive interviewing whereby we asked members of the target population to describe the meaning of each item and suggest improvements.

Construct Validation Phase 2: Structural

The structural phase comes after you collect pilot data from a sample drawn from your target population. In this phase, you’ll quantitatively evaluate how people respond to the items, identify items that appear to best represent the latent construct(s), and examine whether the items are measured consistently and equivalently across groups.

Item Analysis

A common initial exploratory data analysis practice is to plot the response distributions of each item. If you ask people to rate their agreement with a statement like, “I feel sad”, and if 100% of people in your sample respond “strongly agree”, the item has zero variance. When all or nearly all of your sample responds the same way to an item, that item tells you nothing useful. The appropriate next step is to drop the item or conduct additional cognitive interviewing to modify the item in a way that will elicit variation in responses.

You might decide to keep an item with low variability if it’s a critical item for clinical detection like suicidal ideation.

If you have data that enable you to plot response distributions by group, you can also examine the extent to which items distinguish between groups. Figure 6.15 shows a hypothetical example where 100 people responded to three items, each measured on a 4-point scale from “Never” to “Often”.

Figure 6.15: Visual example of item analysis.

item_1 has very little variability. Almost everyone responded “Never”. This item does not tell us much, and I might decide to drop or improve it. item_2 has more variability, but it does not distinguish between cases and non-cases. In combination with other variables it might be useful, so I might decide to keep it unless I need to trim the overall length of the questionnaire. item_3 looks the most promising. It elicits variability in responses, and a larger proportion of the Cases group endorsed the item.
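
A plot like Figure 6.15 can be mocked up with {ggplot2}. The data below are simulated to mimic the three patterns just described (they are not the figure’s actual data, and the intermediate response labels are my own).

# simulate 100 respondents: item_1 has almost no variance, item_2 varies but
# does not separate groups, item_3 varies and is endorsed more by cases
  set.seed(1)
  lvls <- c("Never", "Rarely", "Sometimes", "Often")
  items_long <- data.frame(
    group = rep(rep(c("Case", "Non-case"), each = 50), times = 3),
    item  = rep(c("item_1", "item_2", "item_3"), each = 100),
    response = factor(c(
      sample(lvls, 100, replace = TRUE, prob = c(.92, .05, .02, .01)),
      sample(lvls, 100, replace = TRUE),
      c(sample(lvls, 50, replace = TRUE, prob = c(.1, .2, .3, .4)),
        sample(lvls, 50, replace = TRUE, prob = c(.4, .3, .2, .1)))
    ), levels = lvls)
  )

# plot response distributions by item and case status
  library(ggplot2)
  ggplot(items_long, aes(x = response, fill = group)) +
    geom_bar(position = "dodge") +
    facet_wrap(~ item) +
    labs(x = NULL, y = "Count", fill = NULL)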

Factor Analysis

Factor analysis is a statistical method that helps us understand hidden patterns and relationships within a large set of data. It looks for commonalities among different variables and groups them into smaller, meaningful categories called factors. By doing this, factor analysis simplifies complex data and allows us to uncover the underlying dimensions or concepts that are influencing the observed patterns.

There are two main types of factor analysis: exploratory factor analysis, or EFA, and confirmatory factor analysis, or CFA. EFA is used when we have little prior knowledge about the underlying structure of the variables. It helps in identifying the number of factors and the pattern of relationships among variables.

On the other hand, CFA is conducted when we have a pre-specified hypothesis or theory about the factor structure and seeks to confirm whether the observed data align with the proposed model. CFA tests the fit of the predetermined model and assesses the validity of the measurement instrument.

Flora et al. (2017) discuss when you might use EFA vs CFA:

Flora, D. B. et al. (2017). The purpose and practice of exploratory and confirmatory factor analysis in psychological research: Decisions for scale development and validation. Canadian Journal of Behavioural Science, 49(2), 78.

Researchers developing an entirely new scale should use EFA to examine the dimensionality of the items…CFA should be used when researchers have strong a priori hypotheses about the factor pattern underlying a set of observed variables.

EFA Example

To get a glimpse of what this means, let’s imagine that, as members of the HAP study team, we created the BDI-II items from scratch and wanted to examine the dimensionality of the items. One way to do this is to use an R package like {lavaan} to fit EFA models with 1, 2, or 3 factors.

# fit EFA models with 1, 2, and 3 factors
  lavaan::efa(
    data = df,
    nfactors = 1:3,
    rotation = "oblimin",
    estimator = "WLSMV",
    ordered = TRUE
  )

Table 6.4 displays the factor loadings for the 2-factor model. Factor loadings represent the strength and direction of the relationship between observed variables (i.e., the BDI-II items) and the underlying factors. Think of factor loadings as indicators of how closely each variable is associated with a particular factor. Higher positive factor loadings suggest a strong positive relationship, indicating that the variable is more representative of that factor, while lower or negative factor loadings indicate a weaker or opposite association. These loadings help us understand which variables are most important in measuring a specific factor and contribute to our overall understanding of the underlying structure of the data.

Table 6.4: Factor loadings for BDI-II items in a 2-factor exploratory factor analysis model. Data from the HAP trial control arm at the 3-month endline. Loadings with an absolute value less than 0.40 are not shown.

Factor 1 (f1) loadings:
  bdi03 (Low energy): 0.71
  bdi06 (Sadness): 0.67
  bdi05 (Loss of interest): 0.66
  bdi02 (Tiredness): 0.54
  bdi10 (Agitation): 0.48
  bdi16 (Concentration): 0.46
  bdi08 (Loss of pleasure): 0.45
  bdi07 (Crying): 0.43

Factor 2 (f2) loadings:
  bdi14 (Failure): 0.89
  bdi13 (Guilty feelings): 0.78
  bdi20 (Suicidal thoughts): 0.73
  bdi15 (Worthlessness): 0.62
  bdi12 (Self-dislike): 0.60
  bdi19 (Pessimism): 0.57
  bdi18 (Punishment): 0.55
  bdi17 (Indecisiveness): 0.49
  bdi11 (Self-criticism): 0.45

No loading of 0.40 or greater on either factor: bdi01 (Sleep), bdi04 (Appetite), bdi09 (Irritability).

What we see is a cluster of items that load strongly on factor 1, a cluster of items that load strongly on factor 2, and a few items such as bdi04 (appetite) that are not strongly associated with either factor. The software does not know how to label these factors qualitatively, so it just names them f1 and f2. It’s up to us to examine the pattern of loadings and determine whether the factors have a clear meaning. My sense is that f1 captures the affective dimension of depression (e.g., sadness, crying), whereas f2 is about negative cognition (e.g., guilty feelings, self-dislike).

But is a 2-factor model the best way to represent the data? There are many ways we could try to answer this question, but unfortunately there is no consensus about what approach is best. In the code below that produces Figure 6.16, I use the Method Agreement procedure as implemented in the {parameters} package which (currently) looks across 19 different approaches and tallies the votes for the number of factors to extract. The winner is a 1-factor model.

# tally votes across methods for the number of factors to retain
  efa_n_factors <- parameters::n_factors(df)

# the plot method for this object comes from the {see} package
  library(see)
  plot(efa_n_factors)

Figure 6.16: Assessing the number of factors to retain for the EFA model.

CFA Example

Remember that for this EFA example, we fancied ourselves as HAP team members who created the BDI-II items from scratch. In reality, of course, the BDI-II has been around for a long time, and CFA is probably a better choice for our situation. While many research groups have proposed multi-factor solutions for the BDI-II (Beck et al., 1988), we know the instrument is scored as a 1-factor model (consistent with our EFA results, yay!). To demonstrate a strength of CFA, however, let’s compare two different 1-factor models: a parallel model where items are weighted equally (feeling sad contributes the same as feeling suicidal) and a congeneric model where items are optimally weighted.

# congeneric model (optimally weighted)
# https://osf.io/8fzj4
  model_1f_congeneric <- '
  # all loadings are uniquely estimated
  # first loading is set to 1 by default and must be freed
    total_factor =~ NA*bdi01 + bdi02 + bdi03 + bdi04 + bdi05 + 
                    bdi06 + bdi07 + bdi08 + bdi09 + bdi10 +
                    bdi11 + bdi12 + bdi13 + bdi14 + bdi15 + 
                    bdi16 + bdi17 + bdi18 + bdi19 + bdi20
                    
  # constrain factor variance to 1
    total_factor~~1*total_factor
  '

  fit_1f_congeneric <- lavaan::sem(
    model_1f_congeneric, data = df, ordered = TRUE
  )
  
# parallel model (equally weighted)
# https://osf.io/2gzty
  model_1f_parallel <- '
  # fix all factor loadings to 1
    total_factor =~ 1*bdi01 + 1*bdi02 + 1*bdi03 + 1*bdi04 + 1*bdi05 + 
                    1*bdi06 + 1*bdi07 + 1*bdi08 + 1*bdi09 + 1*bdi10 +
                    1*bdi11 + 1*bdi12 + 1*bdi13 + 1*bdi14 + 1*bdi15 + 
                    1*bdi16 + 1*bdi17 + 1*bdi18 + 1*bdi19 + 1*bdi20
    
  # constrain all residual variances to same value
    bdi01~~theta*bdi01
    bdi02~~theta*bdi02
    bdi03~~theta*bdi03
    bdi04~~theta*bdi04
    bdi05~~theta*bdi05
    bdi06~~theta*bdi06
    bdi07~~theta*bdi07
    bdi08~~theta*bdi08
    bdi09~~theta*bdi09
    bdi10~~theta*bdi10
    bdi11~~theta*bdi11
    bdi12~~theta*bdi12
    bdi13~~theta*bdi13
    bdi14~~theta*bdi14
    bdi15~~theta*bdi15
    bdi16~~theta*bdi16
    bdi17~~theta*bdi17
    bdi18~~theta*bdi18
    bdi19~~theta*bdi19
    bdi20~~theta*bdi20
  '
  
  fit_1f_parallel <- lavaan::sem(
    model_1f_parallel, data = df, ordered = TRUE
  )

Table 6.5 shows different metrics for evaluating the fit of the models to the data. Evaluating model fit is an advanced topic, so I’ll simply note that the parallel (equally weighted) model fit shows mixed results. The Comparative Fit Index (CFI) value is greater than 0.90 as recommended, but the root mean square error of approximation value (RMSEA), which should be low, is above the commonly used cutoff of 0.08. The congeneric (optimally weighted) model looks better based on the same fit indices. The likelihood ratio test (not shown) confirms this.

Table 6.5: Comparing fit indices between the parallel and congeneric models.

Parallel (equally weighted): npar = 62, chi-square = 704.52, CFI = 0.901, TLI = 0.900, AGFI = 0.871, RMSEA = 0.108
Congeneric (optimally weighted): npar = 80, chi-square = 263.26, CFI = 0.982, TLI = 0.980, AGFI = 0.947, RMSEA = 0.048
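
Fit indices like those in Table 6.5, along with the likelihood ratio test, can be pulled from the fitted models. A sketch, assuming the fit_1f_parallel and fit_1f_congeneric objects created in the code above:

# extract selected fit indices for each model
  lavaan::fitMeasures(
    fit_1f_parallel,
    c("npar", "chisq", "cfi", "tli", "agfi", "rmsea")
  )
  lavaan::fitMeasures(
    fit_1f_congeneric,
    c("npar", "chisq", "cfi", "tli", "agfi", "rmsea")
  )

# chi-square difference (likelihood ratio) test of the nested models
  lavaan::lavTestLRT(fit_1f_parallel, fit_1f_congeneric)
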
Measurement invariance

Measurement invariance is another aspect of measurement that we can investigate in a CFA framework. Measurement invariance means that an instrument consistently measures the same underlying construct across different groups or conditions. Let’s say we wanted to compare BDI-II scores by gender. Ideally, the BDI-II items will be equally valid and reliable for all genders, so that any differences observed in the scores truly reflect actual gender differences in depression, rather than differences caused by the instrument itself.
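
In {lavaan}, a basic invariance check can be sketched by fitting the same CFA to multiple groups and then constraining parameters to be equal across groups. This is a rough sketch only, assuming df includes a hypothetical gender column and that model is the 1-factor specification defined earlier; a full invariance analysis with ordered items involves additional steps (e.g., threshold constraints).

# configural model: same factor structure, parameters free across groups
  fit_configural <- lavaan::cfa(
    model, data = df, ordered = TRUE, std.lv = TRUE, group = "gender"
  )

# metric model: factor loadings constrained to be equal across groups
  fit_metric <- lavaan::cfa(
    model, data = df, ordered = TRUE, std.lv = TRUE, group = "gender",
    group.equal = "loadings"
  )

# a non-significant difference is consistent with loading invariance
  lavaan::lavTestLRT(fit_configural, fit_metric)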

Reliability

Classical test theory is a framework for evaluating and understanding the characteristics of tests and assessments, such as the BDI-II questionnaire. According to classical test theory, every observed score consists of two parts: the true score and measurement error. The true score represents a person’s actual level of the ability or trait being measured (e.g., depression severity), while measurement error includes the various factors that can introduce variability into observed scores.

Measurement error can be random or systematic. Random error is noise—unpredictable and inconsistent variations that occur in measurements. Even the most precise physics instruments have some random error, whether from human error or equipment limitations. When random error is high—that is, when the noise is greater than the signal—measurements are unreliable because they are inconsistent. Reliability refers to the consistency of measurements.

Systematic error, on the other hand, is bias, and bias in measurement takes us away from the true score in a particular direction. Systematic error results in artificially inflated or deflated scores. Measurement bias contributes to unreliability, but its main victim is validity.

A common teaching example is that a bathroom scale is valid if it correctly measures your weight, and is reliable if it gives you the same reading if you step off and back on. Validity and reliability work together to ensure sound measurement, but as you can see, they are independent concepts. A scale that consistently tells you, someone who weighs 65kg, that you weigh 80kg (±0.10kg) demonstrates reliable measurement, but not valid measurement. Consistency is reliability, regardless of being right or wrong.

For an in-depth look at reliability, see this chapter by William Revelle.

There is no one test of an instrument’s reliability because variation in measurement can come from many different sources: items, time, raters, form, etc. Therefore, we can assess different aspects of reliability of measurement, including test-retest reliability, internal consistency reliability, and inter-rater reliability, to name a few.

Reliability: Test-retest

An instrument is said to exhibit good test-retest reliability if it maintains roughly the same ordering between people when the instrument is repeatedly administered to the same individuals under the same conditions. Often this is benchmarked as a correlation of at least 0.70.

When I first learned about test-retest reliability, I thought it meant that an instrument would return the same answer from one assessment to the next in the absence of change—that perfect reliability meant identical scores at time 1 and time 2. That’s not quite right though. Two sets of scores obtained from a group of people can be perfectly reliable, meaning they preserve the group order perfectly and have a correlation coefficient of 1.0, but also have zero agreement. I created Figure 6.17 to make this clear.

Figure 6.17: Test-retest reliability and agreement are not the same.

Each panel plots mock BDI-II sum scores for 10 people measured four days apart. An individual’s point falls on the diagonal line if they have identical scores at time 1 and time 2.

Panel 1 illustrates perfect reliability and perfect agreement. The correlation coefficient is 1.0, and the ordering of people is perfectly preserved. Now look at Panel 2. This is also a scenario with perfect reliability; the ordering is preserved, but no one has identical scores at time 1 and time 2. There is zero agreement because there is a time effect!

Panel 3 is more typical of what researchers present in papers claiming an instrument is reliable. In this example, there is very little agreement in scores from time 1 to time 2, but for the most part the ordering of depression severity is unchanged among the sample.

Finally, in Panel 4 I simulated random values for time 1 and time 2, so any agreement is by chance. Reliability is low (0.43), and the ordering is not preserved. If your instrument shows evidence of low test-retest reliability like this over a short period where you expect stable scores, you have to wonder what the instrument is measuring.
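
A quick simulation (not the code behind Figure 6.17) makes the distinction concrete: if everyone’s score shifts by the same amount between administrations, the correlation is perfect even though no one has identical scores.

set.seed(1)
time1 <- sample(5:35, size = 10)   # mock BDI-II sum scores for 10 people
time2 <- time1 + 5                 # every score shifts by 5 points at time 2

cor(time1, time2)      # reliability as ordering: exactly 1
mean(time1 == time2)   # agreement as identical scores: 0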

Test-retest period

A key decision in assessing test-retest reliability is how long to wait in between administrations. If your instrument assesses depression symptoms in the past 2 weeks, you can’t wait more than 2 weeks because your respondent’s frame of reference will change too much and symptoms of depression can come and go. On the other hand, you shouldn’t re-administer the instrument the same day or even the next day because your respondent will likely recall what they said in an effort to appear consistent.

Reliability: Internal consistency

Repeating administrations of an instrument with the same people to assess test-retest reliability is not always feasible, so in 1951 the educational psychologist Lee Cronbach came up with a way to measure what he called internal consistency reliability in a single administration. Rather than measuring the correlation in scale scores at multiple time points, internal consistency reliability evaluates how closely the scale items ascertained in a single administration are related to each other. If the items are not highly correlated with each other, it’s unlikely that they are measuring the same latent construct. In other words, the items are not internally consistent when it comes to measuring the construct.

The most commonly used index of internal consistency reliability is Cronbach’s alpha, but it has many detractors (McNeish, 2018). It’s still widely used because it’s easy to calculate with any statistics software. The basic idea is to compare the mean covariance between items to the mean item variance:

\[\alpha = \frac{N\bar{c}}{\bar{v} + (N-1)\bar{c}}\]

where \(N\) equals the number of items, \(\bar{c}\) is the mean covariance between items, and \(\bar{v}\) is the mean item variance. This means that Cronbach’s alpha quantifies how much of the overall variability is due to the relationships between the items. It can range from 0 to 1, where 1 indicates perfect reliability.
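
To make the formula concrete, here is a hand-rolled version you could run on the item-level data. It’s a sketch for intuition; in practice you would rely on a package like {psych}, as shown below.

# Cronbach's alpha from the mean inter-item covariance and mean item variance
cronbach_alpha <- function(items) {
  V     <- cov(items, use = "pairwise.complete.obs")  # item covariance matrix
  N     <- ncol(items)                                # number of items
  v_bar <- mean(diag(V))                              # mean item variance
  c_bar <- mean(V[lower.tri(V)])                      # mean inter-item covariance
  (N * c_bar) / (v_bar + (N - 1) * c_bar)
}

cronbach_alpha(df)  # should match the raw alpha from psych::alpha() below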

Use the Kuder-Richardson 20 formula (KR20) if you have binary variables.

Before we calculate Cronbach’s alpha, let’s glimpse the HAP data. As a reminder, this data comes from participants in the control arm, collected three months after the treatment arm completed the HAP program.

Refer to Figure 6.8 to see a visualization of the item correlation matrix.

# A tibble: 236 × 20
   bdi01 bdi02 bdi03 bdi04 bdi05 bdi06 bdi07 bdi08 bdi09 bdi10 bdi11 bdi12 bdi13
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     2     1     2     1     1     3     3     3     0     3     2     2     2
 2     2     2     1     0     1     3     3     3     0     2     3     1     2
 3     3     3     3     3     3     3     3     2     3     3     3     2     2
 4     1     2     2     0     1     3     3     3     3     3     3     1     0
 5     1     3     3     1     3     3     3     3     3     3     3     2     2
 6     2     3     2     1     1     1     3     2     1     2     3     2     2
 7     2     2     3     2     3     3     3     3     3     3     3     3     3
 8     3     3     3     2     2     3     0     2     3     2     3     1     3
 9     2     3     3     3     3     2     3     3     3     3     3     1     3
10     3     2     1     2     3     3     3     3     3     3     3     3     3
# ℹ 226 more rows
# ℹ 7 more variables: bdi14 <dbl>, bdi15 <dbl>, bdi16 <dbl>, bdi17 <dbl>,
#   bdi18 <dbl>, bdi19 <dbl>, bdi20 <dbl>

Now we can use the {psych} package to calculate Cronbach’s alpha (William Revelle, 2023). We get a lot of output, so focus first on the top line with the raw alpha value of 0.87. Most reviewers are trained to question your approach if you report an alpha value lower than 0.70, so some researchers will look to see if alpha can be improved by dropping any items. That doesn’t appear to be the case in the HAP data.

William Revelle. (2023). Psych: Procedures for psychological, psychometric, and personality research. Northwestern University.
psych::alpha(df)

Reliability analysis   
Call: psych::alpha(x = df)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean   sd median_r
      0.87      0.87    0.89      0.26   7 0.012  1.4 0.66     0.26

    95% confidence boundaries 
         lower alpha upper
Feldt     0.85  0.87   0.9
Duhachek  0.85  0.87   0.9

 Reliability if an item is dropped:
      raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
bdi01      0.87      0.87    0.89      0.27 6.9    0.012 0.0115  0.27
bdi02      0.87      0.87    0.89      0.27 6.9    0.012 0.0108  0.26
bdi03      0.87      0.87    0.88      0.26 6.7    0.012 0.0113  0.26
bdi04      0.88      0.88    0.89      0.27 7.1    0.012 0.0098  0.27
bdi05      0.87      0.87    0.89      0.27 6.9    0.012 0.0111  0.26
bdi06      0.86      0.86    0.88      0.25 6.3    0.013 0.0109  0.25
bdi07      0.87      0.87    0.88      0.26 6.5    0.012 0.0114  0.26
bdi08      0.87      0.87    0.88      0.25 6.5    0.012 0.0115  0.25
bdi09      0.87      0.87    0.89      0.26 6.8    0.012 0.0114  0.26
bdi10      0.87      0.87    0.88      0.26 6.5    0.012 0.0114  0.25
bdi11      0.87      0.87    0.88      0.26 6.6    0.012 0.0118  0.25
bdi12      0.87      0.87    0.88      0.26 6.5    0.012 0.0111  0.25
bdi13      0.87      0.87    0.88      0.26 6.5    0.012 0.0108  0.25
bdi14      0.86      0.86    0.88      0.25 6.4    0.013 0.0100  0.25
bdi15      0.87      0.87    0.88      0.25 6.5    0.012 0.0110  0.25
bdi16      0.86      0.87    0.88      0.25 6.4    0.013 0.0113  0.25
bdi17      0.86      0.86    0.88      0.25 6.3    0.013 0.0107  0.25
bdi18      0.87      0.87    0.88      0.26 6.7    0.012 0.0114  0.26
bdi19      0.87      0.87    0.88      0.26 6.6    0.012 0.0113  0.26
bdi20      0.87      0.87    0.88      0.26 6.6    0.012 0.0113  0.25

 Item statistics 
        n raw.r std.r r.cor r.drop mean   sd
bdi01 236  0.42  0.43  0.37   0.34 1.37 1.12
bdi02 236  0.40  0.41  0.36   0.33 1.89 1.15
bdi03 236  0.48  0.49  0.45   0.41 1.42 1.07
bdi04 236  0.31  0.32  0.25   0.23 1.09 1.07
bdi05 236  0.43  0.43  0.39   0.35 1.39 1.27
bdi06 236  0.68  0.68  0.67   0.62 1.85 1.20
bdi07 236  0.61  0.59  0.57   0.53 1.58 1.41
bdi08 236  0.60  0.60  0.58   0.54 1.47 1.21
bdi09 236  0.46  0.47  0.42   0.39 0.89 1.19
bdi10 236  0.59  0.59  0.56   0.52 1.73 1.28
bdi11 236  0.54  0.53  0.50   0.46 1.89 1.38
bdi12 236  0.58  0.59  0.56   0.52 1.28 1.18
bdi13 236  0.58  0.59  0.56   0.52 1.19 1.22
bdi14 236  0.66  0.66  0.65   0.60 1.24 1.27
bdi15 236  0.60  0.60  0.58   0.53 1.06 1.24
bdi16 236  0.63  0.63  0.61   0.57 1.83 1.17
bdi17 236  0.68  0.67  0.66   0.62 1.57 1.31
bdi18 236  0.54  0.52  0.49   0.45 1.27 1.42
bdi19 236  0.55  0.54  0.51   0.47 1.29 1.28
bdi20 236  0.52  0.55  0.51   0.48 0.43 0.65

Non missing response frequency for each item
         0    1    2    3 miss
bdi01 0.32 0.17 0.31 0.19    0
bdi02 0.19 0.14 0.25 0.42    0
bdi03 0.22 0.38 0.17 0.23    0
bdi04 0.39 0.26 0.20 0.14    0
bdi05 0.36 0.22 0.10 0.32    0
bdi06 0.19 0.25 0.10 0.47    0
bdi07 0.42 0.05 0.08 0.46    0
bdi08 0.33 0.16 0.23 0.28    0
bdi09 0.55 0.22 0.02 0.21    0
bdi10 0.29 0.11 0.17 0.42    0
bdi11 0.33 0.03 0.07 0.57    0
bdi12 0.40 0.12 0.29 0.19    0
bdi13 0.45 0.13 0.21 0.21    0
bdi14 0.44 0.16 0.13 0.28    0
bdi15 0.52 0.14 0.12 0.22    0
bdi16 0.22 0.11 0.27 0.39    0
bdi17 0.33 0.17 0.10 0.40    0
bdi18 0.53 0.05 0.04 0.38    0
bdi19 0.42 0.15 0.14 0.29    0
bdi20 0.63 0.33 0.01 0.03    0

Now that I’ve shown you Cronbach’s alpha, I’ll let you know that many measurement researchers will tell you that we should move on from it, or at least use it more critically (McNeish, 2018). A good alternative is omega. Omega isn’t as commonly included in software programs as Cronbach’s alpha is, but several R packages will estimate one or more variants of omega; below I use ci.reliability() from the {MBESS} package, which returns an omega estimate with a confidence interval. See McNeish (2018) and Revelle (2023) for in-depth tutorials.

McNeish, D. (2018). Thanks coefficient alpha, we’ll take it from here. Psychological Methods, 23(3), 412.
Revelle, W. (2023). Using R and the psych package to find ω.

Cronbach’s alpha is also known as an index of tau-equivalent reliability because one assumption is that each item contributes equally to the total score.

MBESS::ci.reliability(df)
$est
[1] 0.8770798

$se
[1] 0.01158661

$ci.lower
[1] 0.8543705

$ci.upper
[1] 0.8997891

$conf.level
[1] 0.95

$type
[1] "omega"

$interval.type
[1] "robust maximum likelihood (wald ci)"

Cronbach’s alpha will often underestimate reliability, but in the HAP data we see that the omega estimate, 0.88, is pretty close to the alpha value, 0.87. Regardless of the method, it looks like the BDI-II items consistently measured the same construct in the trial sample.

I get a slightly different alpha value than what Patel et al. (2017) report because I’m only using data from the control arm for these examples.

Reliability: Inter-rater

Another form of reliability is inter-rater reliability, a measure of the extent to which two (or more) observers (raters) agree on what they observe. Imagine that you and I watch a new movie and a friend asks us for our reviews. I say it was ‘just OK’, but you say the movie was ‘great’. Our ratings are not reliable (not consistent).

Now you might say that a problem with this example is that movie reviews are subjective, and you’d be right. But this is true of many phenomena we might want to assess with an observational instrument.

For instance, in the HAP trial researchers wanted to document whether the therapy sessions were implemented with fidelity—did lay counselors deliver the program as the intervention designers intended? If not, then the trial results would be hard to interpret, especially if the trial found that HAP is not efficacious. But how do we know if a counselor delivered the therapy with fidelity to the design? That is to say, how do we rate therapy quality?

A common approach, employed in the HAP trial, is to train staff to observe a random sample of sessions (live or recorded) and rate the counselors using a standard observational rating system (instrument). In the HAP trial, sessions were rated by fellow lay counselors (peers), supervisors (experts), and an independent observer. Observers rated audio recordings of selected sessions using a 25-item instrument called the Quality of the Healthy Activity Programme (Q-HAP) scale (Singla et al., 2014). The Q-HAP consists of 15 treatment-specific items and 10 general skills items, each rated by observers on a 5-point scale from ‘not done’ (0) to ‘excellent’ (4). See Figure 6.18 for an example.

Figure 6.18: Quality of the Healthy Activity Programme instrument.

The key reliability issue is to establish that the observers are consistent raters. For instance, when listening to the same session, do different observers see and record the same things? If yes, they are reliable observers and can be trusted to rate sessions independently, increasing the number of sessions a team can rate.

Remember, reliability is not the same thing as validity. Whether the Q-HAP is a valid measure of therapy quality is a different question. The Q-HAP developers do not really engage with this question, writing, “Regarding validity, however, because scales were derived from instruments which are used by other psychological treatment researchers worldwide, we have assumed that they possess a degree of validity by extension.”

McGraw, K. O. et al. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30.

Singla et al. (2014) quantified inter-rater reliability with the intra-class correlation coefficient, or ICC. The ICC is an index of how much of the variability in the measurements is due to true differences between subjects/items and how much is due to disagreement among raters. There are 10 different forms of ICC (McGraw et al., 1996), and it’s important to pick the right one for your purpose.

Singla et al. (2014) calculated ICC(2,3), which assesses the reliability of measurements made by multiple observers on multiple subjects/items. The ‘2’ refers to a two-way model in which the raters are treated as a random sample from a larger pool of potential raters, and the ‘3’ indicates that we are estimating the reliability of the average rating across three raters. ICC values can range from 0 to 1, where values closer to 1 indicate high agreement.

Singla, D. R. et al. (2014). Improving the scalability of psychological treatments in developing countries: An evaluation of peer-led therapy quality assessment in goa, india. Behaviour Research and Therapy, 60, 53–59.

Singla et al. (2014) reported that the peer observers had moderate agreement on the Q-HAP treatment-specific subscale (ICC(2,3) = .616, N = 97) and the general skills subscale (ICC(2,3) = .622, N = 189). The results for expert observers were similar.
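
If you want to estimate ICCs yourself, the ICC() function in the {psych} package reports the common forms from a ratings matrix. Here is a sketch with made-up data (rows are sessions, columns are raters); in the output, ICC2k with three raters corresponds to ICC(2,3).

library(psych)

set.seed(2)
true_quality <- rnorm(20, mean = 3, sd = 0.5)  # 20 sessions' true quality
ratings <- sapply(1:3, function(r) true_quality + rnorm(20, sd = 0.3))
colnames(ratings) <- c("rater1", "rater2", "rater3")

ICC(ratings)  # ICC2k = reliability of the average rating across the 3 raters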

Construct Validation Phase 3: External

The final phase of developing a new instrument is to evaluate how it performs against external criteria or in relation to existing tools. Two main buckets of validity in this phase are construct validity and criterion validity.

Construct validity

Construct validity is a framework for evaluating whether an instrument measures the intended theoretical construct. Let’s review several types of construct validity, including convergent, discriminant, and known groups.

Construct: Convergent and discriminant validity

Psychologists in particular like to think about nomological validity and talk in terms of convergent and discriminant validity. Establishing evidence for nomological validity means showing that your new instrument is positively correlated with theoretically related constructs (convergent validity) and uncorrelated (or only weakly correlated) with theoretically unrelated constructs (discriminant validity).

‘Nomological’ from the Greek ‘nomos’ meaning ‘law’. The nomological network is the theoretical map of how different constructs are related. For instance, in a nomological network about mental health, depression and anxiety are distinct disorders but have theoretical overlap and are often comorbid.

For instance, a measure of depression should not be strongly associated with a measure of narcissism because we don’t consider depression and narcissism to be theoretically linked. But we would expect a measure of depression to be associated to some degree with a measure of anxiety because they are often co-morbid, meaning that people with depression symptoms also commonly report symptoms of anxiety.

In practice, when developing or adapting an instrument, you should design a validation study that includes measures of conceptually related and unrelated constructs so you can evaluate if the correlations are as predicted. If not, you have more work ahead.
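
In code, the core check is just a correlation matrix among total scores. The data frame and column names below are hypothetical; the point is the pattern of correlations you would expect to see.

# Hypothetical validation study data with one row per participant
cors <- cor(validation_data[, c("bdi_total", "anxiety_total", "narcissism_total")],
            use = "pairwise.complete.obs")
round(cors, 2)

# Expectation: bdi_total correlates positively with anxiety_total (convergent)
# and only weakly, if at all, with narcissism_total (discriminant)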

Construct: Known groups validity

Another method for establishing evidence of construct validity is via known groups validity. With known groups validity you examine whether your instrument distinguishes between groups of people who are already known to differ on the construct of interest. For example, if you were developing a new measure of family conflict, you could recruit families who have been referred for services and a comparison group of families who have not been referred, administer your new questionnaire to both groups, and compare the group-level results. Known groups validity would predict that referred families would score higher on average on your measure of family conflict compared to families not referred for support. See Puffer et al. (2021) for this approach to recruitment (but not analysis).

Puffer, E. S. et al. (2021). Development of the family togetherness scale: A mixed-methods validation study in kenya. Frontiers in Psychology, 12, 662991.

Known groups validity is similar to diagnostic accuracy, discussed below as a type of criterion validity. A key difference is that diagnostic accuracy involves analyzing classification predictions for individuals whereas known groups methods look at differences in group averages.
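
A minimal known-groups analysis is just a comparison of group means. The variables below (a referred indicator and a conflict score in a hypothetical family_data data frame) are placeholders for the kind of study described above.

# Compare mean conflict scores for referred vs. non-referred families
t.test(conflict_score ~ referred, data = family_data)

# Evidence of known groups validity: referred families score meaningfully
# higher, in the direction the construct predicts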

Criterion validity

Criterion validity assesses how well an instrument relates to an external criterion or a gold standard. There are three main types of criterion validity: concurrent validity, predictive validity, and diagnostic accuracy.

Criterion: Concurrent validity

Concurrent validity examines the relationship between the new instrument and some criterion that is assessed at the same time. For example, Beck et al. (1988) reviewed 35 studies that compared BDI scores to four existing measures of depression and reported strong, positive correlations.

Beck, A. T. et al. (1988). Psychometric properties of the beck depression inventory: Twenty-five years of evaluation. Clinical Psychology Review, 8(1), 77–100.

Concurrent and convergent validity are similar. They both involve comparing the new instrument to other instruments measured at the same time. Concurrent compares to an existing gold standard measure of the same construct, while convergent compares to measures of related constructs. We’re splitting hairs in my view.

Criterion: Predictive validity

Predictive validity assesses how well an instrument predicts future outcomes or behaviors. A common application is evaluating the utility of a new hiring test to identify job candidates most likely to succeed in a particular role. A test has good predictive ability if the results correlate with job performance measured at a later time.

Criterion: Diagnostic accuracy

Diagnostic accuracy refers to the ability of an instrument to correctly identify individuals with or without a particular condition or disease. The new instrument under investigation is referred to as the index test, and the existing gold standard test is known as the criterion. For instance, if you are developing a new rapid diagnostic test for a bacterial infection that returns results in minutes, your rapid test would be the index test and the bacteria culture test would be the criterion or the gold standard.

Returning to an earlier example, Green et al. (2018) developed a new perinatal depression screening questionnaire in Kenya—the Perinatal Depression Screening (PDEPS)—and evaluated its diagnostic accuracy by comparing PDEPS scores (the index test) to the results of separate clinical interviews conducted by Kenyan counselors (gold standard) who were blind to women’s questionnaire data. In this study, 193 pregnant and postpartum women completed the new screening questionnaire and separately participated in a clinical interview within three days. Clinical interviewers identified 10/193 women who met diagnostic criteria for Major Depressive Episode.

Figure 6.19: Example confusion matrix.

Figure 6.19 displays a confusion matrix for the study. The analysis found that a score of 13 or greater on the PDEPS correctly identified 90% of true cases. It only missed 1 out of 10 cases. This is known as sensitivity or the true positive rate. This cutoff on the PDEPS also correctly identified 90% of non-cases. This is known as specificity or the true negative rate. A score of 13 is the optimal cutoff that maximizes sensitivity and specificity.

For some applications or in certain settings you might choose to prioritize sensitivity over specificity, or vice versa. For instance, if false negatives are very costly for individuals and society, you might choose a cutoff that favors high sensitivity.
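
The arithmetic behind Figure 6.19 is simple enough to sketch by hand. Only the 9 true positives and 1 false negative are given in the text above; the other two cells below are hypothetical values consistent with roughly 90% specificity among the 183 non-cases.

tp <- 9    # screened positive, clinical interview positive
fn <- 1    # screened negative, clinical interview positive
fp <- 18   # hypothetical: screened positive, interview negative
tn <- 165  # hypothetical: screened negative, interview negative

c(sensitivity = tp / (tp + fn),   # true positive rate
  specificity = tn / (tn + fp))   # true negative rate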

CROSS-CULTURAL VALIDITY

If you are not yet ready to agree with my opening statement that “measurement is one of the hardest parts of science”, please allow me to tell you what else you need to consider if you wish to develop or adapt instruments for use in cross-cultural contexts.

Kohrt et al. (2011) is a fantastic guide for our discussion. This paper is motivated by the fact that there are few culturally adapted and validated instruments for assessing child mental health in low-income settings. This lack of instruments is a barrier to designing, delivering, and assessing services for children and families. To remedy this situation, the authors propose six criteria for evaluating the cross-cultural validity of instruments that are applicable to any topic.

  1. What is the purpose of the instrument?
  2. What is the construct to be measured?
  3. What are the contents of the construct?
  4. What are the idioms used to identify psychological symptoms and behaviors?
  5. How should questions and responses be structured?
  6. What does a score on the instrument mean?

What is the purpose of the instrument?

The authors start by reminding us that validity is not an inherent property of an instrument, adding that validity can vary by setting, population, and purpose. The last point is often overlooked. It’s important to design instruments to be fit for purpose. If your objective is, let’s say, to measure the efficacy of an intervention, adapting an instrument validated for measuring the prevalence of disease may not be ideal. Start by defining your purpose.

What is the construct to be measured?

We covered this question about construct validity extensively in the previous section, but it’s worth noting the authors’ distinction between three different types of constructs: (i) local constructs, (ii) Western psychiatric constructs, and (iii) cross-cultural constructs.

In their formulation, local constructs are the unique ways that the target population conceptualizes an issue. For instance, in the perinatal depression study mentioned earlier (Green et al., 2018), a local construct that emerged in focus group discussions was that women experiencing depression often feel like they just want to go back to their maternal home.7 This is not a characteristic of depression that you will find in any commonly used screening instruments, but it has relevance to this population.

Green, E. P. et al. (2018). Developing and validating a perinatal depression screening tool in kenya blending western criteria with local idioms: A mixed methods study. Journal of Affective Disorders, 228, 49–59.

Local constructs are also known as idioms of distress or culture-bound syndromes.

Western psychiatric constructs—which for our purposes we’ll recast more broadly as standard global health indicators—have their origin outside of the target population and may or may not be relevant. Kohrt et al. (2011) give the example of posttraumatic stress disorder as a Western construct with no perfectly synonymous concepts in Nepal. In contrast, cross-cultural constructs are universally recognized phenomena with some degree of shared meaning across settings and cultural groups.

What are the contents of the construct?

Answering this question about content validity requires a close inspection of the instrument’s elements for relevancy. For instance, Kohrt et al. (2011) found that a common item on ADHD screening instruments in high-income countries, “stands quietly when in line”, is not applicable in Nepal because this behavior is not a universal expectation of children. Thus it’s not a sign of hyperactivity in this context and should not be included in a screening instrument.

Whether you are creating a new instrument or adapting an existing measure, a few rounds of qualitative research will help you to explore which elements like this should be included or dropped. See the previous example linked to Figure 6.14.

What are the idioms used to identify psychological symptoms and behaviors?

Once you know what domains an instrument should assess, it’s important to get the language right. Translation is never sufficient for establishing validity, but it’s a critical piece of the process.

A standard recommendation is to conduct forward translation, blinded back translation, and reconciliation. In this approach, a translator fluent in both languages translates the items from the original language to the target language. Next, a new linguist translates the target language product back to the original language, but does so blinded, without seeing the original text. In the final step, the translators work together to compare the original text and the back-translated text, resolve disagreements, and improve the translations to ensure semantic equivalence.

How should questions and responses be structured?

Another important consideration is technical equivalence, or ensuring that the possible response sets are understood in the same way across groups or settings. For instance, if adapting an instrument that asks people to respond on a 4-point scale—never, rarely, sometimes, often—it’s helpful to verify that this format is understood by the target population once translated. Cognitive interviewing is a good approach.

What does a score on the instrument mean?

This final question comes up in the external phase of construct validation that we discussed previously. There is not much more to say here, but Kohrt et al. (2011) make a point about the lack of criterion validation that bears repeating for anyone interested in conducting a prevalence study:

Kohrt, B. A. et al. (2011). Validation of cross-cultural child mental health and psychosocial research instruments: Adapting the depression self-rating scale and child PTSD symptom scale in nepal. BMC Psychiatry, 11(1), 1.

Ultimately, for prevalence studies, diagnostic validation is crucial. The misapplication of instruments that have not undergone diagnostic validation to make prevalence claims is one of the most common errors in global mental health research.

THREATS TO CONSTRUCT VALIDITY

Shadish et al. (2002) outlined 14 threats to construct validity that are presented in Table 6.6. See Matthay et al. (2020) for a translation of these threats to DAGs. I find that I often return to this list when designing a new study to consider whether there are threats lurking in my measurement strategy.

Shadish, W. R. et al. (2002). Experimental and quasi-experimental designs for generalized causal inference. Cengage Learning.
Matthay, E. C. et al. (2020). A graphical catalog of threats to validity: Linking social science with epidemiology. Epidemiology, 31(3), 376.
Table 6.6: Threats to construct validity, adapted from Shadish et al. (2002) and Matthay and Glymour (2020).
Inadequate explication of constructs: Failure to adequately explicate a construct may lead to incorrect inferences about the causal relationship of interest.
Construct confounding: Exposures or treatments usually involve more than one construct, and failure to describe all the constructs may result in incomplete construct inference.
Confounding constructs with levels of constructs: Inferences made about the constructs in a study fail to respect the limited range of the construct that was actually studied, i.e., effect estimates are extrapolated beyond the range of the observed data.
Mono-operation bias: Any one operationalization (measurement or intervention implementation) of a construct both underrepresents the construct of interest and measures irrelevant constructs, complicating the attribution of observed effects.
Mono-method bias: When all operationalizations (measurements or intervention implementations) use the same method (e.g., self-report), that method is part of the construct actually studied.
Treatment-sensitive factorial structure: The structure of a measure may change as a result of treatment. This change may be hidden if the same scoring is always used.
Reactive self-report changes: Self-reports can be affected by participant motivation to be in a treatment condition. This motivation may change after assignment is made.
Compensatory equalization: When treatment provides desirable goods or services, administrators, staff, or constituents may provide compensatory goods or services to those not receiving treatment. This action must then be included as part of the treatment construct description.
Compensatory rivalry: Participants not receiving treatment may be motivated to show they can do as well as those receiving treatment. This action must then be included as part of the treatment construct description.
Resentful demoralization: Participants not receiving a desirable treatment may be so resentful or demoralized that they may respond more negatively than otherwise. This response must then be included as part of the treatment construct description.
Reactivity to the experimental situation: Participant responses reflect not just treatments and measures but also participants’ perceptions of the experimental situation. These perceptions are part of the treatment construct actually tested.
Experimenter expectancies: The experimenter can influence participant responses by conveying expectations about desirable responses. These expectations are part of the treatment construct as actually tested.
Novelty and disruption effects: Participants may respond unusually well to a novel innovation or unusually poorly to one that disrupts their routine. This response must then be included as part of the treatment construct description.
Treatment diffusion: Participants may receive services from a condition to which they were not assigned, making construct descriptions of both conditions more difficult.

6.6 Measurement, Schmeasurement

Two Psychologists, Four Beers, Episode 32, Measurement Schmeasurement.

“Measurement, Schmeasurement” is the title of a great paper by Flake et al. (2020). The subtitle is, “Questionable Measurement Practices and How to Avoid Them”. The authors’ thesis is that measurement is a critical part of science, but questionable measurement practices undermine the validity of many studies and ultimately slow the progress of science. They define questionable measurement practices as:

decisions researchers make that raise doubts about the validity of measure use in a study, and ultimately the study’s final conclusions.

These decisions, they argue, stem from ignorance, negligence, and in some cases misrepresentation. According to Flake et al. (2020), questionable measurement practices live on because many researchers have an attitude of, “measurement, schmeasurement”. In other words, who cares.

If you are not a native English speaker, measurement, schmeasurement might be a confusing phrase. It’s an example of shm-reduplication used to indicate lack of interest or derision. Science, schmiance. See ghr.link/shm for more examples.

I’m persuaded by this argument, having been in the room when investigators have spent weeks thinking about research design only to uncritically “throw in” a bunch of measures at the end. The thinking is often, why not, we’re going to the trouble of doing the study, let’s measure everything. Sometimes a student or colleague needs a project, and they’re invited to add something to the survey battery. In situations like this, the validity of measurement is never at the forefront. Doing it right is hard, and hey, measurement, schmeasurement, right?

Flake et al. (2020) believe that fixing this problem begins with greater transparency and better reporting about measurement decisions. They also offer six questions to consider in the design phase and to report in publications:

Flake, J. K. et al. (2020). Measurement schmeasurement: Questionable measurement practices and how to avoid them. Advances in Methods and Practices in Psychological Science, 3(4), 456–465.
  1. What is your construct?
  2. Why and how did you select your measure?
  3. What measure did you use to operationalize the construct?
  4. How did you quantify your measure?
  5. Did you modify the scale? And if so, how and why?
  6. Did you create a measure on the fly? If so, justify your decision and report all measurement details for the new measure along with any validity evidence.

If you ask and answer these questions for your next study, I’m convinced that you will improve your measurement strategy and thus strengthen the validity of your conclusions.


  1. This is true in many other nations as well.↩︎

  2. The coding is designed to match the International Classification of Diseases (ICD-10) coding system.↩︎

  3. This is not bad considering these are global estimates!↩︎

  4. The authors don’t indicate this in their manuscript.↩︎

  5. Non-compliance is not limited to designs with a treatment group and a control group. It also applies to studies comparing two active treatments.↩︎

  6. Patel et al. (2017) excluded one item about sex for cultural reasons, so they used a modified BDI-II with 20 of the 21 questions.↩︎

  7. It’s customary for a woman in this culture to marry and live with her husband’s family.↩︎