Which type of validity would involve measuring whether a variable correlates with an expected outcome?

Principles of clinical outcome assessment

Nicholas Bellamy, in Rheumatology (Sixth Edition), 2015

Criterion validity

Criterion validity is assessed by statistically testing a new measurement technique against an independent criterion or standard (concurrent validity) or against a future standard (predictive validity). Criterion validity is an estimate of the extent to which a measure agrees with a gold standard (i.e., an external criterion of the phenomenon being measured). The major problem in criterion validity testing, for questionnaire-based measures, is the general lack of gold standards. Indeed, some purported gold standards may not themselves provide completely accurate estimates of the true value of a phenomenon. In contrast, electromechanical devices such as those evaluating strength, resistance, and range of movement can more often be validated using standard calibration techniques found in the engineering literature.

URL: https://www.sciencedirect.com/science/article/pii/B9780323091381000024

Assessment

Yiyun Shou, ... Hui-Fang Chen, in Comprehensive Clinical Psychology (Second Edition), 2022

4.02.5.1.3 Criterion Validity

Criterion validity indicates how well the scores or responses of a test converge with criterion variables with which the test is supposed to converge (Cronbach and Meehl, 1955). There are numerous uses and definitions of criterion validity in the literature, depending on how one defines "criterion variables", and the term is often conflated with several other key validity terms such as concurrent validity or convergent validity. For the purposes of the present article, criterion variables are defined as other measures of the same construct, conceptually relevant constructs, or conceptually relevant behaviors or performances. Criterion validity is concurrent when the scores of a test and criterion variables are obtained at the same time (often called concurrent validity), or predictive/postdictive when the criterion variables are measured after/before the current test (often called predictive/postdictive validity) (Grimm and Widaman, 2012).

Criterion validity can be tested in various situations and for various purposes. For example, a psychologist may wish to propose a shorter form of a test to replace the original, longer test for the purpose of greater time efficiency. Criterion and concurrent validity of the short form can be demonstrated by its correlation with the original test. A psychologist may wish to evaluate a self-report test of a mental disorder, and the concurrent validity of the test can be assessed by comparing the scores of the test with a clinical diagnosis that is made at the same time. For predictive criterion validity, researchers often examine how the results of a test, such as intelligence or depression, predict a highly relevant outcome, such as educational achievement or suicide attempts, assessed at some point in the future. Bivariate correlations between the scores of the test and criterion variables are often used to evaluate criterion validity, whereas regression methods can be used if researchers would like to control for background variables, such as gender and age, when examining how well the test scores predict the criterion variable.
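As a minimal sketch of the bivariate approach just described (the scores and the function name are invented for illustration; a real analysis would typically use a statistics package and also report significance), Pearson's r between a hypothetical short form and the original long test could be computed as:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores: short-form test vs. the original long test
short_form = [10, 12, 15, 18, 20, 23]
long_test  = [41, 44, 52, 60, 63, 70]

r = pearson_r(short_form, long_test)  # close to +1 for these made-up scores
```

A coefficient near +1 would support the short form's concurrent validity; controlling for background variables such as age and gender would require the regression methods mentioned above.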

URL: https://www.sciencedirect.com/science/article/pii/B9780128186978001102

Survey Research Methods

A. Fink, in International Encyclopedia of Education (Third Edition), 2010

Criterion Validity

Criterion validity compares responses to future performance or to responses obtained from other, better-established surveys. It comprises two subcategories: predictive and concurrent. Predictive validity refers to the extent to which a survey measure forecasts future performance. A graduate school entry examination that predicts who will do well in graduate school has predictive validity. Concurrent validity is demonstrated when two assessments agree, or when a new measure compares favorably with one that is already considered valid. For example, to establish the concurrent validity of a new survey, the survey researcher can either administer the new and the validated measure to the same group of respondents and compare the responses, or administer the new instrument to the respondents and compare the responses to experts' judgment. A high correlation between the new survey and the criterion indicates concurrent validity. Establishing concurrent validity is useful when a new measure is created that claims to be better (shorter, cheaper, and fairer).

URL: https://www.sciencedirect.com/science/article/pii/B9780080448947002967

Validity, Data Sources

Michael P. McDonald, in Encyclopedia of Social Measurement, 2005

Criterion Validity

Criterion validity is the comparison of a measure against a single measure that is supposed to be a direct measure of the concept under study. Perhaps the simplest example of the use of the term validity is found in efforts of the American National Election Study (ANES) to validate the responses of respondents to the voting question on the post-election survey. Surveys, including the ANES, consistently estimate a measure of the turnout rate that is unreliable and biased upwards. A greater percentage of people respond that they voted than official government statistics of the number of ballots cast indicate.

To explore the reliability of the measure of turnout, ANES compared a respondent's answer to the voting question against actual voting records. A respondent's registration was also validated. While this may sound like the ideal case of validating a fallible human response against an infallible record of voting, the actual records are not without measurement error. Some people refuse to provide names or give incorrect names, either on registration files or to the ANES. Votes may be improperly recorded. Some people lived outside the area where they were surveyed, and their records went unchecked. In 1984, ANES even discovered voting records in a garbage dump. The ANES consistently could not find voting records for 12–14% of self-reported voters. In 1991, the ANES revalidated the 1988 survey and found that 13.7% of the revalidated cases produced different results than the cases initially validated in 1989. These discrepancies reduced confidence in the reliability of the ANES validation effort and, given the high costs of validation, the ANES decided to drop validation efforts on the 1992 survey.

The preceding example is of criterion validity, where the measure to be validated is correlated with another measure that is a direct measure of the phenomenon of concern. A positive correlation between the measure and the measure it is compared against is all that is needed as evidence that a measure is valid. In some sense, criterion validity is without theory. “If it were found that accuracy in horseshoe pitching correlated highly with success in college, horseshoe pitching would be a valid measure of predicting success in college” (Nunnally, as quoted in the work of Carmines and Zeller). Conversely, no correlation, or worse, a negative correlation, would be evidence that a measure is not a valid measure of the same concept.

As the example of ANES vote validation demonstrates, criterion validity is only as good as the validity of the reference measure to which one is making a comparison. If the reference measure is biased, then valid measures tested against it may fail to find criterion validity. Ironically, two similarly biased measures will corroborate one another, so a finding of criterion validity is no guarantee that a measure is indeed valid.

Carmines and Zeller argue that criterion validation has limited use in the social sciences because often there exists no direct measure to validate against. That does not mean, however, that criterion validation cannot be useful in certain contexts. For example, Schrodt and Gerner compared machine coding of event data against that of human coding to determine the validity of the coding by computer. The validity of the machine coding is important to these researchers, who identify conflict events by automatically culling through large volumes of newspaper articles. As similar large-scale data projects emerge in the information age, criterion validation may play an important role in refining the automated coding process.

URL: https://www.sciencedirect.com/science/article/pii/B0123693985000463

Collecting Data Through Measurement in Experimental-Type Research

Elizabeth DePoy PhD, MSW, OTR, Laura N. Gitlin PhD, in Introduction to Research (Fifth Edition), 2016

Concurrent

In concurrent criterion validity, there is a known standardized instrument or other criterion that measures or demonstrates the same underlying concept of interest (lexically defined in the same way). It is not unusual for researchers to develop instrumentation to measure concepts, even if prior instrumentation already exists.

Assume you are interested in developing your own measure of self-esteem. To establish concurrent validity, you will administer your instrument along with an accepted and validated instrument measuring the same concept to the same sample of individuals. The extent of agreement between the two measures, expressed as a correlation coefficient, will tell you whether your scale is accurately measuring the same construct operationalized in the validated scale.

This form of validity can only be used if another criterion (existing validated instrument) exists. If so, the concurrent form is only as good as the validity of the selected criterion.

You have developed a measure of physical functioning that you believe is more precise than existing measures. You are interested in examining the relationship between your measure and previously evaluated instruments, and you expect a strong relationship because they should be measuring the same underlying construct.

URL: https://www.sciencedirect.com/science/article/pii/B9780323261715000173

Epidemiology

Martin Prince, in Core Psychiatry (Third Edition), 2012

Criterion validity

Testing of criterion validity requires a ‘gold standard’, technically the very thing that one is setting out to measure. In psychiatry there are generally no biologically based criterion measures as, for example, bronchoscopy and biopsy for carcinoma of the bronchus. Much research is currently orientated to the identification of such ‘biomarkers’. The first measures developed for psychiatric research were compared with the criterion or ‘gold standard’ of a competent psychiatrist's clinical diagnosis. Currently, the most commonly used paradigm in psychiatric research is the clinician semi-structured interview, generally the Schedules for Clinical Assessment in Neuropsychiatry (SCAN) or the Structured Clinical Interview for DSM Disorders (SCID), applying ICD-10 or DSM-IV diagnoses. There are several problems that are implicit in this process (Kessler et al 2004). First, the reliability of clinician semi-structured interviews is by no means perfect, particularly in community-based research, and this random error will set an upper limit on the validity coefficients that are likely to be observed. Second, repetition of comprehensive mental state assessments has been shown to be associated with systematic underreporting of symptoms on the second interview compared with the first, again tending towards an underestimate of true validity. Third, the DSM and ICD diagnostic criteria themselves are not fully operationalized, and differing judgments made in the algorithms accompanying the test assessment and gold standard research interviews may be another source of discrepancy. Fourth, Cohen's kappa varies with the prevalence of the disorder even when specificity and sensitivity are constant, limiting the utility of this validity coefficient in comparing the validity of a measure across different populations.
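The fourth point above, that Cohen's kappa varies with prevalence even when sensitivity and specificity are held constant, can be illustrated in a few lines (the function name and the 0.8 sensitivity/specificity values are illustrative, not from the chapter):

```python
def kappa_from_rates(prevalence, sensitivity, specificity):
    """Cohen's kappa for a test against a gold standard, derived
    from disease prevalence and the test's sensitivity/specificity."""
    p = prevalence
    # Observed agreement: true positives plus true negatives
    po = p * sensitivity + (1 - p) * specificity
    # Proportion the test classifies as positive
    q = p * sensitivity + (1 - p) * (1 - specificity)
    # Chance agreement between test and gold standard
    pe = q * p + (1 - q) * (1 - p)
    return (po - pe) / (1 - pe)

# Same sensitivity and specificity (0.8/0.8), different prevalences
k_high = kappa_from_rates(0.5, 0.8, 0.8)    # kappa = 0.60
k_low  = kappa_from_rates(0.05, 0.8, 0.8)   # kappa ≈ 0.22, markedly lower
```

The identical test performs "worse" by kappa in the low-prevalence population, which is exactly why kappa is a poor basis for comparing a measure's validity across populations with different base rates.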

URL: https://www.sciencedirect.com/science/article/pii/B9780702033971000094

Validity

Emma K. Stokes, in Rehabilitation Outcome Measures, 2011

What do the statistics tell us?

The output of criterion validity and convergent validity (an aspect of construct validity discussed later) will be validity coefficients. These are products of correlating the scores obtained on the new instrument with a gold standard or with existing measurements of similar domains. The validity coefficients can range from −1 to +1.

Pearson product-moment correlation coefficient (PPMCC) – considers the linear relationship between OMs where the OM measures interval or ratio data, e.g. time, range of motion in degrees. If the data from two sets of measurements are plotted against one another, the PPMCC illustrates the closeness of the points to a straight line: +1 or −1 indicating a strong relationship, and 0 indicating no relationship at all.

Spearman's rank order correlation (ρ). When x and y are not linearly related, but show a consistently increasing or decreasing trend in rank, a non-parametric correlation such as Spearman's rho may be employed. It assumes that the data are at least ordinal, e.g. many of the OMs in rehabilitation.

Kendall's rank order correlation (τ). This coefficient may be applied in cases where Spearman's rho is appropriate.

Phi coefficient (φ). Used in the analysis of data that are dichotomous, i.e. presence or absence of a disease.

Box 5.2 indicates how we might interpret these results for clinical practice.
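As a minimal sketch of how a rank-based coefficient differs from the PPMCC (pure Python, for illustration only), Spearman's rho can be computed as the Pearson correlation of the ranks:

```python
def rank(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of tied positions, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the ranks of x and y."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# A monotonic but non-linear relationship still gives rho = 1
scores_a = [1, 2, 3, 4, 5]
scores_b = [1, 4, 9, 16, 25]
rho = spearman_rho(scores_a, scores_b)  # rho = 1.0
```

Because only the ordering of scores matters, rho suits the ordinal outcome measures common in rehabilitation, whereas the PPMCC would understate this (perfectly monotonic) relationship.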

URL: https://www.sciencedirect.com/science/article/pii/B978044306915400005X

Pharmacology

H P Rang, in Drug Discovery and Development (Second Edition), 2013

The choice of model

Apart from resource limitations, regulatory constraints on animal experimentation, and other operational factors, what governs the choice of disease model?

As discussed in Chapter 2, naturally occurring diseases produce a variety of structural and biochemical abnormalities, and these are often displayed separately in animal models. For example, human allergic asthma involves: (a) an immune response; (b) increased airways resistance; (c) bronchial hyperreactivity; (d) lung inflammation; and (e) structural remodelling of the airways. Animal models, mainly based on guinea pigs, whose airways behave similarly to those of humans, can replicate each of these features, but no single model reproduces the whole spectrum. The choice of animal model for drug discovery purposes, therefore, depends on the therapeutic effect that is being sought. In the case of asthma, existing bronchodilator drugs effectively target the increased airways resistance, and steroids reduce the inflammation, and so it is the other components for which new drugs are particularly being sought.

A similar need for a range of animal models covering a range of therapeutic targets applies in many disease areas.

Validity criteria

Obviously an animal model produced in a laboratory can never replicate exactly a spontaneous human disease state, so on what basis can we assess its ‘validity’ in the context of drug discovery?

Three types of validity criteria were originally proposed by Willner (1984) in connection with animal models of depression. These are:

Face validity

Construct validity

Predictive validity.

Face validity refers to the accuracy with which the model reproduces the phenomena (symptoms, clinical signs and pathological changes) characterizing the human disease.

Construct validity refers to the theoretical rationale on which the model is based, i.e. the extent to which the aetiology of the human disease is reflected in the model. A transgenic animal model in which a human disease-producing mutation is replicated will have, in general, good construct validity, even if the manifestations of the human disorder are not well reproduced (i.e. it has poor face validity).

Predictive validity refers to the extent to which the effect of manipulations (e.g. drug treatment) in the model is predictive of effects in the human disorder. It is the most pragmatic of the three and the most directly relevant to the issue of predicting therapeutic efficacy, but also the most limited in its applicability, for two main reasons. First, data on therapeutic efficacy are often sparse or non-existent, because no truly effective drugs are known (e.g. for Alzheimer's disease, septic shock). Second, the model may focus on a specific pharmacological mechanism, thus successfully predicting the efficacy of drugs that work by that mechanism but failing with drugs that might prove effective through other mechanisms. The knowledge that the first generation of antipsychotic drugs act as dopamine receptor antagonists enabled new drugs to be identified by animal tests reflecting dopamine antagonism, but these tests cannot be relied upon to recognize possible ‘breakthrough’ compounds that might be effective by other mechanisms. Thus, predictive validity, relying as it does on existing therapeutic knowledge, may not be a good basis for judging animal models where the drug discovery team's aim is to produce a mechanistically novel drug. The basis on which predictive validity is judged carries an inevitable bias, as the drugs that proceed to clinical trials will normally have proved effective in the model, whereas drugs that are ineffective in the model are unlikely to have been developed. As a result, there are many examples of tests giving ‘false positive’ expectations, but very few false negatives, giving rise to a commonly held view that conclusions from pharmacological tests tend to be overoptimistic.

Some examples

We conclude this discussion of the very broad field of animal models of disease by considering three disease areas, namely epilepsy, psychiatric disorders and stroke. Epilepsy-like seizures can be produced in laboratory animals in many different ways. Many models have been described and used successfully to discover new anti-epileptic drugs (AEDs). Although the models may lack construct validity and are weak on face validity, their predictive validity has proved to be very good. With models of psychiatric disorders, face validity and construct validity are very uncertain, as human symptoms are not generally observable in animals and because we are largely ignorant of the cause and pathophysiology of these disorders; nevertheless, the predictive validity of available models of depression, anxiety and schizophrenia has proved to be good, and such models have proved their worth in drug discovery. In contrast, the many available models of stroke are generally convincing in terms of construct and face validity, but have proved very unreliable as predictors of clinical efficacy. Researchers in this field are ruefully aware that despite many impressive effects in laboratory animals, clinical successes have been negligible.

Epilepsy models

The development of antiepileptic drugs, from the pioneering work of Merritt and Putnam, who in 1937 developed phenytoin, to the present day, has been highly dependent on animal models involving experimentally induced seizures, with relatively little reliance on knowledge of the underlying physiological, cellular or molecular basis of the human disorder. Although existing drugs have significant limitations, they have brought major benefits to sufferers from this common and disabling condition – testimony to the usefulness of animal models in drug discovery.

Human epilepsy is a chronic condition with many underlying causes, including head injury, infections, tumours and genetic factors. Epileptic seizures in humans take many forms, depending mainly on where the neural discharge begins and how it spreads.

Some of the widely used animal models used in drug discovery are summarized in Table 11.2. The earliest models, namely the maximal electroshock (MES) test and the pentylenetetrazol-induced seizure (PTZ) test, which are based on acutely induced seizures in normal animals, are still commonly used. They model the seizure, but without distinguishing its localization and spread, and do not address either the chronicity of human epilepsy or its aetiology (i.e. they score low on face validity and construct validity). But, importantly, their predictive validity for conventional antiepileptic drugs in man is very good, and the drugs developed on this basis, taken regularly to reduce the frequency of seizures or eliminate them altogether, are of proven therapeutic value. Following on from these acute seizure models, attempts have been made to replicate the processes by which human epilepsy develops and continues as a chronic condition with spontaneous seizures, i.e. to model epileptogenesis (Löscher, 2002; White, 2002) by the use of models that show greater construct and face validity. This has been accomplished in a variety of ways (see Table 11.2) in the hope that such models would be helpful in developing drugs capable of preventing epilepsy. Such models have thrown considerable light on the pathogenesis of epilepsy, but have not so far contributed significantly to the development of improved antiepileptic drugs. Because there are currently no drugs known to prevent epilepsy from progressing, the predictive validity of epileptogenesis models remains uncertain.

Psychiatric disorders

Animal models of psychiatric disorders are in general problematic, because in many cases the disorders are defined by symptoms and behavioural changes unique to humans, rather than by measurable physiological, biochemical or structural abnormalities. This is true in conditions such as schizophrenia, Tourette's syndrome and autism, making face validity difficult to achieve. Depressive symptoms, in contrast, can be reproduced to some extent in animal models (Willner and Mitchell, 2002), and face validity is therefore stronger. The aetiology of most psychiatric conditions is largely unknown, making construct validity questionable.

Models are therefore chosen largely on the basis of predictive validity, and suffer from the shortcomings mentioned above. Nonetheless, models for some disorders, particularly depression, have proved very valuable in the discovery of new drugs. Other disorders, such as autism and Tourette's syndrome, have proved impossible to model so far, whereas models for others, such as schizophrenia (Lipska and Weinberger, 2000; Moser et al., 2000), have been described but are of doubtful validity. The best prediction of antipsychotic drug efficacy comes from pharmacodynamic models reflecting blockade of dopamine and other monoamine receptors, rather than from putative disease models, with the result that drug discovery has so far failed to break out of this mechanistic straitjacket.

Stroke

Many experimental procedures have been devised to produce acute cerebral ischaemia in laboratory animals, resulting in long-lasting neurological deficits that resemble the sequelae of strokes in humans (Small and Buchan, 2000). Interest in this area has been intense, reflecting the fact that strokes are among the commonest causes of death and disability in developed countries, and that there are currently no drugs that significantly improve the recovery process. Studies with animal models have greatly advanced our understanding of the pathophysiological events. Stroke is no longer seen as simple anoxic death of neurons, but rather as a complex series of events involving neuronal depolarization, activation of ion channels, release of excitatory transmitters, disturbed calcium homeostasis leading to calcium overload, release of inflammatory mediators and nitric oxide, generation of reactive oxygen species, disturbance of the blood–brain barrier and cerebral oedema (Dirnagl et al., 1999). Glial cells, as well as neurons, play an important role in the process. Irreversible loss of neurons takes place gradually as this cascade builds up, leading to the hope that intervention after the primary event – usually thrombosis – could be beneficial. Moreover, the biochemical and cellular events involve well-understood signalling mechanisms, offering many potential drug targets, such as calcium channels, glutamate receptors, scavenging of reactive oxygen species and many others. Ten years ago, on the basis of various animal models with apparently good construct and face validity and a range of accessible drug targets, the stage seemed to be set for major therapeutic advances. Drugs of many types, including glutamate antagonists, calcium and sodium channel blocking drugs, anti-inflammatory drugs, free radical scavengers and others, produced convincing degrees of neuroprotection in animal models, even when given up to several hours after the ischaemic event. 
Many clinical trials were undertaken (De Keyser et al., 1999), with uniformly negative results. The only drug currently known to have a beneficial – albeit small – effect is the biopharmaceutical ‘clot-buster’ tissue plasminogen activator (TPA), widely used to treat heart attacks. Stroke models thus represent approaches that have revealed much about pathophysiology and have stimulated intense efforts in drug discovery, but whose predictive validity has proved to be extremely poor, as the drug sensitivity of the animal models seems to be much greater than that of the human condition. Surprisingly, it appears that whole-brain ischaemia models show better predictive validity (i.e. poor drug responsiveness) than focal ischaemia models, even though the latter are more similar to human strokes.

URL: https://www.sciencedirect.com/science/article/pii/B9780702042997000111

Measurement of pelvic floor muscle function and strength, and pelvic organ prolapse

Kari Bø, ... James A. Ashton-Miller, in Evidence-Based Physical Therapy for the Pelvic Floor (Second Edition), 2015

Validity

Several investigators have studied criterion validity of vaginal palpation comparing vaginal palpation and vaginal squeeze pressure (McKey and Dougherty, 1986; Hahn et al., 1996; Isherwood and Rane, 2000; Bø and Finckenhagen, 2001; Jarvis et al., 2001; Kerschan-Schindel et al., 2002).

Isherwood and Rane (2000) compared vaginal palpation using the Oxford grading system with an arbitrary scale on a perineometer from 1 to 12. They found a high kappa of 0.73. In contrast, Bø and Finckenhagen (2001) found a kappa of 0.37 comparing the Oxford grading system with vaginal squeeze pressure. Heitner (2000) concluded that lift was most reliably tested with palpation, and that all other measures of muscle function were better tested with EMG.

Hahn et al. (1996) found that there was a better correlation of vaginal palpation and pressure measurement in continent than in incontinent women (r = 0.86 and 0.75, respectively). This was supported by Morin et al. (2004) comparing vaginal palpation with dynamometry, finding r = 0.73 in continent and r = 0.45 in incontinent women, respectively.

URL: https://www.sciencedirect.com/science/article/pii/B9780702044434000054

Research methods, statistics and evidence-based practice

Andrew M. McIntosh, ... Stephen M. Lawrie, in Companion to Psychiatric Studies (Eighth Edition), 2010

3 Appraising the evidence

The evidence needs to be critically appraised for its scientific validity and clinical importance. Validity criteria are essentially the same as the questions to be answered in critical appraisal (see Sackett et al 1997); while clinical importance can be determined by some of the summary measures EBM practitioners find useful (particularly for treatment studies).

Appraisal example: Although you know that Evidence-Based Medicine only selects for inclusion treatment trials with random allocation, clinically important outcome measures and consistent data analysis, no system is infallible, and therefore you evaluate the paper for yourself (Fairburn et al 1995), following the checklist for treatment studies (Box 9.10).

The paper compares the outcome after cognitive-behavioural therapy (CBT), interpersonal therapy (IPT) and behaviour therapy. Treatment allocation was random (although the paper does not mention whether or not the randomisation list was concealed); 90% of the patients were interviewed at follow-up and the groups were analysed as randomised; the treatment was not blind (but outcome assessment was); and the groups were treated equally other than with the interventions of interest and did not differ significantly at the start of the trial. You decide therefore that the study is valid.

At this point, it is worth briefly reviewing some of the measures of clinical effectiveness and how to calculate them for treatment studies. We are primarily interested in comparing the proportion of patients treated with a new treatment who get the outcome of interest – or the experimental event rate (EER) – with the proportion of patients treated with an alternative (standard) treatment who get the outcome of interest — or control event rate (CER). The difference between these two outcome rates is the absolute risk reduction (ARR), i.e. CER − EER (for an undesired outcome) expressed as a percentage. This tells us the difference in the number of patients with a specific outcome for every 100 patients treated in either way. The next term to introduce transforms this ARR into a more clinically useful number – the number needed to treat (NNT) – which is simply the reciprocal of the ARR and tells us how many such patients we would need to treat in a particular way so as to avoid one outcome event. As a rough rule of thumb, NNTs of less than 10 usually denote a powerful and important treatment effect.

The results given in the paper are rates of still satisfying diagnostic criteria for bulimia at the end of the study: 37% for CBT, 28% for IPT and 86% for simple behavioural therapy. The ARR and NNT compared with simple behavioural therapy are therefore 49% (86% − 37%) and 2 (95% confidence interval 1 to 4) for CBT, and 58% (86% − 28%) and 2 (1 to 3) for IPT. There seems little to choose between CBT and IPT, but the summary states that patients receiving CBT were less likely to have symptoms than those receiving either IPT or behavioural therapy, and that CBT complete remission rates were highest.
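The ARR and NNT arithmetic can be reproduced in a few lines (a sketch using the quoted trial rates; the function name is illustrative, and NNT is rounded to the nearest whole number here to match the values reported, though rounding conventions for NNT vary):

```python
def arr_and_nnt(cer, eer):
    """Absolute risk reduction and number needed to treat.
    cer, eer: control and experimental event rates, as proportions,
    for an undesired outcome."""
    arr = cer - eer
    nnt = 1 / arr
    return arr, nnt

# Rates of still meeting diagnostic criteria for bulimia at study end
cer = 0.86  # simple behavioural therapy (the comparison treatment)

arr_cbt, nnt_cbt = arr_and_nnt(cer, 0.37)  # ARR 0.49, NNT ≈ 2
arr_ipt, nnt_ipt = arr_and_nnt(cer, 0.28)  # ARR 0.58, NNT ≈ 2
```

An NNT of about 2 means roughly one additional remission for every two patients treated with CBT or IPT rather than simple behavioural therapy, comfortably inside the rule-of-thumb threshold of 10 mentioned above.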

URL: https://www.sciencedirect.com/science/article/pii/B9780702031373000097

What are the four types of validity?

Table of contents

Construct validity
Content validity
Face validity
Criterion validity

Which type of validity examines whether a test can predict an outcome?

Criterion validity (or criterion-related validity) measures how well one measure predicts an outcome for another measure. A test has this type of validity if it is useful for predicting performance or behavior in another situation (past, present, or future).

What type of validity is a measure of how well a particular test correlates with a previously validated measure?

Concurrent validity – an operationalization's ability to distinguish between groups that it should theoretically be able to distinguish between. This is where a test correlates well with a measure that has been previously validated.

What type of validity is predictive validity?

Predictive validity is a type of criterion validity, which refers to how well the measurement of one variable can predict the response of another variable.