
Deposition & Cross-Examination Questions on Tests & Psychometrics

Kenneth S. Pope, Ph.D., ABPP
James N. Butcher, Ph.D.
Joyce Seelen, Esq.


This chapter was written to provide a guide for attorneys and to help expert witnesses prepare for depositions and cross-examination. Practicing answers to the types of questions in this chapter can strengthen the expert's expertise by allowing him or her to identify and address areas of weakness. Practice can also enable experts to become more comfortable and articulate in responding to a carefully prepared, vigorous, informed cross-examination.

The chapter from which this section is excerpted organizes over 100 questions into 14 basic categories, moving from information about initial contacts, financial factors, and the expert's background to the details of the expert's professional opinions. The following section presents and discusses questions focusing on tests per se and psychometrics.

What is a psychological test?

Surprisingly, some of those who administer, score, and interpret standardized psychological tests may never have thought carefully about what a test is. Initial inquiry at this fundamental level may help an attorney to begin assessing the degree to which an individual has genuine expertise and understands the nature of testing as opposed to following a "cookbook" method of test use or "improvising" opinions. The individual's response may also help the attorney to assess the degree to which the individual can communicate clearly and concretely to a judge or jury. Some individuals may be quite knowledgeable in the area of psychometrics and inferences from test data but may be incapable of putting their knowledge into words that can be understood by those without special training (see chapter 4).

One possible answer to this initial question was proposed by Cronbach in the original edition of his classic text (1960; see also Cronbach, 1990): "A test is a systematic procedure for comparing the behavior of two or more persons" (p. 21). Note that the "behavior" may be oral (e.g., an individual telling what he or she sees when looking at a Rorschach card) or written (e.g., marking down "true" or "false" responses on the MMPI).

One of the most important aspects of the definition suggested by Cronbach is that the procedure for comparing behavior is systematic. For many tests, the system used to measure and compare behavior is standardized. The MMPI, like the Rorschach, the WAIS-III, and the Halstead-Reitan Neuropsychological Test Battery, is a standardized test. A standardized test presents a standardized set of questions or other stimuli (such as inkblots) under generally standardized conditions; responses from the individual are collected in a standardized format and scored and interpreted according to certain standardized norms or criteria. The basic idea is that all individuals who take the test take the same test under the same conditions. Obviously, not all aspects are exactly equivalent. One individual may take the test during the day in a large room; another may take the test at night in a small room. The assumption is, however, that in all essential respects (i.e., those that might significantly affect test performance), all individuals are taking the "same" test.

Because characteristics of the individual taking the test or the testing circumstances may significantly influence test results and interpretations, experts must be aware of the research literature that addresses these factors. For some tests, it may tend to make a difference whether the examiner and the examinee are similar or different in terms of gender, race, or age. For most popular tests, systematic investigations have indicated which factors need to be taken into account in the scoring or interpretation of the tests so that extraneous or confounding factors do not distort the results.

Later sections of this chapter focus in more detail on such aspects as administration (e.g., whether the expert followed the standard procedures for administering the test or whether special individual characteristics or testing circumstances were adequately taken into account and discussed in the forensic report); the basic issue in this section is assessing the deponent's understanding and ability to communicate the fundamental nature of a standardized test.

Are you able to distinguish retrospective accuracy from predictive accuracy?

[Alternative or follow-up questions could involve distinguishing sensitivity from specificity, or Type I error from Type II error, as noted in the Glossary of this book.]

This is a simple yes-or-no question. If the expert indicates understanding of these concepts, the attorney may want to ask a few follow-up questions to ensure that the answer is accurate.

If the expert replies "no," then the attorney may consider a subsequent question such as, "So would it be fair to say that you did not take these two concepts into account in your assessment?" If the witness has indicated inability to distinguish between the two concepts, he or she is in a particularly poor position to assert subsequently that the concepts were taken into account in the assessment.

If the witness does indicate that although he or she is unable to distinguish the two forms of accuracy, he or she nevertheless took them into account in the assessment, the attorney may ask the witness to explain the meaning of these two seemingly contradictory statements and how the two forms of accuracy were taken into account in the assessment.

On the other hand, if the witness testifies that it would be a fair statement that retrospective and predictive accuracy were not taken into account in the assessment, then the attorney may ask additional questions to clarify that the witness has no information to provide regarding the two forms of accuracy, cannot discuss any of his or her professional opinions in terms of these forms of accuracy, did not weigh (when selecting the test or tests to be administered) the types of available tests or evaluate the test results in light of these forms of accuracy, and so on.

The two concepts are simple but are crucial to understanding testing that is based on standardized instruments such as the MMPI-2 or MMPI-A. Assume that a hypothetical industrial firm announces that it has developed a way to use the MMPI-2 to identify employees who have shoplifted. According to its claims (which one should greet with skepticism), the MMPI-2, as the firm scores and interprets it, is now a test of shoplifting. Predictive accuracy begins with the test score. This hypothetical new MMPI-2 score (or profile) will be either positive (suggesting that the employee who took the test is a shoplifter) or negative (suggesting that the individual is not a shoplifter). The predictive accuracy of this new test is the probability, given a positive score, that the employee actually is a shoplifter, and the probability, given a negative score, that the individual is not a shoplifter. Thus, predictive accuracy, as the name implies, refers to the degree (expressed as a probability) to which a test is accurate in classifying individuals or in predicting whether or not they have a specific condition, characteristic, and so on.

Retrospective accuracy, on the other hand, begins not with the test but with the specific condition or characteristic that the test is purported to measure. In the example above, the retrospective accuracy of this hypothetical MMPI-2 shoplifting test denotes the degree (expressed as a probability) that an employee who is a shoplifter will be correctly identified (i.e., caught) by the test.
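
To make the distinction concrete, here is a minimal numerical sketch in Python. The 2 x 2 classification table is entirely hypothetical; the counts are invented solely to show that the two forms of accuracy can diverge sharply for the same test.

    # Hypothetical counts for the imaginary MMPI-2 "shoplifting" score
    # described above (all numbers invented for illustration).
    true_pos = 45     # shoplifters whom the test flags as positive
    false_neg = 5     # shoplifters whom the test misses
    false_pos = 450   # nonshoplifters whom the test flags as positive
    true_neg = 9500   # nonshoplifters whom the test correctly clears

    # Retrospective accuracy (sensitivity): of the actual shoplifters,
    # what proportion does the test catch?
    retrospective = true_pos / (true_pos + false_neg)

    # Predictive accuracy of a positive score (positive predictive value):
    # of those the test flags, what proportion are actually shoplifters?
    predictive = true_pos / (true_pos + false_pos)

    print(f"retrospective accuracy = {retrospective:.2f}")  # 0.90
    print(f"predictive accuracy    = {predictive:.2f}")     # 0.09

With these invented counts, the test catches 90% of shoplifters, yet fewer than one in ten employees it flags is actually a shoplifter.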

Confusing the "directionality" of the inference (e.g., the likelihood that those who score positive on a hypothetical predictor variable will fall into a specific group versus the likelihood that those in a specific group will score positive on the predictor variable) is, in a more general sense, a cause of numerous errors in assessment and in testimony on assessment, assessment instruments, and assessment techniques. Cross-examination must carefully explore the degree to which testimony may be based on such misunderstandings.

Psychologist Robyn Dawes (1988a) provided a vivid example. Assume that the predictor is cigarette smoking (i.e., whether an individual smokes cigarettes) and that what is predicted is the development of lung cancer. Dawes observed that there is around a 99% chance (according to the actuarial tables) that an individual who has lung cancer is a chronic smoker. This impressive statistic seems to imply that chronic smoking would be an extremely effective predictor of whether an individual will develop lung cancer. But the chances that a chronic smoker will develop lung cancer are (again, according to the actuarial tables) only around 10%.

Using these same statistics in another context, an expert witness might indicate reasonable certainty that, on the basis of a defendant's showing a particular MMPI profile, the defendant is a rapist. The witness's foundation for such an assertion might be that a research study of 100 rapists indicated that virtually all of them showed that particular MMPI profile (similar to the statistics indicating that virtually all people with lung cancer have been smokers). The problem is in trying to make the prediction in the other direction: What percentage of all individuals (i.e., a comprehensive and representative sample that includes a full spectrum of nonrapists as well as rapists) showing that particular MMPI profile are not rapists? Without this latter information (based on solid, independently conducted research published in peer-reviewed scientific or professional journals), there is no way to determine whether the particular MMPI profile is effective in identifying rapists. To borrow again from the statistics on lung cancer, it may indeed be true that 99 or 100% of a sample of rapists showed the particular profile, but it may also be true that only about 10% of the individuals who show that profile are rapists. Thus, the evidence that the witness is presenting would actually suggest that there is a 90% chance that the defendant was not a rapist.
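
The reversal can be computed directly with Bayes' theorem. In the sketch below, the base rate and false-positive rate are assumptions chosen so that the numbers echo the proportions in Dawes's example: the retrospective figure is 99%, yet the predictive figure comes out near 10%.

    # Hypothetical figures loosely echoing Dawes's lung-cancer example.
    base_rate = 0.005          # assumed prevalence of the condition
    p_sign_given_cond = 0.99   # P(positive sign | condition): retrospective
    p_sign_given_none = 0.045  # assumed false-positive rate

    # Bayes' theorem: P(condition | positive sign)
    p_sign = (p_sign_given_cond * base_rate
              + p_sign_given_none * (1 - base_rate))
    p_cond_given_sign = p_sign_given_cond * base_rate / p_sign

    print(f"P(sign | condition) = {p_sign_given_cond:.2f}")  # 0.99
    print(f"P(condition | sign) = {p_cond_given_sign:.2f}")  # about 0.10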

The confusion of predictive and retrospective accuracy may be related to the logical fallacy known as affirming the consequent. In this fallacy, the fact that x implies y is erroneously used as a basis for inferring that y implies x. Logically, the fact that all versions of the MMPI are standardized psychological tests does not imply that all standardized psychological tests are versions of the MMPI.

When selecting a standardized psychological assessment instrument, what aspects of validity do you consider?

Expertise in MMPI-2 or MMPI-A administration, scoring, and interpretation requires at least a basic knowledge of validity issues (see, e.g., Standards for Educational and Psychological Testing). Although follow-up questions—keyed to the content and detail of the initial response—are necessary, beginning inquiry in the area of validity by asking an open-ended question during the deposition can enable an attorney to obtain a general idea of how knowledgeable the deponent is in this area.

The attorney can assess the degree to which the deponent's initial response addresses the various kinds of validity. Although there are a variety of ways in which validity can be viewed and assessed, Cronbach (1960) set forth four basic types.

Predictive validity indicates the degree to which test results are accurate in forecasting some future outcome. For example, the MMPI-2 may be administered to all individuals who seek services from a community mental health center. The results may be used to predict which patients will be able to participate in and benefit from group therapy. Research on the MMPI-2's predictive validity for this purpose would explore possible systematic relationships between MMPI-2 profiles and patient responses to group therapy. The responses to group therapy might be measured in any number of ways, including the group therapist's evaluation of the patient's participation and progress, the patient's own self-report or self-assessment, and careful evaluation by independent clinicians.

Concurrent validity indicates the degree to which test results provide a basis for accurately assessing some other current performance or condition. For example, a clinician or researcher might develop the hypothesis that certain MMPI-2 profiles are pathognomonic signs of certain clinical diagnostic groups. (A pathognomonic sign is one whose presence invariably indicates the presence of a clinical diagnosis.) To validate (or invalidate) this hypothesis, MMPI-2 profiles might be compared with the diagnoses as currently determined in a clinic by more detailed, comprehensive, and time-consuming methods of assessment (e.g., extended clinical interviews conducted by independent clinicians in conjunction with a history of the individuals and a comprehensive battery of other psychological and neuropsychological tests). If the MMPI-2 demonstrates adequate concurrent validity in terms of this hypothesis, the MMPI-2 could be substituted—at least in certain situations—for the more elaborate and time-consuming methods of assessing diagnosis.

Content validity indicates the degree to which a test adequately and accurately represents the wider category of performance from which its content is drawn. For example, the bar examination and the psychology licensing examination supposedly measure some of the basic knowledge, skills, or abilities necessary to practice as an attorney or a psychologist. The degree to which such examinations accurately reflect or represent this larger domain is their content validity.

Construct validity indicates the degree to which a test accurately indicates the presence of a presumed characteristic that is described (or hypothesized) by some theoretical or conceptual framework. For example, a researcher might develop a theory that there are four basic interpersonal styles that attorneys use in developing rapport with juries. According to this theory, each attorney uses the one basic style that is most consistent with his or her core personality. The researcher then hypothesizes that these styles can be identified according to an attorney's MMPI-2 profile (i.e., the researcher theorizes that one set of MMPI-2 profiles indicates a Type One core personality and a Type One interpersonal style for developing rapport with a jury, another set of MMPI-2 profiles indicates a Type Two core personality and a Type Two interpersonal style, etc.). Assessing the validity of such possible indicants of a theoretical construct is a complex task that involves attention to other external sources of information thought to be relevant to the construct, intercorrelations of test items, and examination of individual differences in responding to the test items (see Standards for Educational and Psychological Testing).
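
In practice, evidence for predictive or concurrent validity is often summarized as a validity coefficient: the correlation between test scores and a criterion measure. The following sketch uses invented scale scores and invented therapy-outcome ratings; statistics.correlation requires Python 3.10 or later.

    from statistics import correlation  # available in Python 3.10+

    # Hypothetical scale scores and later ratings of response to group
    # therapy (both invented for illustration).
    test_scores = [52, 61, 48, 70, 55, 66, 45, 59]
    outcomes    = [3.1, 4.0, 2.5, 4.6, 3.0, 4.2, 2.2, 3.6]

    r = correlation(test_scores, outcomes)
    print(f"validity coefficient: r = {r:.2f}")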

Conceptualizations about test validity continue to emerge and constructs continue to evolve. Those interested in reviewing the evolving understanding of test validity are encouraged to read Geisinger's (1992) fascinating account.

When selecting psychological assessment instruments, what aspects of reliability do you consider?

A basic knowledge of reliability issues is also—as with validity issues—fundamental to expertise in MMPI-2 or MMPI-A administration, scoring, and interpretation (see, e.g., Standards for Educational and Psychological Testing). Again, an open-ended question may be the best approach to this area of inquiry during the deposition.

Reliability refers to the degree to which a test produces results that are free of measurement error. If there were no measurement error at all, test results would be perfectly consistent.

Reliability is another way of describing how consistent the results of a test are. Consider the following hypothetical situation. For the purposes of the example, assume that there are two completely identical people. If they are completely identical and if a test (such as the MMPI-2 or MMPI-A) were completely reliable (i.e., free from any measuring errors), then both people should produce the same responses to the test. However, now assume that one of these two identical people takes the test at nine a.m. when she is rested and alert. The other person takes the same test at two a.m. when she has just been awakened from a sound sleep and is tired and groggy. Differences in test results between these two otherwise identical people might be due purely to the times at which the test was administered. If the test were supposedly a measure of personality (such as the MMPI-2) and if the personalities of these hypothetically identical people were the same, then different test results do not actually represent a difference in personality but rather a difference or error in measurement (i.e., the time or conditions under which the test was administered).

Statistical techniques have been developed that indicate the degree to which a test is reliable. Such statistical analyses are often reported in the form of reliability coefficients. The coefficient will be a number that falls in the range of zero (for no reliability) to one (indicating perfect reliability).

The coefficient may indicate the reliability between subsequent administrations of the same test (e.g., administering the MMPI-2 to a group of individuals and then administering the same MMPI-2 to the same group 1 week or 1 month later). Reports of this type of reliability will often refer to the test-retest reliability (or the coefficient of stability). They may indicate, using a coefficient of equivalence, the reliability between different forms of the same test. For example, a large group of individuals might be randomly divided in half. One half would be given the original MMPI, and the other half would be given the MMPI-2; 1 week later, the half that took the original MMPI would take the MMPI-2 and vice versa. Reliability between subsequent administrations (perhaps under different conditions) of the same test is often termed stability; reliability between different forms of the same test is often termed equivalence (Cronbach, 1960).

In some instances, test items will be divided independently into two halves as a way to estimate the reliability of the test. This method estimates the split-half reliability. The correlation between the two halves is typically adjusted using the Spearman-Brown formula, and the resulting coefficient is often termed the coefficient of internal consistency.
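
A brief sketch of the split-half method, with invented true/false item responses: the items are split into odd and even halves, the two half-scores are correlated, and the Spearman-Brown formula projects that correlation to the full-length test.

    from statistics import correlation  # Python 3.10+

    # Hypothetical item responses: one row per examinee, one column per
    # item (1 = "true", 0 = "false"); all values invented.
    responses = [
        [1, 0, 1, 1, 0, 1, 1, 0],
        [0, 0, 1, 0, 0, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1],
        [1, 0, 0, 1, 0, 0, 1, 0],
        [0, 1, 1, 0, 1, 1, 0, 1],
        [1, 1, 0, 1, 1, 0, 1, 1],
    ]

    # Score the odd-numbered and even-numbered items separately.
    odd_half  = [sum(row[0::2]) for row in responses]
    even_half = [sum(row[1::2]) for row in responses]

    r_half = correlation(odd_half, even_half)

    # Spearman-Brown: estimated reliability of the full-length test.
    r_full = (2 * r_half) / (1 + r_half)

    print(f"half-test correlation      = {r_half:.2f}")
    print(f"Spearman-Brown reliability = {r_full:.2f}")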

What types of scales were involved in the various tests and methods of assessment that you considered in selecting the instruments and diagnostic frameworks that you used in the case at hand?

Different forms of measurement use different scales. The scales can refer to scores on a test or to the categories into which test responses fall. There are four basic types of scales.

The first type of scale is termed nominal. As the Latin root (nomen, meaning "name") from which we derive a number of similar English words (e.g., nominate, denomination, and nomenclature) implies, nominal scales simply provide names to two or more different categories, types, kinds, or classes. A two-category nominal scale might be invented to describe the various individuals in a courtroom: participants and observers. The same population might be described using a more detailed nominal scale with categories such as jurors, prosecution team, defense team, and so on. Note that the categories are listed in no particular order. Assigning an individual to a particular group on a nominal scale indicates only that the individual is in a group that is different from the others.

If placement of an individual (or object, verbal response, etc.) into a particular group indicates that an individual (or object, etc.) is in a different group from all others, then there can be no overlap among groups. That is to say, the groups must be mutually exclusive: Placement in one group indicates that the person, object, response, and so on, does not belong in any of the other groups. Thus, the four categories of mammals, living things, humans, and whales do not constitute a nominal scale in the sense used here because the categories are not mutually exclusive (i.e., a particular individual may be placed in more than one of the categories). Individuals who take the MMPI are asked to use a nominal scale in responding to each of the items; the scale has two values: "true" and "false."

The second type of scale does place its categories in a particular order and is termed an ordinal scale. For example, an attorney might evaluate all the cases he or she has ever tried and sort them into three categories: "easy," "moderate," and "difficult." The scale indicates that cases in the middle group were harder for the attorney to try than cases in the easy group, but there is no information about how much harder. The scale only places the items (in this instance, legal cases) in three ordered categories, each category having more (or less) of a particular attribute (such as difficulty) than the others.

The third type of scale is a particular kind of ordinal scale in which the interval between each group is the same. An example of an interval scale is any listing of the days of the week: Wednesday, Thursday, Friday, Saturday, Sunday, and so on. When events are classified according to this interval scale, it is clear that the temporal distance between Wednesday and Thursday is the same as that between Saturday and Sunday or any other two consecutive days. An important characteristic of an interval scale (that sets it apart from the fourth type of scale described below) is that there is no absolute or meaningful zero point. Some people may begin their week on Mondays, others on Saturdays, still others on Sundays; from time to time a "3-day weekend" leads into a week that "begins" on Tuesday. The Fahrenheit scale for measuring temperature is an example of an interval scale: the zero on the scale is arbitrary.

The fourth type of scale is a scale of equal intervals in which the zero point is absolute, and it is termed a ratio scale. An example of a ratio scale is one's weight. The zero point is not arbitrary. As the name of the scale implies, the ratios may be meaningfully compared. For example, a person who weighs 100 pounds is twice as heavy as a person who weighs 50 pounds, and a person who weighs 200 pounds is twice as heavy as a person who weighs 100 pounds. Such ratios do not hold for the other three types of scales. For example, because the zero point on the Fahrenheit scale is arbitrary, one cannot accurately state that 40 degrees is twice as hot as 20 degrees or that 200 degrees is twice as hot as 100 degrees.
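
The point about the arbitrary zero can be checked numerically. Converting Fahrenheit readings to Kelvin, whose zero is absolute, shows that 40 degrees Fahrenheit is nowhere near twice as hot as 20 degrees; a short sketch:

    def fahrenheit_to_kelvin(f: float) -> float:
        # Kelvin is a ratio scale: its zero point is absolute.
        return (f - 32) * 5 / 9 + 273.15

    print(40 / 20)  # 2.0, but the ratio is meaningless on an interval scale
    print(fahrenheit_to_kelvin(40) / fahrenheit_to_kelvin(20))  # about 1.04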

The deponent's explanation of these different types of scales and their meaning for different assessment instruments that were considered (e.g., the MMPI-2 or MMPI-A, the WAIS-III, the Rorschach, a sentence-completion test) will indicate the degree to which he or she understands this aspect of psychological assessment and can communicate it effectively to a judge or jury.

What is an arithmetic mean?  What is a median?  What is a mode?

These concepts are central to understanding psychological assessment in general and the MMPI-2 or MMPI-A in particular. Without an understanding of these concepts, there can be no understanding of the T scores on which the MMPI-2 or MMPI-A is based.

The mean is one of three major ways of describing the central tendency of a distribution of scores (i.e., the "center" around which all the other scores seem to cluster). The arithmetic mean can be defined statistically as the sum of all scores divided by the number of scores. In other words, the mean is the arithmetic average of the scores. The median, which is the second measure of central tendency, is that number that is in the "middle" of the distribution: half of the scores fall below the median, and the other half of the scores fall above the median. The third measure of central tendency is the mode, which indicates the score that appears most often. If there were seven IQ scores—98, 100, 102, 102, 103, 103, and 103—then the mode would be 103 because it appears most often (i.e., three times out of seven).

These concepts are easily misunderstood.  For example, an otherwise knowledgeable psychiatrist, Karl Menninger (1945), took other people to task for their statistical ignorance when he wrote:

Fortunately for the world, and thanks to the statisticians (for this, of course, is a mathematically inevitable conclusion), there are as many people whose intelligence is above the average as there are persons whose intelligence is below the average.  Calamity howlers, ignorant of arithmetic, are heard from time to time proclaiming that two-thirds of the people have less than average intelligence, failing to recognize the self-contradiction of the statement. (p. 199)

While it is possible that the number of people whose intelligence is above average is exactly the same as the number of people whose intelligence is below average, there is no necessary self-contradiction in the statement that he criticizes. On the common IQ tests, the average IQ is 100. But this number, which is a mean, does not necessarily represent the median. Consider a population of 3 people: the first has an IQ of 90, the second has an IQ of 90, and the third has an IQ of 120. The average IQ for this population is 100 (i.e., 90 + 90 + 120 = 300, and 300 divided by 3 = 100), but two-thirds of the population have less than average intelligence.
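
Both numerical examples in this section can be verified in a few lines; the sketch below simply recomputes the figures given above.

    from statistics import mean, median, mode

    iq_scores = [98, 100, 102, 102, 103, 103, 103]
    print(mean(iq_scores))    # 101.57... (the arithmetic average)
    print(median(iq_scores))  # 102 (the middle score)
    print(mode(iq_scores))    # 103 (the most frequent score)

    # Menninger's error: the mean need not split a group in half.
    population = [90, 90, 120]
    print(mean(population))   # 100, yet two-thirds of the scores fall below it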

What is a standard deviation?  What is variance?

The standard deviation is one way of describing the "scatter" of a distribution of scores—the degree to which the scores vary from the central tendency. These measures of scatter or dispersion are, like the concepts of central tendency described in the previous section, essential to understanding the T scales on which the MMPI instruments are based.

The statistical formula for the standard deviation is somewhat complicated. The mean is subtracted from each score to produce a deviation from the mean. Each of these deviations is squared. (A number is squared when the number is multiplied by itself. The square of 2—that is to say, 2 times 2—is 4; the square of 3—that is, 3 times 3—is 9; the square of 4 is 16.) These squared deviations are then added together into a total sum of squares. The total sum of squares is then divided by the number of scores. [Footnote: This is the formula for determining the variance or standard deviation in descriptive statistics. In inferential statistics, the sum of squares is divided not by the number of scores but rather by the number of scores minus one. In descriptive statistics, one is simply trying to describe the scores or numbers that are available (e.g., the IQ scores of the children in one sixth-grade classroom). In inferential statistics, one is trying to use the scores or numbers that are available—called the sample—as a basis for drawing inferences about a wider group of scores or numbers—called the population (e.g., attempting to use the IQ scores of the sample of children in one sixth-grade classroom to infer or estimate the IQ scores for the population of all sixth-grade students in the school system).]

This total sum of squares divided by the number of scores (or the number of scores minus one) is the variance (i.e., the degree to which the scores vary from or vary around the mean). The standard deviation is the square root of the variance. The larger the standard deviation, the farther the scores tend to fall from the mean.
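
The verbal description above translates directly into a few lines of Python; the sample flag switches between the descriptive (divide by the number of scores) and inferential (divide by the number of scores minus one) formulas mentioned in the footnote.

    import math

    def variance(scores, sample=False):
        m = sum(scores) / len(scores)           # the arithmetic mean
        ss = sum((x - m) ** 2 for x in scores)  # total sum of squares
        n = len(scores) - 1 if sample else len(scores)
        return ss / n

    scores = [98, 100, 102, 102, 103, 103, 103]  # the IQ example above
    var = variance(scores)
    sd = math.sqrt(var)  # standard deviation = square root of the variance
    print(f"variance = {var:.2f}, standard deviation = {sd:.2f}")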

What is a T score, and what are its psychometric properties?

Understanding the nature of the T score is essential to understanding the MMPI instruments (see chapter 2). Both the original MMPI and the revised versions (MMPI-2 and MMPI-A) are based on T scales, although there are significant differences between the original and later versions that will serve as the focus of subsequent questions.

The raw scores of the MMPI-based measures (e.g., the raw scores on the content scales) are translated—through statistical methods—into a T-score distribution. A T scale is a distribution of scores in which the mean, as previously described, is 50 and the standard deviation, described in the previous section, is 10.

If the T scale describes a normal distribution, the distribution is said to fall into a bell-shaped curve. In the normal distribution, 68% of the scores fall within one standard deviation of the mean; 95% fall within two standard deviations of the mean; and 99.7% fall within three standard deviations of the mean. These percentages apply only to a normal or normalized T scale and not necessarily to a linear T scale or a uniform T scale (for information about the T scale and its various forms, see chapter 2 and the Glossary).
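
For the linear case, the transformation is a simple rescaling so that the reference (normative) group has a mean of 50 and a standard deviation of 10. The sketch below uses invented raw scores for a hypothetical normative group; the MMPI-2's uniform T scores involve a further step, as discussed below.

    from statistics import mean, pstdev

    # Hypothetical raw scores for a normative group (invented).
    norm_group = [12, 15, 18, 20, 22, 25, 14, 19, 21, 17]
    m, sd = mean(norm_group), pstdev(norm_group)

    def linear_t(raw: float) -> float:
        # Linear T score: mean of 50, standard deviation of 10.
        return 50 + 10 * (raw - m) / sd

    print(f"raw 18 -> T = {linear_t(18):.1f}")  # near 50, close to the mean
    print(f"raw 25 -> T = {linear_t(25):.1f}")  # well above the mean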

Most of the original MMPI validity and clinical scales were derived according to the formula for linear T scores (Dahlstrom et al., 1972) except L and F, for which the mean values were arbitrarily set. Each of these scales was separately derived, and each has a slightly different skew. Thus, the distributions are neither uniform nor normal. That is to say, a particular T score does not fall at the same percentile rank across all scales (Colligan et al., 1983).

The original MMPI's lack of uniformity among the clinical scales has been somewhat problematic (e.g., when comparing scores on different scales). In MMPI-2 and MMPI-A, however, this lack of uniformity was resolved by developing linear T scores that did possess uniformity across given percentile values. This scale norming, referred to as uniform T scores, is described in the MMPI-2 manual (Butcher et al., 1989) and is discussed extensively by psychologists Auke Tellegen and Yossef Ben-Porath (1992b).

 
