DEBATING ABILITY TESTING - Psychology
See Chapters 5 and 6 of the attached textbook and the attached required articles, and view the video IQ: A History of Deceit at the following link:
https://fod.infobase.com/OnDemandEmbed.aspx?Token=52818&aid=18596&Plt=FOD&loid=0&w=640&h=480&ref
Present at least two viewpoints debating professional approaches to assessment used in psychology for the assigned age group, adults age 61 and older. In addition to the attached required readings, research a minimum of one peer-reviewed article on ability testing as it pertains to adults age 61 and older.
Briefly compare and discuss at least two theories of intelligence and the most up-to-date versions of two intelligence tests related to those theories.
Analyze challenges related to testing adults age 61 and older, and describe any special ethical and sociocultural issues that must be considered.
Analyze and provide evidence from research on the validity of the tests you selected that supports or opposes using those specific intelligence tests with your assigned population.
Present the pros and cons of individual versus group intelligence testing.
Summarize the implications of labeling and mislabeling adults age 61 and older as a result of testing and assessment.
Required Resources
Text
Gregory, R. J. (2014). Psychological testing: History, principles, and applications (7th ed.). Boston, MA: Pearson.
Chapter 5: Theories and Individual Tests of Intelligence and Achievement
Chapter 6: Group Tests and Controversies in Ability Testing
Articles
Ekinci, B. (2014). The Relationship among Sternberg's triarchic abilities, Gardner's multiple intelligences, and academic achievement. Social Behavior & Personality, 42(4), 625-633. doi: 10.2224/sbp.2014.42.4.625
The full-text version of this article can be accessed through the EBSCOhost database in the University of Arizona Global Campus Library. The author presents a discussion of the relationships among Sternberg’s triarchic abilities (STA), Gardner’s multiple intelligences, and the academic achievement of children attending primary schools. The article serves as an example of an empirical investigation of theoretical intellectual constructs.
Fletcher, J. M., Francis, D. J., Morris, R. D., & Lyon, G. R. (2005). Evidence-based assessment of learning disabilities in children and adolescents. Journal of Clinical Child and Adolescent Psychology, 34(3), 506-522. Retrieved from the EBSCOhost database.
The authors of the article review the reliability and validity of four approaches to the assessment of children and adolescents with learning disabilities.
Hampshire, A., Highfield, R. R., Parkin, B. L., & Owen, A. M. (2012). Fractionating human intelligence. Neuron, 76(6), 1225–1237. doi: 10.1016/j.neuron.2012.06.022
The full-text version of this article can be accessed through the ProQuest database in the University of Arizona Global Campus Library. The authors compare factor models of individual differences in performance with factor models of brain functional organization to demonstrate that different components of intelligence have analogs in distinct brain networks.
Healthwise Staff. (2014). Mental health assessment. Retrieved from http://www.webmd.com/mental-health/mental-health-assessment
This online article presents information on the purposes of mental health assessments and what examinees and family members may expect during mental health assessment visits.
McDermott, P. A., Watkins, M. W., & Rhoad, A. M. (2014). Whose IQ is it?—Assessor bias variance in high-stakes psychological assessment. Psychological Assessment, 26(1), 207-214. doi: 10.1037/a0034832
The full-text version of this article can be accessed through the EBSCOhost database in the University of Arizona Global Campus Library. Assessor bias occurs when a significant portion of the examinee’s test score actually reflects differences among the examiners who perform the assessment. The authors examine the extent of assessor bias in the administration of the Wechsler Intelligence Scale for Children—Fourth Edition (WISC–IV) and explore the implications of this phenomenon.
Rockstuhl, T., Seiler, S., Ang, S., Van Dyne, L., & Annen, H. (2011). Beyond general intelligence (IQ) and emotional intelligence (EQ): The role of cultural intelligence (CQ) on cross-border leadership effectiveness in a globalized world. Journal of Social Issues, 67(4), 825-840. Retrieved from the EBSCOhost database.
This article represents a contemporary, real-world application of intellectual testing. The authors discuss the implications of research on the relationships among general intelligence (IQ), emotional intelligence (EQ), cultural intelligence (CQ), and cross-border leadership effectiveness.
Multimedia
de Rossier, L. (Producer), & Boutinard-Rouelle, P. (Director). (2011). IQ: A history of deceit [Video file]. Retrieved from https://fod.infobase.com/OnDemandEmbed.aspx?Token=52818&aid=18596&Plt=FOD&loid=0&w=640&h=480&ref
The full version of this video is available through the Films on Demand database in the University of Arizona Global Campus Library. This program reviews the history of intelligence assessment.
5.1 DEFINITIONS OF
INTELLIGENCE
Before we discuss definitions of intelligence, we
need to clarify the nature of definition itself.
Sternberg (1986) makes a distinction between
operational and “real” definitions that is
important in this context. An operational
definition defines a concept in terms of the way
it is measured. Boring (1923) carried this
viewpoint to its extreme when he defined
intelligence as “what the tests test.” Believe it or
not, this was a serious proposal, designed
largely to short-circuit rampant and divisive
disagreements about the definition of
intelligence.
Operational definitions of intelligence suffer
from two dangerous shortcomings (Sternberg,
1986). First, they are circular. Intelligence tests
were invented to measure intelligence, not to
define it. The test designers never intended for
their instruments to define intelligence. Second,
operational definitions block further progress in
understanding the nature of intelligence,
because they foreclose discussion on the
adequacy of theories of intelligence.
This second problem—the potentially stultifying
effects of relying on operational definitions of
intelligence—casts doubt on the common
practice of affirming the concurrent validity of
new tests by correlating them with old tests. If
established tests serve as the principal criterion
against which new tests are assessed, then the
new tests will be viewed as valid only to the
extent that they correlate with the old ones.
Such a conservative practice drastically curtails
innovation. The operational definition of
intelligence does not allow for the possibility
that new tests or conceptions of intelligence
may be superior to the existing ones.
We must conclude, then, that operational
definitions of intelligence leave much to be
desired. In contrast, a real definition is one that
seeks to tell us the true nature of the thing being
defined (Robinson, 1950; Sternberg, 1986).
Perhaps the most common way—but by no
means the only way—of producing real
definitions of intelligence is to ask experts in the
field to define it.
Expert Definitions of Intelligence
Intelligence has been given many real
definitions by prominent researchers in the field.
In the following, we list several examples,
paraphrased slightly for editorial consistency.
The reader will note that many of these
definitions appeared in an early but still
influential symposium, “Intelligence and Its
Measurement,” published in the Journal of
Educational Psychology (Thorndike, 1921).
Other definitions stem from a modern update of
this early symposium, What Is Intelligence?,
edited by Sternberg and Detterman (1986).
Intelligence has been defined as the following:
• Spearman (1904, 1923): a general ability
that involves mainly the eduction of
relations and correlates.
• Binet and Simon (1905): the ability to judge
well, to understand well, to reason well.
• Terman (1916): the capacity to form
concepts and to grasp their significance.
• Pintner (1921): the ability of the individual
to adapt adequately to relatively new
situations in life.
• Thorndike (1921): the power of good
responses from the point of view of truth or
fact.
• Thurstone (1921): the capacity to inhibit
instinctive adjustments, flexibly imagine
different responses, and realize modified
instinctive adjustments into overt behavior.
• Wechsler (1939): The aggregate or global
capacity of the individual to act
purposefully, to think rationally, and to deal
effectively with the environment.
• Humphreys (1971): the entire repertoire of
acquired skills, knowledge, learning sets,
and generalization tendencies considered
intellectual in nature that are available at any
one period of time.
• Piaget (1972): a generic term to indicate the
superior forms of organization or
equilibrium of cognitive structuring used for
adaptation to the physical and social
environment.
• Sternberg (1985a, 1986): the mental
capacity to automatize information
processing and to emit contextually
appropriate behavior in response to novelty;
intelligence also includes metacomponents,
performance components, and knowledge-
acquisition components (discussed later).
• Eysenck (1986): error-free transmission of
information through the cortex.
• Gardner (1986): the ability or skill to solve
problems or to fashion products that are
valued within one or more cultural settings.
• Ceci (1994): multiple innate abilities that
serve as a range of possibilities; these
abilities develop (or fail to develop, or
develop and later atrophy) depending upon
motivation and exposure to relevant
educational experiences.
• Sattler (2001): intelligent behavior reflects
the survival skills of the species, beyond
those associated with basic physiological
processes.
The preceding list of definitions is
representative although definitely not
exhaustive. For one thing, the list is exclusively
Western and omits several cross-cultural
conceptions of intelligence. Eastern conceptions
of intelligence, for example, emphasize
benevolence, humility, freedom from
conventional standards of judgment, and doing
what is right as essential to intelligence. Many
African conceptions of intelligence place heavy
emphasis on social aspects of intelligence such
as maintaining harmonious and stable
intergroup relations (Sternberg & Kaufman,
1998). The reader can consult Bracken and
Fagan (1990), Sternberg (1994), and Sternberg
and Detterman (1986) for additional ideas.
Certainly, this sampling of views is sufficient to
demonstrate that there appear to be as many
definitions of intelligence as there are experts
willing to define it!
In spite of this diversity of viewpoints, two
themes recur again and again in expert
definitions of intelligence. Broadly speaking,
the experts tend to agree that intelligence is (1)
the capacity to learn from experience and (2) the
capacity to adapt to one’s environment. That
learning and adaptation are both crucial to
intelligence stands out with poignancy in certain
cases of mental disability in which persons fail
to possess one or the other capacity in sufficient
degree (Case Exhibit 5.1).
CASE EXHIBIT 5.1
Learning and Adaptation as Core
Functions of Intelligence
Persons with mental disability often
demonstrate the importance of experiential
learning and environmental adaptation as key
ingredients of intelligence. Consider the case
history of a 61-year-old newspaper vendor with
moderate mental retardation well known to local
mental health specialists. He was an interesting
if not eccentric gentleman who stored canned
goods in his freezer and cursed at welfare
workers who stopped by to see how he was
doing. In spite of his need for financial support
from a state agency, he was fiercely independent
and managed his own household with minimal
supervision from case workers. Thus, in some
respects he maintained a tenuous adaptation to
his environment. To earn much-needed extra
income, he sold a local 25-cent newspaper from
a streetside newsstand. He recognized that a
quarter was proper payment and had learned to
give three quarters in change for a dollar bill. He
refused all other forms of payment, an
arrangement that his customers could accept.
But one day the price of the newspaper was
increased to 35 cents, and the newspaper vendor
was forced to deal with nickels and dimes as
well as quarters and dollar bills. The amount of
learning required by this slight shift in
environmental demands exceeded his
intellectual abilities, and, sadly, he was soon out
of business. His failed efforts highlight the
essential ingredients of intelligence: learning
from experience and adaptation to the
environment.
How well do intelligence tests capture the
experts’ view that intelligence consists of
learning from experience and adaptation to the
environment? The reader should keep this
question in mind as we proceed to review major
intelligence tests in the topics that follow.
Certainly, there is cause for concern: Very few
contemporary intelligence tests appear to
require the examinee to learn something new or
to adapt to a new situation as part and parcel of
the examination process. At best, prominent
modern tests provide indirect measures of the
capacities to learn and adapt. How well they
capture these dimensions is an empirical
question that must be demonstrated through
validational research.
Layperson and Expert Conceptions of
Intelligence
Another approach to understanding a construct
is to study its popular meaning. This method is
more scientific than it may appear. Words have a
common meaning to the extent that they help
provide an effective portrayal of everyday
transactions. If laypersons can agree on its
meaning, a construct such as intelligence is in
some sense “real” and, therefore, potentially
useful. Thus, asking persons on the street,
“What does intelligence mean to you?” has
much to recommend it.
Sternberg, Conway, Ketron, and Bernstein
(1981) conducted a series of studies to
investigate conceptions of intelligence held by
American adults. In the first study, people in a
train station, entering a supermarket, and
studying in a college library were asked to list
behaviors characteristic of different kinds of
intelligence. In a second study—the only one
discussed here—both laypersons and experts
(mainly academic psychologists) rated the
importance of these behaviors to their concept
of an “ideally intelligent” person.
The behaviors central to expert and lay
conceptions of intelligence turned out to be very
similar, although not identical. In order of
importance, experts saw verbal intelligence,
problem-solving ability, and practical
intelligence as crucial to intelligence.
Laypersons regarded practical problem-solving
ability, verbal ability, and social competence to
be the key ingredients in intelligence. Of course,
opinions were not unanimous; these conceptions
represent the consensus view of each group. In
their conception of intelligence, experts place
more emphasis on verbal ability than problem
solving, whereas laypersons reverse these
priorities. Nonetheless, experts and laypersons
alike consider verbal ability and problem
solving to be essential aspects of intelligence.
As the reader will see, most intelligence tests
also accent these two competencies.
Prototypical examples would be vocabulary
(verbal ability) and block design (problem
solving) from the Wechsler scales, discussed
later. We see then that everyday conceptions of
intelligence are, in part, mirrored quite faithfully
by the content of modern intelligence tests.
Some disagreement between experts and
laypersons is also evident. Experts consider
practical intelligence (sizing up situations,
determining how to achieve goals, awareness
and interest in the world) an essential
constituent of intelligence, whereas laypersons
identify social competence (accepting others for
what they are, admitting mistakes, punctuality,
and interest in the world) as a third component.
Yet, these two nominations do share one
property in common: Contemporary tests
generally make no attempt to measure either
practical intelligence or social competence.
Partly, this reflects the psychometric difficulties
encountered in devising test items relevant to
these content areas. However, the more
influential reason intelligence tests do not
measure practical intelligence or social
competence is inertia: Test developers have
blindly accepted historically incomplete
conceptions of intelligence. Until recently, the
development of intelligence testing has been a
conservative affair, little changed since the days
of Binet and the Army Alpha and Beta tests for
World War I recruits. There are some signs that
testing practices may soon evolve, however,
with the development of innovative instruments.
For example, Sternberg and colleagues have
proposed innovative tests based on his model of
intelligence. Another interesting instrument
based on a new model of intelligence is the
Everyday Problem Solving Inventory (Cornelius
& Caspi, 1987). In this test, examinees must
indicate their typical response to everyday
problems such as failing to bring money,
checkbook, or credit card when taking a friend
to lunch.
Many theorists in the field of intelligence have
relied on factor analysis for the derivation or
validation of their theories. In fact, it is not an
overstatement to say that perhaps the majority
of the theories in this area have been impacted
by the statistical tools of factor analysis, which
provide ways to partition intelligence into its
subcomponents. One of the most compelling
theories of intelligence, the Cattell-Horn-Carroll
theory reviewed later, would not exist without
factor analysis. Thus, before summarizing
theories, we provide a brief review of this
essential statistical tool.
5.2 A PRIMER OF FACTOR
ANALYSIS
Broadly speaking, there are two forms of factor
analysis: confirmatory and exploratory. In
confirmatory factor analysis, the purpose is to
confirm that test scores and variables fit a
certain pattern predicted by a theory. For
example, if the theory underlying a certain
intelligence test prescribed that the subtests
belong to three factors (e.g., verbal,
performance, and attention factors), then a
confirmatory factor analysis could be
undertaken to evaluate the accuracy of this
prediction. Confirmatory factor analysis is
essential to the validation of many ability tests.
The central purpose of exploratory factor
analysis is to summarize the interrelationships
among a large number of variables in a concise
and accurate manner as an aid in
conceptualization (Gorsuch, 1983). For
instance, factor analysis may help a researcher
discover that a battery of 20 tests represents
only four underlying variables, called factors.
The smaller set of derived factors can be used to
represent the essential constructs that underlie
the complete group of variables.
Perhaps a simple analogy will clarify the nature
of factors and their relationship to the variables
or tests from which they are derived. Consider
the track-and-field decathlon, a mixture of 10
diverse events including sprints, hurdles, pole
vault, shot put, and distance races, among
others. In conceptualizing the capability of the
individual decathlete, we do not think
exclusively in terms of the participant’s skill in
specific events. Instead, we think in terms of
more basic attributes such as speed, strength,
coordination, and endurance, each of which is
reflected to a different extent in the individual
events. For example, the pole vault requires
speed and coordination, while hurdle events
demand coordination and endurance. These
inferred attributes are analogous to the
underlying factors of factor analysis. Just as the
results from the 10 events of a decathlon may
boil down to a small number of underlying
factors (e.g., speed, strength, coordination, and
endurance), so too may the results from a
battery of 10 or 20 ability tests reflect the
operation of a small number of basic cognitive
attributes (e.g., verbal skill, visualization,
calculation, and attention, to cite a hypothetical
list). This example illustrates the goal of factor
analysis: to help produce a parsimonious
description of large, complex data sets.
We will illustrate the essential concepts of factor
analysis by pursuing a classic example
concerned with the number and kind of factors
that best describe student abilities. Holzinger
and Swineford (1939) gave 24 ability-related
psychological tests to 145 junior high school
students from Forest Park, Illinois. The factor
analysis described later was based on methods
outlined in Kinnear and Gray (1997).
It should be intuitively obvious to the reader that
any large battery of ability tests will reflect a
smaller number of basic, underlying abilities
(factors). Consider the 24 tests depicted in Table
5.1. Surely some of these tests measure common
underlying abilities. For example, we would
expect Sentence Completion, Word
Classification, and Word Meaning (variables 7,
8, and 9) to assess a factor of general language
ability of some kind. In like manner, other
groups of tests seem likely to measure common
underlying abilities—but how many abilities or
factors? And what is the nature of these
underlying abilities? Factor analysis is the ideal
tool for answering these questions. We follow
the factor analysis of the Holzinger and
Swineford (1939) data from beginning to end.
TABLE 5.1 The 24 Ability Tests Used by
Holzinger and Swineford (1939)
1.Visual Perception
2.Cubes
3.Paper Form Board
4.Flags
5.General Information
6.Paragraph Comprehension
7.Sentence Completion
8.Word Classification
9.Word Meaning
10.Add Digits
11.Code (Perceptual Speed)
12.Count Groups of Dots
13.Straight and Curved Capitals
14.Word Recognition
15.Number Recognition
16.Figure Recognition
17.Object-Number
18.Number-Figure
19.Figure-Word
20.Deduction
21.Numerical Puzzles
22.Problem Reasoning
23.Series Completion
24.Arithmetic Problems
The Correlation Matrix
The beginning point for every factor analysis is
the correlation matrix, a complete table of
intercorrelations among all the variables. The
correlations between the 24 ability variables
discussed here can be found in Table 5.2. The
reader will notice that variables 7, 8, and 9 do,
indeed, intercorrelate quite strongly
(correlations of .62, .69, and .53), as we
suspected earlier. This pattern of
intercorrelations is presumptive evidence that
these variables measure something in common;
that is, it appears that these tests reflect a
common underlying factor. However, this kind
of intuitive factor analysis based on a visual
inspection of the correlation matrix is hopelessly
limited; there are just too many intercorrelations
for the viewer to discern the underlying patterns
for all the variables. Here is where factor
analysis can be helpful. Although we cannot
elucidate the mechanics of the procedure, factor
analysis relies on modern high-speed computers
to search the correlation matrix according to
objective statistical rules and determine the
smallest number of factors needed to account
for the observed pattern of intercorrelations. The
analysis also produces the factor matrix, a table
showing the extent to which each test loads on
(correlates with) each of the derived factors, as
discussed in the following section.
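To make these steps concrete, the following is a minimal sketch in Python (using NumPy) of how a correlation matrix is computed from a matrix of test scores and how factor loadings can then be extracted by the principal-components method. The scores array here is simulated stand-in data, not the Holzinger and Swineford values, and the choice of five factors simply mirrors the example discussed in the text.

```python
import numpy as np

# Simulated stand-in data: rows are examinees, columns are ability tests.
# (A real analysis would use the 145 x 24 Holzinger and Swineford scores.)
rng = np.random.default_rng(0)
scores = rng.normal(size=(145, 24))

# Step 1: the correlation matrix, the starting point of every factor analysis.
R = np.corrcoef(scores, rowvar=False)          # 24 x 24 table of intercorrelations

# Step 2: extract factors. In the principal-components method, the loadings
# are the leading eigenvectors of R scaled by the square roots of their
# eigenvalues, so each loading is the correlation of a test with a factor.
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]          # largest eigenvalues first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

n_factors = 5                                  # mirrors the five-factor example in the text
loadings = eigenvectors[:, :n_factors] * np.sqrt(eigenvalues[:n_factors])
print(loadings.shape)                          # (24, 5): one loading per test per factor
```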
The Factor Matrix and Factor Loadings
The factor matrix consists of a table of
correlations called factor loadings. The factor
loadings (which can take on values from −1.00
to +1.00) indicate the weighting of each variable
on each factor. For example, the factor matrix in
Table 5.3 shows that five factors (labeled I, II,
III, IV, and V) were derived from the analysis.
Note that the first variable, Series Completion,
has a strong positive loading of .71 on factor I,
indicating that this test is a reasonably good
index of factor I. Note also that Series
Completion has a modest negative loading of
−.11 on factor II, indicating that, to a slight
extent, it measures the opposite of this factor;
that is, high scores on Series Completion tend to
signify low scores on factor II, and vice versa.
TABLE 5.2 The Correlation Matrix for 24 Ability Variables
[Lower-triangular matrix of intercorrelations among all 24 ability tests, decimals omitted; note, for example, the strong intercorrelations of .62, .69, and .53 among variables 7, 8, and 9.]
Source: Reprinted with permission from Holzinger, K., & Harman, H. (1941). Factor analysis: A synthesis of factorial methods. Chicago: University of Chicago Press. Copyright © 1941 The University of Chicago Press.
The factors may seem quite mysterious, but in
reality they are conceptually quite simple. A
factor is nothing more than a weighted linear
sum of the variables; that is, each factor is a
precise statistical combination of the tests used
in the analysis. In a sense, a factor is produced
by “adding in” carefully determined portions of
some tests and perhaps “subtracting out”
fractions of other tests. What makes the factors
special is the elegant analytical methods used to
derive them. Several different methods exist.
These methods differ in subtle ways beyond the
scope of this text; the reader can gather a sense
of the differences by examining names of
procedures: principal components factors,
principal axis factors, method of unweighted
least squares, maximum-likelihood method,
image factoring, and alpha factoring
(Tabachnick & Fidell, 1989). Most of the
methods yield highly similar results.
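Because a factor is literally a weighted linear sum of the variables, the idea can be shown in one line of code. The sketch below is a hypothetical illustration: the array names are invented, and the weights themselves would come from whichever extraction method was used.

```python
import numpy as np

def factor_scores(z_scores, weights):
    """Each factor score is a weighted linear sum of the standardized test scores.

    z_scores: examinees x tests array of standardized scores (illustrative name).
    weights:  tests x factors array of factor-score weights from the factor solution.
    """
    return z_scores @ weights
```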
The factor loadings depicted in Table 5.3 are
nothing more than correlation coefficients
between variables and factors. These
correlations can be interpreted as showing the
weight or loading of each factor on each
variable. For example, variable 9, the test of
Word Meaning, has a very strong loading (.69)
on factor I, modest negative loadings (−.45 and
−.29) on factors II and III, and negligible
loadings (.08 and .00) on factors IV and V.
TABLE 5.3 The Principal Axes Factor Analysis for 24 Variables

                                    Factors
                                  I      II     III     IV      V
23 Series Completion            0.71  -0.11   0.14   0.11   0.07
 8 Word Classification          0.70  -0.24  -0.15  -0.11  -0.13
 5 General Information          0.70  -0.32  -0.34  -0.04   0.08
 9 Word Meaning                 0.69  -0.45  -0.29   0.08   0.00
 6 Paragraph Comprehension      0.69  -0.42  -0.26   0.08  -0.01
 7 Sentence Completion          0.68  -0.42  -0.36  -0.05  -0.05
24 Arithmetic Problems          0.67   0.20  -0.23  -0.04  -0.11
20 Deduction                    0.64  -0.19   0.13   0.06   0.28
22 Problem Reasoning            0.64  -0.15   0.11   0.05  -0.04
21 Numerical Puzzles            0.62   0.24   0.10  -0.21   0.16
13 Straight and Curved Capitals 0.62   0.28   0.02  -0.36  -0.07
 1 Visual Perception            0.62  -0.01   0.42  -0.21  -0.01
11 Code (Perceptual Speed)      0.57   0.44  -0.20   0.04   0.01
18 Number-Figure                0.55   0.39   0.20   0.15  -0.11
16 Figure Recognition           0.53   0.08   0.40   0.31   0.19
 4 Flags                        0.51  -0.18   0.32  -0.23  -0.02
17 Object-Number                0.49   0.27  -0.03   0.47  -0.24
 2 Cubes                        0.40  -0.08   0.39  -0.23   0.34
12 Count Groups of Dots         0.48   0.55  -0.14  -0.33   0.11
10 Add Digits                   0.47   0.55  -0.45  -0.19   0.07
 3 Paper Form Board             0.44  -0.19   0.48  -0.12  -0.36
14 Word Recognition             0.45   0.09  -0.03   0.55   0.16
15 Number Recognition           0.42   0.14   0.10   0.52   0.31
19 Figure-Word                  0.47   0.14   0.13   0.20  -0.61

Geometric Representation of Factor Loadings
It is customary to represent the first two or three
factors as reference axes in two- or three-
dimensional space. Within this framework the
factor loadings for each variable can be plotted
for examination. In our example, five factors
were discovered, too many for simple
visualization. Nonetheless, we can illustrate the
value of geometric representation by
oversimplifying somewhat and depicting just
the first two factors (Figure 5.1). In this graph,
each of the 24 tests has been plotted against the
two factors that correspond to axes I and II. The
reader will notice that the factor loadings on the
first factor (I) are uniformly positive, whereas
the factor loadings on the second factor (II)
consist of a mixture of positive and negative.
FIGURE 5.1 Geometric Representation of
the First Two Factors from 24 Ability Tests
The Rotated Factor Matrix
An important point in this context is that the
position of the reference axes is arbitrary. There
is nothing to prevent the researcher from
rotating the axes so that they produce a more
sensible fit with the factor loadings. For
example, the reader will notice in Figure 5.1 that
tests 6, 7, and 9 (all language tests) cluster
together. It would certainly clarify the
interpretation of factor I if it were to be
redirected near the center of this cluster (Figure
5.2). This manipulation would also bring factor
II alongside interpretable tests 10, 11, and 12
(all number tests).
Although rotation can be conducted manually
by visual inspection, it is more typical for
researchers to rely on one or more objective
statistical criteria to produce the final rotated
factor matrix. Thurstone’s (1947) criteria of
positive manifold and simple structure are
commonly applied. In a rotation to positive
manifold, the computer program seeks to
eliminate as many of the negative factor
loadings as possible. Negative factor loadings
make little sense in ability testing, because they
imply that high scores on a factor are correlated
with poor test performance. In a rotation to
simple structure, the computer program seeks
to simplify the factor loadings so that each test
has significant loadings on as few factors as
possible. The goal of both criteria is to produce
a rotated factor matrix that is as straightforward
and unambiguous as possible.
FIGURE 5.2 Geometric Representation of
the First Two Rotated Factors from 24
Ability Tests
The rotated factor matrix for this problem is
shown in Table 5.4. The particular method of
rotation used here is called varimax rotation.
Varimax should not be used if the theoretical
expectation suggests that a general factor may
occur. Should we expect a general factor in the
analysis of ability tests? The answer is as much
a matter of faith as of science. One researcher
may conclude that a general factor is likely and,
therefore, pursue a different type of rotation. A
second researcher may be comfortable with a
Thurstonian viewpoint and seek multiple ability
factors using a varimax rotation. We will
explore this issue in more detail later, but it is
worth pointing out here that a researcher
encounters many choice points in the process of
conducting a factor analysis. It is not surprising,
then, that different researchers may reach
different conclusions from factor analysis, even
when they are analyzing the same data set.
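For the interested reader, the following is a minimal Python sketch (using NumPy) of the standard iterative algorithm for varimax rotation. The loadings array is assumed to be an unrotated factor matrix such as the one extracted earlier; production statistical packages implement the same idea with additional options and refinements.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Rotate a factor loading matrix toward simple structure (varimax criterion)."""
    p, k = loadings.shape
    rotation = np.eye(k)
    total = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Update the rotation from the gradient of the varimax criterion via SVD.
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3 - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p)
        )
        rotation = u @ vt
        new_total = s.sum()
        if new_total < total * (1 + tol):   # stop when the criterion no longer improves
            break
        total = new_total
    return loadings @ rotation

# Usage: rotated = varimax(loadings), where `loadings` is the unrotated factor matrix.
```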
The Interpretation of Factors
Table 5.4 indicates that five factors underlie the
intercorrelations of the 24 ability tests. But what
shall we call these factors? The reader may find
the answer to this question disquieting, because
at this juncture we leave the realm of cold,
objective statistics and enter the arena of
judgment, insight, and presumption. In order to
interpret or name a factor, the researcher must
make a reasoned judgment about the common
processes and abilities shared by the tests with
strong loadings on that factor. For example, in
Table 5.4 it appears that factor I is verbal ability,
because the variables with high loadings stress
verbal skill (e.g., Sentence Completion loads
.86, Word Meaning loads .84, and Paragraph
Comprehension loads .81). The variables with
low loadings also help sharpen the meaning of
factor I. For example, factor I is not related to
numerical skill (Numerical Puzzles loads .18) or
spatial skill (Paper Form Board loads .16).
Using a similar form of inference, it appears that
factor II is mainly numerical ability (Add Digits
loads .85, Count Groups of Dots loads .80).
Factor III is less certain but appears to be a
visual-perceptual capacity, and factor IV
appears to be a measure of recognition. We
would need to analyze the single test on factor V
(Figure-Word) to surmise the meaning of this
factor.
TABLE 5.4 The Rotated Varimax Factor Matrix for 24 Ability Variables

                                    Factors
                                  I      II     III     IV      V
 7 Sentence Completion          0.86   0.15   0.13   0.03   0.07
 9 Word Meaning                 0.84   0.06   0.15   0.18   0.08
 6 Paragraph Comprehension      0.81   0.07   0.16   0.18   0.10
 5 General Information          0.79   0.22   0.16   0.12  -0.02
 8 Word Classification          0.65   0.22   0.28   0.03   0.21
22 Problem Reasoning            0.43   0.12   0.38   0.23   0.22
10 Add Digits                   0.18   0.85  -0.10   0.09  -0.01
12 Count Groups of Dots         0.02   0.80   0.20   0.03   0.00
11 Code (Perceptual Speed)      0.18   0.64   0.05   0.30   0.17
13 Straight and Curved Capitals 0.19   0.60   0.40  -0.05   0.18
24 Arithmetic Problems          0.41   0.54   0.12   0.16   0.24
21 Numerical Puzzles            0.18   0.52   0.45   0.16   0.02
18 Number-Figure                0.00   0.40   0.28   0.38   0.36
 1 Visual Perception            0.17   0.21   0.69   0.10   0.20
 2 Cubes                        0.09   0.09   0.65   0.12  -0.18
 4 Flags                        0.26   0.07   0.60  -0.01   0.15
 3 Paper Form Board             0.16  -0.09   0.57  -0.05   0.49
23 Series Completion            0.42   0.24   0.52   0.18   0.11
20 Deduction                    0.43   0.11   0.47   0.35  -0.07
15 Number Recognition           0.11   0.09   0.12   0.74  -0.02
14 Word Recognition             0.23   0.10   0.00   0.69   0.10
16 Figure Recognition           0.07   0.07   0.46   0.59   0.14
17 Object-Number                0.15   0.25  -0.06   0.52   0.49
19 Figure-Word                  0.16   0.16   0.11   0.14   0.77

Note: Boldfaced entries signify subtests loading strongly on each factor.

These results illustrate a major use of factor analysis, namely, the identification of a small number of marker tests from a large test battery. Rather than using a cumbersome battery of 24 tests, a researcher could gain nearly the same information by carefully selecting several tests with strong loadings on the five factors. For example, the first factor is well represented by
test 7, Sentence Completion (.86) and test 9,
Word Meaning (.84); the second factor is
reflected in …
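As a rough illustration of how marker tests might be identified in practice, the short Python sketch below scans a rotated loading matrix and reports the tests loading strongly on each factor. The function name, the threshold of .60, and the test_names argument are purely illustrative conventions, not part of the published analysis.

```python
import numpy as np

def marker_tests(rotated_loadings, test_names, threshold=0.60):
    """For each factor, list the tests whose rotated loadings exceed a threshold.

    These high-loading tests are the 'markers' a researcher would inspect when
    naming a factor (e.g., a factor marked by Sentence Completion and Word
    Meaning is plausibly verbal ability). The .60 cutoff is arbitrary.
    """
    markers = {}
    for factor in range(rotated_loadings.shape[1]):
        strong = np.where(np.abs(rotated_loadings[:, factor]) >= threshold)[0]
        markers[f"Factor {factor + 1}"] = [test_names[i] for i in strong]
    return markers
```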
CHAPTER 6
Group Tests and
Controversies in Ability
Testing
TOPIC 6A Group Tests of Ability and
Related Concepts
6.1 Nature, Promise, and Pitfalls of Group Tests
6.2 Group Tests of Ability
6.3 Multiple Aptitude Test Batteries
6.4 Predicting College Performance
6.5 Postgraduate Selection Tests
6.6 Educational Achievement Tests
The practical success of early intelligence
scales such as the 1905 Binet-Simon test
motivated psychologists and educators to
develop instruments that could be administered
simultaneously to large numbers of examinees.
Test developers were quick to realize that group
tests allowed for the efficient evaluation of
dozens or hundreds of examinees at the same
time. As reviewed in an earlier chapter, one of
the first uses of group tests was for screening
and assignment of military personnel during
World War I. The need to quickly test thousands
of Army recruits inspired psychologists in the
United States, led by Robert M. Yerkes, to make
rapid advances in psychometrics and test
development (Yerkes, 1921). Many new
applications followed immediately—in
education, industry, and other fields. In Topic
6A, Group Tests of Ability and Related
Concepts, we introduce the reader to the varied
applications of group tests and also review a
sampling of typical instruments. In addition, we
explore a key question raised by the
consequential nature of these tests—can
examinees boost their scores significantly by
taking targeted test preparation courses? This is
but one of many unexpected issues raised by the
widespread use of group tests. In Topic 6B, Test
Bias and Other Controversies, we continue a
reflective theme by looking into test bias and
other contentious issues in testing.
6.1 NATURE, PROMISE, AND
PITFALLS OF GROUP TESTS
Group tests serve many purposes, but the vast
majority can be assigned to one of three types:
ability, aptitude, or achievement tests. In the real
world, the distinction among these kinds of tests
often is quite fuzzy (Gregory, 1994a). These
instruments differ mainly in their functions and
applications, less so in actual test content. In
brief, ability tests typically sample a broad
assortment of proficiencies in order to estimate
current intellectual level. This information
might be used for screening or placement
purposes, for example, to determine the need for
individual testing or to establish eligibility for a
gifted and talented program. In contrast,
aptitude tests usually measure a few
homogeneous segments of ability and are
designed to predict future performance.
Predictive validity is foundational to aptitude
tests, and often they are used for institutional
selection purposes. Finally, achievement tests
assess current skill attainment in relation to the
goals of school and training programs. They are
designed to mirror educational objectives in
reading, writing, math, and other subject areas.
Although often used to identify educational
attainment of students, they also function to
evaluate the adequacy of school educational
programs.
Whatever their application, group tests differ
from individual tests in five ways:
• Multiple-choice versus open-ended format
• Objective machine scoring versus examiner
scoring
• Group versus individualized administration
• Applications in screening versus remedial
planning
• Huge versus merely large standardization
samples
These differences allow for great speed and cost
efficiency in group testing, but a price is paid
for these advantages.
Although the early psychometric pioneers
embraced group testing wholeheartedly, they
recognized fully the nature of their Faustian
bargain: Psychologists had traded the soul of the
individual examinee in return for the benefits of
mass testing. Whipple (1910) summed up the
advantages of group testing but also pointed to
the potential perils:
Most mental tests may be administered either to
individuals or to groups. Both methods
have advantages and disadvantages. The
group method has, of course, the particular
merit of economy of time; a class of 50 or
100 children may take a test in less than a
fiftieth or a hundredth of the time needed to
administer the same test individually.
Again, in certain comparative studies, e.g.,
of the effects of a week’s vacation upon the
mental efficiency of school children, it
becomes imperative that all S’s should take
the tests at the same time. On the other
hand, there are almost sure to be some S’s
in every group that, for one reason or
another, fail to follow instructions or to
execute the test to the best of their ability.
The individual method allows E to detect
these cases, and in general, by the exercise
of personal supervision, to gain, as noted
above, valuable information concerning S’s
attitude toward the test.
In sum, group testing poses two interrelated
risks: (1) some examinees will score far below
their true ability, owing to motivational
problems or difficulty following directions and
(2) invalid scores will not be recognized as
such, with undesirable consequences for these
atypical examinees. There is really no simple
way to entirely avoid these risks, which are part
of the trade-off for the efficiency of group
testing. However, it is possible to minimize the
potentially negative consequences if examiners
scrutinize very low scores with skepticism and
recommend individual testing for these cases.
We turn now to an analysis of group tests in a
variety of settings, including cognitive tests for
schools and clinics, placement tests for career
and military evaluation, and aptitude tests for
college and postgraduate selection.
6.2 GROUP TESTS OF
ABILITY
Multidimensional Aptitude Battery-II
(MAB-II)
The Multidimensional Aptitude Battery-II
(MAB-II; Jackson, 1998) is a recent group
intelligence test designed to be a paper-and-
pencil equivalent of the WAIS-R. As the reader
will recall, the WAIS-R is a highly respected
instrument (now replaced by the WAIS-III), in
its time the most widely used of the available
adult intelligence tests. Kaufman (1983) noted
that the WAIS-R was “the criterion of adult
intelligence, and no other instrument even
comes close.” However, a highly trained
professional needs about 1½ hours just to
administer the Wechsler adult test to a single
person. Because professional time is at a
premium, a complete Wechsler intelligence
assessment—including administration, scoring,
and report writing—easily can cost hundreds of
dollars. Many examiners have long suspected
that an appropriate group test, with the attendant
advantages of objective scoring and
computerized narrative report, could provide an
equally valid and much less expensive
alternative to individual testing for most
persons.
The MAB-II was designed to produce subtests
and factors parallel to the WAIS-R but
employing a multiple-choice format capable of
being computer scored. The apparent goal in
designing this test was to produce an instrument
that could be administered to dozens or
hundreds of persons by one examiner (and
perhaps a few proctors) with minimal training.
In addition, the MAB-II was designed to yield
IQ scores with psychometric properties similar
to those found on the WAIS-R. Appropriate for
examinees from ages 16 to 74, the MAB-II
yields 10 subtest scores, as well as Verbal,
Performance, and Full Scale IQs.
Although it consists of original test items, the
MAB-II is mainly a sophisticated subtest-by-
subtest clone of the WAIS-R. The 10 subtests
are listed as follows:

Verbal            Performance
Information       Digit Symbol
Comprehension     Picture Completion
Arithmetic        Spatial
Similarities      Picture Arrangement
Vocabulary        Object Assembly
The reader will notice that Digit Span from the
WAIS-R is not included on the MAB-II. The
reason for this omission is largely practical:
There would be no simple way to present a
Digit-Span-like subtest in paper-and-pencil
format. In any case, the omission is not serious.
Digit Span has the lowest correlation with
overall WAIS-R IQ, and it is widely recognized
that this subtest makes a minimal contribution to
the measurement of general intelligence.
The only significant deviation from the WAIS-R
is the replacement of Block Design with a
Spatial subtest on the MAB-II. In the Spatial
subtest, examinees must mentally perform
spatial rotations of figures and select one of five
possible rotations presented as their answer
(Figure 6.1). Only mental rotations are involved
(although "flipped-over" versions of the original
stimulus are included as distractor items). The
advanced items are very complex and
demanding.
The items within each of the 10 MAB-II
subtests are arranged in order of increasing
difficulty, beginning with questions and
problems that most adolescents and adults find
quite simple and proceeding upward to items
that are so difficult that very few persons get
them correct. There is no penalty for guessing
and examinees are encouraged to respond to
every item within the time limit. Unlike the
WAIS-R in which the verbal subtests are
untimed power measures, every MAB-II subtest
incorporates elements of both power and speed:
Examinees are allowed only seven minutes to
work on each subtest. Including instructions, the
Verbal and Performance portions of the MAB-II
each take about 50 minutes to administer.
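As an aside, "no penalty for guessing" simply means that each subtest score is the raw count of correct answers. The hedged Python sketch below contrasts this number-right scoring with the classical correction-for-guessing formula, R − W/(k − 1), shown only for comparison; the function names and the five-option assumption are illustrative, not taken from the MAB-II manual.

```python
def number_right_score(responses, key):
    """Score as the simple count of correct answers (no penalty for guessing)."""
    return sum(r == k for r, k in zip(responses, key))

def corrected_for_guessing(responses, key, n_options=5):
    """Classical formula score R - W/(k - 1), shown only for contrast; the
    MAB-II uses number-right scoring, so guessing is never penalized.
    Omitted items (None) count as neither right nor wrong."""
    right = sum(r == k for r, k in zip(responses, key))
    wrong = sum(r is not None and r != k for r, k in zip(responses, key))
    return right - wrong / (n_options - 1)
```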
The MAB-II is a relatively minor revision of the
MAB, and the technical features of the two
versions are nearly identical. A great deal of
psychometric information is available for the
original version, which we report here. With
regard to reliability, the results are generally
quite impressive. For example, in one study of
over 500 adolescents ranging in age from 16 to
20, the internal consistency reliability of Verbal,
Performance, and Full Scale IQs was in the high
.90s. Test–retest data for this instrument also
excel. In a study of 52 young psychiatric
patients, the individual subtests showed
reliabilities that ranged from .83 to .97 (median
of .90) for the Verbal scale and from .87 to .94
(median of .91) for the Performance scale
(Jackson, 1984). These results compare quite
favorably with the psychometric standards
reported for the WAIS-R.
Factor analyses of the MAB-II are broadly
supportive of the construct validity of this
instrument and its predecessor (Lee, Wallbrown,
& Blaha, 1990). Most recently, Gignac (2006)
examined the factor structure of the MAB-II
using a series of confirmatory factor analyses
with data on 3,121 individuals reported in
Jackson (1998). The best fit to the data was
provided by a nested model consisting of a first-
order general factor, a first-order Verbal
Intelligence factor, and a first-order
Performance Intelligence factor. The one caveat
of this study was that Arithmetic did not load
specifically on the Verbal Intelligence factor
independent of its contribution to the general
factor.
FIGURE 6.1 Demonstration Items from
Three Performance Tests of the
Multidimensional Aptitude Battery-II (MAB)
Source: Reprinted with permission from Jackson, D. N.
(1984a). Manual for the Multidimensional Aptitude
Battery. Port Huron, MI: Sigma Assessment Systems,
Inc. (800) 265–1285.
Other researchers have noted the strong
congruence between factor analyses of the
WAIS-R (with Digit Span removed) and the
MAB. Typically, separate Verbal and
Performance factors emerge for both tests
(Wallbrown, Carmin, & Barnett, 1988). In a
large sample of inmates, Ahrens, Evans, and
Barnett (1990) observed validity-confirming
changes in MAB scores in relation to education
level. In general, with the possible exception
that Arithmetic does not contribute reliably to
the Verbal factor, there is good justification for
the use of separate Verbal and Performance
scales on this test.
In general, the validity of this test rests upon its
very strong physical and empirical resemblance
to its parent test, the WAIS-R. Correlational data
between MAB and WAIS-R scores are crucial in
this regard. For 145 persons administered the
MAB and WAIS-R in counterbalanced fashion,
correlations between subtests ranged from .44
(Spatial/Block Design) to .89 (Arithmetic and
Vocabulary), with a median of .78. WAIS-R and
MAB IQ correlations were very healthy,
namely, .92 for Verbal IQ, .79 for Performance
IQ, and .91 for Full Scale IQ (Jackson, 1984a).
With only a few exceptions, correlations
between MAB and WAIS-R scores exceed those
between the WAIS and the WAIS-R. Carless
(2000) reported a similar, strong overlap
between MAB scores and WAIS-R scores in a
study of 85 adults for the Verbal, Performance,
and Full Scale IQ scores. However, she found
that 4 of the 10 MAB subtests did not correlate
with the WAIS-R subscales they were designed
to represent, suggesting caution in using this
instrument to obtain detailed information about
specific abilities.
Chappelle et al. (2010) obtained MAB-II scores
for military personnel in an elite training
program for AC-130 gunship operators. The
officers who passed training (N = 59) and those
who failed training (N = 20) scored above
average (mean Full Scale IQs of 112.5 and
113.6, respectively), but there were no
significant differences between the two groups
on any of the test indices. This is a curious
result insofar as IQ typically demonstrates at
least mild predictive potential for real-world
vocational outcomes. Further research on the
MAB-II as a predictor of real-world results
would be desirable.
The MAB-II shows great promise in research,
career counseling, and personnel selection. In
addition, this test could function as a screening
instrument in clinical settings, as long as the
examiner views low scores as a basis for follow-
up testing with an individual intelligence test.
Examiners must keep in mind that the MAB-II
is a group test and, therefore, carries with it the
potential for misuse in individual cases. The
MAB-II should not be used in isolation for
diagnostic decisions or for placement into
programs such as classes for intellectually gifted
persons.
A Multilevel Battery: The Cognitive
Abilities Test (CogAT)
One important function of psychological testing
is to assess students’ abilities that are
prerequisite to traditional classroom-based
learning. In designing tests for this purpose, the
psychometrician must contend with the obvious
and nettlesome problem that school-aged
children differ hugely in their intellectual
abilities. For example, a test appropriate for a
sixth grader will be much too easy for a tenth
grader, yet impossibly difficult for a third
grader.
The answer to this dilemma is a multilevel
battery, a series of overlapping tests. In a multi-
level battery, each group test is designed for a
specific age or grade level, but adjacent tests
possess some common content. Because of the
overlapping content with adjacent age or grade
levels, each test possesses a suitably low floor
and high ceiling for proper assessment of
students at both extremes of ability. Virtually
every school system in the United States uses at
least one nationally normed multilevel battery.
The Cognitive Abilities Test (CogAT) is one of
the best school-based test batteries in current
use (Lohman & Hagen, 2001). A recent revision
of the test is the CogAT Multilevel Edition,
Form 6, released in 2001. Norms from 2005 are
also available. We discuss this instrument in
some detail.
The CogAT evolved from the Lorge-Thorndike
Intelligence Tests, one of the first group tests of
intelligence intended for widespread use within
school systems. The CogAT is primarily a
measure of scholastic ability but also
incorporates a nonverbal reasoning battery with
items that bear no direct relation to formal
school instruction. The two primary batteries,
suitable for students in kindergarten through
third grade, are briefly discussed at the end of
this section. Here we review the multilevel
edition intended for students in 3rd through 12th
grade.
The nine subtests of the multilevel CogAT are
grouped into three areas: Verbal, Quantitative,
and Nonverbal, each including three subtests.
Representative items for the subtests of the
CogAT are depicted in Figure 6.2. The tests on
the Verbal Battery evaluate verbal skills and
reasoning strategies (inductive and deductive)
needed for effective reading and writing. The
tests on the Quantitative Battery appraise
quantitative skills important for mathematics
and other disciplines. The Nonverbal Battery
can be used to estimate cognitive level of
students with limited reading skill, poor English
proficiency, or inadequate educational exposure.
For each CogAT subtest, items are ordered by
difficulty level in a single test booklet.
However, entry and exit points differ for each of
eight overlapping levels (A through H). In this
manner, grade-appropriate items are provided
for all examinees.
The subtests are strictly timed, with limits that
vary from 8 to 12 minutes. Each of the three
batteries can be administered in less than an
hour. However, the manual recommends three
successive testing days for younger children.
For older children, two batteries should be
administered the first day, with a single testing
period the next.
FIGURE 6.2 Subtests and Representative
Items of the Cognitive Abilities Test, Form 6
Note: These items resemble those on the CogAT 6.
Correct answers: 1: B. yogurt (the only dairy product).
2: D. swim (fish swim in the ocean). 3: E. bottom (the
opposite of top). 4: A. I is greater than II (4 is greater
than 2). 5: C. 26 (the algorithm is add 10, subtract 5,
add 10 . . .). 6: A. −1 (the only answer that fits). 7: A
(four-sided shape that is filled in). 8: D (same shape,
bigger to smaller). 9: E (correct answer).
Raw scores for each battery can be transformed
into an age-based normalized standard score
with mean of 100 and standard deviation of 15.
In addition, percentile ranks and stanines for age
groups and grade level are also available.
Interpolation was used to determine fall, winter,
and spring grade-level norms.
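As a rough numerical illustration of these derived scores, the Python sketch below converts a raw score into a deviation-style standard score (mean 100, SD 15), an approximate percentile rank, and a stanine, under the simplifying assumption of a normally distributed norm group. Actual CogAT norms are empirically tabled and interpolated rather than computed from a formula, so the example values are illustrative only.

```python
import math

def standard_score(raw, norm_mean, norm_sd):
    """Convert a raw score to a deviation-style standard score (mean 100, SD 15)."""
    z = (raw - norm_mean) / norm_sd
    return round(100 + 15 * z)

def percentile_rank(raw, norm_mean, norm_sd):
    """Percentile rank under a normal approximation of the norm group."""
    z = (raw - norm_mean) / norm_sd
    return 100 * 0.5 * (1 + math.erf(z / math.sqrt(2)))

def stanine(pct):
    """Map a percentile rank onto the 1-9 stanine scale."""
    cutoffs = [4, 11, 23, 40, 60, 77, 89, 96]   # conventional stanine boundaries
    return 1 + sum(pct > c for c in cutoffs)

# Example: a raw score of 62 in a norm group with mean 50 and SD 10 yields a
# standard score near 118, a percentile rank near 88, and a stanine of 7.
```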
The CogAT was co-normed (standardized
concurrently) with two achievement tests, the
Iowa Tests of Basic Skills and the Iowa Tests of
Educational Development. Concurrent
standardization with achievement measures is a
common and desirable practice in the norming
of multilevel intelligence tests. The particular
virtue of joint norming is that the expected
correspondence between intelligence and
achievement scores is determined with great
precision. As a consequence, examiners can
more accurately identify underachieving
students in need of remediation or further
assessment for potential learning disability.
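One common way to operationalize this expected correspondence is a regression-based discrepancy: predicted achievement is estimated from ability using the ability-achievement correlation, and students falling well below prediction are flagged for follow-up. The Python sketch below illustrates the idea; the correlation of .75 and the flagging cutoff mentioned in the comments are illustrative assumptions, not values from the CogAT manual.

```python
import math

def achievement_discrepancy(ability_z, achievement_z, r_ability_achievement=0.75):
    """Regression-based discrepancy between expected and observed achievement.

    Expected achievement (in z units) is r * ability_z; the discrepancy is
    expressed in standard-error units so that unusually large shortfalls
    (e.g., below about -1.5) can flag possible underachievement worth
    follow-up assessment.
    """
    expected = r_ability_achievement * ability_z
    se = math.sqrt(1.0 - r_ability_achievement ** 2)
    return (achievement_z - expected) / se

# Example: a student one SD above the mean in ability (z = 1.0) but at the
# mean in reading achievement (z = 0.0) shows a discrepancy of about -1.13.
```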
The reliability of the CogAT is exceptionally
good. In previous editions, the Kuder-
Richardson-20 reliability estimates for the
multilevel batteries averaged .94 (Verbal), .92
(Quantitative), and .93 (Nonverbal) across all
grade levels. The six-month test–retest
reliabilities for alternate forms ranged from .85
to .93 (Verbal), .78 to .88 (Quantitative), and .81
to .89 (Nonverbal).
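For reference, the Kuder-Richardson-20 coefficient cited above is an internal-consistency estimate for dichotomously scored items, computed as KR-20 = [k/(k − 1)] × (1 − Σ p(1 − p) / σ²), where p is the proportion passing each item and σ² is the variance of total scores. A minimal Python sketch with an illustrative 0/1 response matrix follows.

```python
import numpy as np

def kr20(item_responses):
    """Kuder-Richardson formula 20 for dichotomously scored (0/1) items.

    item_responses: examinees x items array of 0s and 1s (illustrative name).
    """
    k = item_responses.shape[1]
    p = item_responses.mean(axis=0)                  # proportion passing each item
    item_variance = (p * (1 - p)).sum()              # sum of item variances
    total_variance = item_responses.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variance / total_variance)
```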
The manual provides a wealth of information on
content, criterion-related, and construct validity
of the CogAT; we summarize only the most
pertinent points here. Correlations between the
CogAT and achievement batteries are
substantial. For example, the CogAT verbal
battery correlates in the .70s to .80s with
achievement subtests from the Iowa Tests of
Basic Skills.
The CogAT batteries predict school grades
reasonably well. Correlations range from the
.30s to the .60s, depending on grade level, sex,
and ethnic group. There does not appear to be a
clear trend as to which battery is best at
predicting grade point average. Correlations
between the CogAT and individual intelligence
tests are also substantial, typically ranging from
.65 to .75. These findings speak well for the
construct validity of the CogAT insofar as the
Stanford-Binet is widely recognized as an
excellent measure of individual intelligence.
Ansorge (1985) has questioned whether all three
batteries are really necessary. He points out that
correlations among the Verbal, Quantitative, and
Nonverbal batteries are substantial. The median
values across all grades are as follows:
Verbal and Quantitative 0.78
Nonverbal and Quantitative 0.78
Verbal and Nonverbal 0.72

Since the Quantitative battery offers little
uniqueness, from a purely psychometric point of
view there is no justification for including it.
Nonetheless, the test authors recommend use of
all batteries in hopes that differences in
performance will assist teachers in remedial
planning. However, the test authors do not make
a strong case for doing this.
A study by Stone (1994) provides a notable
justification for using the CogAT as a basis for
student evaluation. He found that CogAT scores
for 403 third graders provided an unbiased
prediction of student achievement that was more
accurate than teacher ratings. In particular,
teacher ratings showed bias against Caucasian
and Asian American students by underpredicting
their achievement scores.
Raven’s Progressive Matrices (RPM)
First introduced in 1938, Raven’s Progressive
Matrices (RPM) is a nonverbal test of inductive
reasoning based on figural stimuli (Raven,
Court, & Raven, 1986, 1992). This test has been
very popular in basic research and is also used
in some institutional settings for purposes of
intellectual screening.
RPM was originally designed as a measure of
Spearman’s g factor (Raven, 1938). For this
reason, Raven chose a special format for the test
that presumably required the exercise of g. The
reader is reminded that Spearman defined g as
the “eduction of correlates.” The term eduction
refers to the process of figuring out relationships
based on the perceived fundamental similarities
between stimuli. In particular, to correctly
answer items on the RPM, examinees must
identify a recurring pattern or relationship
between figural stimuli organized in a 3 × 3
matrix. The items are arranged in order of
increasing difficulty, hence the reference to
progressive matrices.
Raven’s test is actually a series of three different
instruments. Much of the confusion about
validity, factorial structure, and the like stems
from the unexamined assumption that all three
forms should produce equivalent findings. The
reader is encouraged to abandon this
unwarranted hypothesis. Even though the three
forms of the RPM resemble one another, there
may be subtle differences in the problem-
solving strategies required by each.
The Coloured Progressive Matrices is a 36-item
test designed for children from 5 to 11 years of
age. Raven incorporated colors into this version
of the test to help hold the attention of the young
children. The Standard Progressive Matrices is
normed for examinees from 6 years and up,
although most of the items are so difficult that
the test is best suited for adults. This test
consists of 60 items grouped into 5 sets of 12
progressions. The Advanced Progressive
Matrices is similar to the Standard version but
has a higher ceiling. The Advanced version
consists of 12 problems in Set I and 36
problems in Set II. This form is especially
suitable for persons of superior intellect.
Large sample U.S. norms for the Coloured and
Standard Progressive Matrices are reported in
Raven and Summers (1986). Separate norms for
Mexican American and African American
children are included. Although there was no
attempt to use a stratified random-sampling
procedure, the selection of school districts was
so widely varied that the American norms for
children appear to be reasonably sound. Sattler
(1988) summarizes the relevant norms for all
versions of the RPM. Raven, Court, and Raven
(1992) produced new norms for the Standard
Progressive Matrices, but Gudjonsson (1995)
has raised a concern that these data are
compromised because the testing was not
monitored.
For the Coloured Progressive Matrices, split-
half reliabilities in the range of .65 to .94 are
reported, with younger children producing lower
values (Raven, Court, & Raven, 1986). For the
Standard Progressive Matrices, a typical split-
half reliability is .86, although lower values are
found with younger subjects (Raven, Court, &
Raven, 1983). Test–retest reliabilities for all
three forms vary considerably from one sample
to the next (Raven, 1965; Raven et al., 1986).
For normal adults in their late teens or older,
reliability coefficients of .80 to .93 are typical.
However, for preteen children, reliability
coefficients as low as .71 are reported. Thus, for
younger subjects, RPM may not possess
sufficient reliability to warrant its use for
individual decision making.
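The split-half coefficients cited above are obtained by correlating scores on two halves of the test (commonly odd versus even items) and stepping the result up to full test length with the Spearman-Brown formula. A minimal Python sketch of that computation, using simulated item responses rather than any published RPM data, follows:

```python
# Minimal sketch of a split-half reliability estimate with the Spearman-Brown correction.
# Item responses are simulated; they are not data from any RPM norming sample.
import numpy as np

rng = np.random.default_rng(0)
n_examinees, n_items = 200, 60                    # e.g., Standard Progressive Matrices length
ability = rng.normal(size=(n_examinees, 1))
items = (ability + rng.normal(scale=1.5, size=(n_examinees, n_items)) > 0).astype(int)

odd_score = items[:, 0::2].sum(axis=1)            # total score on odd-numbered items
even_score = items[:, 1::2].sum(axis=1)           # total score on even-numbered items
r_half = np.corrcoef(odd_score, even_score)[0, 1]

# Spearman-Brown steps the half-test correlation up to an estimate for the full-length test
split_half_reliability = 2 * r_half / (1 + r_half)
print(round(r_half, 2), round(split_half_reliability, 2))
```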
Factor-analytic studies of the RPM provide
little, if any, support for the original intention of
the test to measure a unitary construct
(Spearman’s g factor). Studies of the Coloured
Progressive Matrices reveal three orthogonal
factors (e.g., Carlson & Jensen, 1980). Factor I
consists largely of very difficult items and might
be termed closure and abstract reasoning by
analogy. Factor II is labeled pattern completion
through identity and closure. Factor III consists
of the easiest items and is defined as simple
pattern completion (Carlson & Jensen, 1980). In
sum, the very easy and the very hard items on
the Coloured Progressive Matrices appear to tap
different intellectual processes.
The Advanced Progressive Matrices breaks
down into two factors that may have separate
predictive validities (Dillon, Pohlmann, &
Lohman, 1981). The first factor is composed of
items in which the solution is obtained by
adding or subtracting patterns (Figure 6.3a).
Individuals performing well on these items may
excel in rapid decision making and in situations
where part–whole relationships must be
perceived. The second factor is composed of
items in which the solution is based on the
ability to perceive the progression of a pattern
(Figure 6.3b). Persons who perform well on
these items may possess good mechanical
ability as well as good skills for estimating
projected movement and performing mental
rotations. However, the skills represented by
each factor are conjectural at this point and in
need of independent confirmation.
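A rough sense of how such factor-analytic conclusions are reached can be conveyed with a simple dimensionality check on item-level data. The sketch below uses simulated matrix-type items and eigenvalues of the item correlation matrix as a first pass; the published RPM analyses use full common-factor models, often on tetrachoric correlations for binary items:

```python
# Illustrative dimensionality check on simulated matrix-type item data (not RPM data).
# Published analyses use proper common-factor models, often on tetrachoric correlations
# for binary items; eigenvalues of the item correlation matrix serve here only as a
# simple first look at how many dimensions the items span.
import numpy as np

rng = np.random.default_rng(1)
n, k = 300, 36                                    # e.g., the length of APM Set II
abilities = rng.normal(size=(n, 2))               # two hypothetical latent abilities
loadings = np.vstack([np.repeat([[1.0, 0.2]], 18, axis=0),   # items loading on ability 1
                      np.repeat([[0.2, 1.0]], 18, axis=0)])  # items loading on ability 2
items = (abilities @ loadings.T + rng.normal(scale=1.2, size=(n, k)) > 0).astype(float)

R = np.corrcoef(items, rowvar=False)              # inter-item correlation matrix
eigvals = np.linalg.eigvalsh(R)[::-1]             # eigenvalues, largest first
print("Largest eigenvalues:", np.round(eigvals[:4], 2))
```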
A huge body of published research bears on the
validity of the RPM. The early data are well
summarized by Burke (1958), while later
findings are compiled in the current RPM
manuals (Raven & Summers, 1986; Raven,
Court, & Raven, 1983, 1986, 1992). In general,
validity coefficients with achievement tests
range from the .30s to the .60s. As might be
expected, these values are somewhat lower than
found with more traditional (verbally loaded)
intelligence tests. Validity coefficients with
other intelligence tests range from the .50s to
the .80s.
FIGURE 6.3 Raven’s Progressive Matrices:
Typical Items
Also, as might be expected, the correlations tend
to be higher with performance than with verbal
tests. In a massive study involving thousands of
schoolchildren, Saccuzzo and Johnson (1995)
concluded that the Standard Progressive
Matrices and the WISC-R showed
approximately equal predictive validity and no
evidence of differential validity across eight
different ethnic groups. In a lengthy review,
Raven (2000) discusses stability and variation in
the norms for the Raven’s Progressive Matrices
across cultural, ethnic, and socioeconomic
groups over the last 60 years. Indicative of the
continuing interest in this venerable instrument,
Costenbader and Ngari (2001) describe the
standardization of the Coloured Progressive
Matrices in Kenya. Further indicating the huge
international popularity of the test, Khaleefa and
Lynn (2008) provide standardization data for 6-
to 11-year-old children in Yemen.
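One common way to probe differential predictive validity of the kind examined by Saccuzzo and Johnson (1995) is to test whether the regression slope relating test scores to achievement differs across groups. The following sketch illustrates that logic with simulated data and hypothetical variable names (score, group, achievement); it is not a reanalysis of any study reported here:

```python
# Hedged sketch of a differential-prediction check: does the slope relating test scores
# to achievement differ by group? Data and variable names (score, group, achievement)
# are simulated stand-ins, not a reanalysis of Saccuzzo and Johnson (1995).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 500
df = pd.DataFrame({
    "score": rng.normal(100, 15, n),
    "group": rng.choice(["A", "B"], n),
})
df["achievement"] = 0.5 * df["score"] + rng.normal(0, 10, n)   # identical slope in both groups

model = smf.ols("achievement ~ score * C(group)", data=df).fit()
# A nonsignificant score-by-group interaction is consistent with equal predictive validity.
print(model.params.filter(like=":"))
print(model.pvalues.filter(like=":"))
```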
Even though the RPM has not lived up to its
original intentions of measuring Spearman’s g
factor, the test is nonetheless a useful index of
nonverbal, figural reasoning. The recent
updating of norms was a much-welcomed
development for this well-known test, in that
many American users were leery of the outdated
and limited British norms. Nonetheless, adult
norms for the Standard and Advanced
Progressive Matrices are still quite limited.
The RPM is particularly valuable for the
supplemental testing of children and adults with
hearing, language, or physical disabilities. Often
these examinees are difficult to assess with
traditional measures that require auditory
attention, verbal expression, or physical
manipulation. In contrast, the RPM can be
explained through pantomime, if necessary.
Moreover, the only output required of the
examinee is a pencil mark or gesture denoting
the chosen alternative. For these reasons, the
RPM is ideally suited for testing persons with
limited command of the English language. In
fact, the RPM is about as culturally reduced as
possible: The test protocol does not contain a
single word in any language. Mills and Tissot
(1995) found that the Advanced Progressive
Matrices identified a higher proportion of
minority children as gifted than did a more
traditional measure of academic aptitude (the
School and College Ability Test).
Bilker, Hansen, Brensinger, and others (2012)
developed a …
Journal of Social Issues, Vol. 67, No. 4, 2011, pp. 825-840
Beyond General Intelligence (IQ) and Emotional
Intelligence (EQ): The Role of Cultural Intelligence
(CQ) on Cross-Border Leadership Effectiveness
in a Globalized World
Thomas Rockstuhl∗
Nanyang Technological University
Stefan Seiler
Swiss Military Academy at ETH Zurich
Soon Ang
Nanyang Technological University
Linn Van Dyne
Michigan State University
Hubert Annen
Swiss Military Academy at ETH Zurich
Emphasizing the importance of cross-border effectiveness in the contemporary
globalized world, we propose that cultural intelligence—the leadership capabil-
ity to manage effectively in culturally diverse settings—is a critical leadership
competency for those with cross-border responsibilities. We tested this hypothesis
with multisource data, including multiple intelligences, in a sample of 126 Swiss
military officers with both domestic and cross-border leadership responsibilities.
Results supported our predictions: (1) general intelligence predicted both domes-
tic and cross-border leadership effectiveness; (2) emotional intelligence was a
stronger predictor of domestic leadership effectiveness, and (3) cultural intelli-
gence was a stronger predictor of cross-border leadership effectiveness. Overall,
∗Correspondence concerning this article should be sent to Thomas Rockstuhl, Block S3, 01C-108
Nanyang Business School, Nanyang Technological University, Nanyang Avenue, Singapore 639798
[e-mail: [email protected]].
© 2011 The Society for the Psychological Study of Social Issues
results show the value of cultural intelligence as a critical leadership competency
in today’s globalized world.
Globalization is a reality in the 21st century workplace. As a consequence,
leaders must function effectively in cross-border situations as well as in domestic
contexts. Leaders working in cross-border contexts must cope effectively with
contrasting economic, political, and cultural practices. As a result, careful selec-
tion, grooming, and development of leaders who can operate effectively in our
globalized environment is a pressing need for contemporary organizations (Avolio,
Walumbwa, & Weber, 2009).
To date, research on leadership effectiveness has been predominantly domestic in
focus, and does not necessarily generalize to global leaders (Gregersen, Morrison,
& Black, 1998; House, Hanges, Javidan, Dorfman, & Gupta, 2004). Hence, there
is a critical need for research that extends our understanding of how differences in
context (domestic vs. cross-border) require different leadership capabilities (Johns,
2006). As we build our arguments, we emphasize the importance of matching
leadership capabilities to the specific context.
Global leaders, like all leaders, are responsible for performing their job re-
sponsibilities and accomplishing their individual goals. Accordingly, general ef-
fectiveness, defined as the effectiveness of observable actions that managers take
to accomplish their goals (Campbell, McCloy, Oppler, & Sager, 1993), is im-
portant for global leaders. We use the term “general” in describing this type of
effectiveness because it makes no reference to culture or cultural diversity. Thus,
it applies to all leader jobs.
Going beyond general effectiveness, it is crucial to recognize the unique
responsibilities that leaders have when their jobs are international in scope and
involve cross-border responsibilities (Spreitzer, McCall, & Mahoney, 1997). Lead-
ership in cross-border contexts requires leaders to (1) adopt a multicultural per-
spective rather than a country-specific perspective; (2) balance local and global
demands which can be contradictory; and (3) work with multiple cultures si-
multaneously rather than working with one dominant culture (Bartlett & Ghoshal,
1992). Thus, we define cross-border effectiveness as the effectiveness of ob-
servable actions that managers take to accomplish their goals in situations char-
acterized by cross-border cultural diversity. This aspect of global leaders’ ef-
fectiveness explicitly recognizes and emphasizes the unique challenges of het-
erogeneous national, institutional, and cultural contexts (Shin, Morgeson, &
Campion, 2007).
Effective leadership depends on the ability to solve complex technical and
social problems (Mumford, Zaccaro, Harding, Jacobs, & Fleishman, 2000). Given
important differences in domestic and cross-border contexts, it is unlikely that
leadership effectiveness is the same in domestic contexts as in cross-border con-
texts. In this article, we aim to shed light on these differences by focusing on
ways that leadership competencies are similar and different in their relevance to
different contexts (domestic vs. cross-border).
Cultural Intelligence and Cross-Border Leadership Effectiveness
When leaders work in cross-border contexts, the social problems of leadership
are especially complex because cultural background influences prototypes and
schemas about appropriate leadership behaviors. For example, expectations about
preferred leadership styles (House et al., 2004), managerial behaviors (Shin et al.,
2007), and the nature of relationships (Yeung & Ready, 1995) are all influenced
by culture. Thus, effective cross-border leadership requires the ability to function
in culturally diverse contexts.
Although general intelligence (Judge, Colbert, & Ilies, 2004) as well as emo-
tional intelligence (Caruso, Meyer, & Salovey, 2002) have been linked to lead-
ership effectiveness in domestic contexts, neither deals explicitly with the ability
to function in cross-border contexts. To address the unique aspects of culturally
diverse settings, Earley and Ang (2003) drew on Sternberg and Detterman’s (1986)
multidimensional perspective on intelligence to develop a conceptual model of cul-
tural intelligence (CQ). Ang and colleagues (Ang & Van Dyne, 2008; Ang et al.,
2007) defined CQ as an individual’s capability to function effectively in situations
characterized by cultural diversity. They conceptualized CQ as a multidimen-
sional concept comprising metacognitive, cognitive, motivational, and behavioral
dimensions.
Metacognitive CQ is an individual’s level of conscious cultural awareness dur-
ing intercultural interactions. It involves higher level cognitive strategies—such
as developing heuristics and guidelines for social interaction in novel cultural
settings—based on deep-level information processing. Those with high metacog-
nitive CQ are consciously aware of the cultural preferences and norms of different
societies prior to and during interactions. They question cultural assumptions and
adjust their mental models about intercultural experiences (Triandis, 2006).
Whereas metacognitive CQ focuses on higher order cognitive processes, cog-
nitive CQ is knowledge of norms, practices, and conventions in different cultures
acquired from education and personal experience. This includes knowledge of
cultural universals as well as knowledge of cultural differences. Those with high
cognitive CQ have sophisticated mental maps of culture, cultural environments,
and how the self is embedded in cultural contexts. These knowledge structures
provide them with a starting point for anticipating and understanding cultural
systems that shape and influence patterns of social interaction within a culture.
Motivational CQ is the capability to direct attention and energy toward learn-
ing about and operating in culturally diverse situations. Kanfer and Heggestad
(1997, p. 39) argued that motivational capacities “provide agentic control of af-
fect, cognition, and behavior that facilitate goal accomplishment.” Expectations
and the value associated with successfully accomplishing a task (Eccles &
Wigfield, 2002) influence the direction and magnitude of energy channeled to-
ward that task. Those with high motivational CQ direct attention and energy toward
cross-cultural situations based on their intrinsic interest in cultures (Deci & Ryan,
1985) and confidence in intercultural effectiveness (Bandura, 2002).
Finally, behavioral CQ is the capability to exhibit culturally appropriate verbal
and nonverbal actions when interacting with people from other cultures. Behav-
ioral CQ also includes judicious use of speech acts—using culturally appropriate
words and phrases in communication. Those with high behavioral CQ demonstrate
flexibility in their intercultural interactions and adapt their behaviors to put others
at ease and facilitate effective interactions.
Rooted in differential biological bases (Rockstuhl, Hong, Ng, Ang, & Chiu,
2011), metacognitive, cognitive, motivational, and behavioral CQ represent qual-
itatively different facets of overall CQ—the capability to function and manage
effectively in culturally diverse settings (Ang & Van Dyne, 2008; Ang et al.,
2007). Accordingly, the four facets are distinct capabilities that together form a
higher level overall CQ construct.
Offermann and Phan (2002) offered three theoretical reasons for why leaders
with high CQ capabilities are better able to manage the culturally diverse ex-
pectations of their followers in cross-border contexts (Avolio et al., 2009). First,
awareness during intercultural interactions allows leaders to understand the impact
of their own culture and background. It gives them insights into how their own
values may bias their assumptions about behaviors in the workplace. It enhances
awareness of the expectations they hold for themselves and others in leader –
follower relationships. Second, high CQ causes leaders to pause and verify the
accuracy of their cultural assumptions, consider their knowledge of other cul-
tures, and hypothesize about possible values, biases, and expectations that may
apply to intercultural interactions. Third, leaders with high CQ combine their
rich understanding of self and others with motivation and behavioral flexibility in
ways that allow them to adapt their leadership behaviors appropriately to specific
cross-cultural situations.
In addition to managing diverse expectations as a function of cultural dif-
ferences, leaders in cross-border contexts also need to effectively manage the
exclusionary reactions that can be evoked by cross-cultural contact (Torelli, Chiu,
Tam, Au, & Keh, 2011). Social categorization theory (Tajfel, 1981; Turner, 1987)
theorizes that exclusionary reactions to culturally diverse others are initially driven
by perceptions of dissimilarity and viewing others as members of the out-group.
Research demonstrates, however, that those with high CQ are more likely to de-
velop trusting relationships with culturally diverse others and less likely to engage
in exclusionary reactions (Rockstuhl & Ng, 2008). Consistent with our earlier
emphasis on matching capabilities to the context, their results also demonstrated
that CQ did not influence trust when partners were culturally homogeneous.
An increasing amount of research demonstrates the importance of CQ for
performance effectiveness in cross-border contexts (for reviews, see Ang, Van
Dyne, & Tan, 2011; Ng, Van Dyne, & Ang, in press). This includes expatriate per-
formance in international assignments (Chen, Kirkman, Kim, Farh, & Tangirala,
2010), successful intercultural negotiations (Imai & Gelfand, 2010), leadership
potential (Kim & Van Dyne, 2011), and leadership effectiveness in culturally
diverse work groups (Groves & Feyerherm, 2011).
To summarize, theory and research support the notion that leaders with high
CQ should be more effective at managing expectations of culturally diverse others
and minimizing exclusionary reactions that can occur in cross-border contexts.
Thus, we hypothesize that general intelligence will predict leadership effectiveness
in domestic contexts and in cross-border contexts; emotional intelligence will be
a stronger predictor of leadership effectiveness in domestic contexts; and cultural
intelligence will be a stronger predictor of leadership effectiveness in cross-border
contexts.
Method
We tested our hypotheses with field data from 126 military leaders and their
peers studying at the Swiss Military Academy at ETH Zurich. CQ has special
relevance to leadership in military settings because armed forces throughout the
world are increasingly involved in international assignments (Ang & Ng, 2007).
We obtained data from professional officers in a 3-year training program that
focused on developing domestic and cross-border leadership capabilities. Thus,
the sample allows comparison of leadership effectiveness across contexts. During
the program officers completed domestic assignments (e.g., physical education,
group projects, and general military and leadership military training) as well
as cross-border assignments (e.g., international support operations for the UN in
former Yugoslavia and international civil-military collaboration training with U.S.,
EU, and Croatian armed forces). Military contexts represent high-stakes settings
where leadership effectiveness has broad implications for countries, regions, and
in some cases, the world. Poor-quality leadership can exacerbate tensions and
heighten conflict between groups. In addition, it is essential that military leaders
overcome initial exclusionary reactions that can be triggered when interacting
with people from different cultures in high-stress situations. As a result, gaining
a better understanding of general and cross-border leadership effectiveness in this
setting should have important practical implications.
All 126 participants (95% response rate) were male Caucasians with average
previous leadership experience of 6.44 years (SD = 4.79). On average, they had
lived in 1.45 different countries (SD = .91). They had been studying and working
together on a daily basis for at least 7 months prior to the study.
Procedure
Two peers in the program, selected based on cultural diversity, provided rat-
ings of general and cross-border leadership effectiveness, such that those with
French, Italian, or Rhaeto-Romansh background were rated by peers who had a
German background and vice versa. We designed the data collection using peers for
the assessment of leadership effectiveness for four reasons. First, all participants
had extensive previous leadership experience in the military and were knowl-
edgeable observers in these contexts. Second, military mission goals were clearly
specified, and thus peers could readily observe both domestic and cross-border
effectiveness in terms of mission completion. Third, participants worked closely
together and had numerous opportunities to observe peers’ leadership effective-
ness across general and cross-border contexts. Finally, Viswesvaran, Schmidt,
and Ones (2002) showed in their meta-analysis of convergence between peer and
supervisory ratings that leadership is one job performance dimension for which
ratings from these two sources are interchangeable.
Participants provided data on cultural intelligence, emotional intelligence, and
demographic background. In addition, we obtained archival data on general mental
ability and personality. This multisource approach is a strength of the design.
Measures
Peers assessed general leadership effectiveness and cross-border leadership
effectiveness with six items each (1 = strongly disagree; 7 = strongly agree). Ex-
isting leadership effectiveness measures (e.g., Ng, Ang, & Chan, 2008; Offermann,
Bailey, Vasilopoulos, Seal, & Sass, 2004) do not distinguish explicitly between
general and cross-border effectiveness. Thus, we reviewed the literature on general
leadership effectiveness, developed six general leadership items, and then wrote
parallel items that focused specifically on leadership effectiveness in culturally
diverse contexts.
Independent ratings by three subject matter experts (1 = not at all repre-
sentative, 2 = somewhat, 3 = highly representative) provided face validity for the
items (intraclass correlation = .83). Exploratory factor analysis (pilot sample #1:
n = 95) showed two distinct factors (74.49% explained variance), and confirma-
tory factor analysis (CFA) (pilot sample #2: n = 189) demonstrated acceptable fit:
χ²(53 df) = 94.69, p < .05, RMSEA = .066. In the substantive sample, interrater
agreement (rWG(J) = .71 – 1.00) supported aggregation of peer ratings for general
(α = .91) and cross-border leadership effectiveness (α = .93).
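For readers unfamiliar with the rWG(J) index, it compares the observed variance among raters with the variance expected if raters responded at random (James, Demaree, & Wolf, 1984). The following sketch computes rWG(J) for a single rated leader using invented ratings on a six-item, 7-point scale:

```python
# Illustrative rWG(J) computation for one rated leader (James, Demaree, & Wolf, 1984),
# assuming a uniform null distribution over the response options. Ratings are invented.
import numpy as np

ratings = np.array([[6, 5, 6, 7, 6, 5],   # rater 1 across six leadership items (1-7 scale)
                    [5, 5, 6, 6, 7, 5]])  # rater 2
J = ratings.shape[1]                       # number of items
A = 7                                      # number of response options
sigma2_eu = (A**2 - 1) / 12                # expected variance under uniform (random) responding
mean_item_var = ratings.var(axis=0, ddof=1).mean()

ratio = mean_item_var / sigma2_eu
r_wg_j = (J * (1 - ratio)) / (J * (1 - ratio) + ratio)
print(round(r_wg_j, 2))
```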
We assessed CQ with the previously validated 20-item CQS (Cultural Intel-
ligence Scale: Ang et al., 2007), which is highly reliable and generalizable across
samples and cultures (Van Dyne, Ang, & Koh, 2008). Sample items include:
I check the accuracy of my cultural knowledge as I interact with people from
different cultures; and I alter my facial expressions when a cross-cultural inter-
action requires it (α = .89). CFA analysis of a second-order model demonstrated
good fit to the data: χ²(40 df) = 58.13, p < .05, RMSEA = .061, so we averaged
the four factors to create our measure of overall CQ. We assessed EQ with 19
items (Brackett, Rivers, Shiffman, Lerner, & Salovey, 2006) and obtained archival
data on general mental ability (the SHL Critical Reasoning Test Battery, 1996) and
Big-Five personality (Donnellan, Oswald, Baird, & Lucas, 2006). These controls
are important because prior research shows CQ is related to EQ (Moon, 2010),
general mental ability (Ang et al., 2007), and personality (Ang, Van Dyne, & Koh,
2006). We also controlled for previous leadership experience (number of years of
full-time job experience with the Swiss Military), international experience (num-
ber of countries participants had lived in), and age because prior research shows
relationships with leadership effectiveness.
Results
CFA analysis supported the discriminant validity of the 10 constructs
(χ²(186 df) = 255.12, p < .05, RMSEA = .046) and the proposed 10-factor model
provided a better fit than plausible alternative models. Table 1 presents descriptive
statistics and correlations. Table 2 summarizes hierarchical regression and relative
weight analyses (Johnson & LeBreton, 2004).
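The hierarchical strategy summarized in Table 2 enters the control variables first and then adds the three intelligences, testing the change in R². A compact Python sketch of that logic, with hypothetical variable names and simulated data standing in for the study variables, is given below:

```python
# Sketch of the hierarchical logic behind Table 2: controls in Step 1, the three
# intelligences added in Step 2, with the R-squared change tested. Variable names
# and data are hypothetical stand-ins, not the study data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 126
df = pd.DataFrame(rng.normal(size=(n, 5)), columns=["age", "intl_exp", "iq", "eq", "cq"])
df["effectiveness"] = 0.3 * df["intl_exp"] + 0.2 * df["iq"] + rng.normal(size=n)

step1 = smf.ols("effectiveness ~ age + intl_exp", data=df).fit()
step2 = smf.ols("effectiveness ~ age + intl_exp + iq + eq + cq", data=df).fit()

delta_r2 = step2.rsquared - step1.rsquared
f_change = step2.compare_f_test(step1)     # (F statistic, p value, df difference)
print(round(step1.rsquared, 3), round(step2.rsquared, 3), round(delta_r2, 3), f_change)
```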
As predicted, IQ was positively related to general leadership effectiveness
(β = .23, p < .05) and cross-border leadership effectiveness (β = .18, p < .05),
even after controlling for age, leadership experience, international experience,
Big-Five personality, EQ, and CQ. Thus, general mental ability had implications
for both aspects of leadership effectiveness.
In addition and consistent with our predictions, EQ was positively related to
general leadership effectiveness (β = .27, p < .05) but not to cross-border leadership
effectiveness (β = −.07, n.s.), after controlling for age, leadership experience,
international experience, Big-Five personality, IQ, and CQ. Relative weight analy-
sis demonstrated that EQ predicted 25.7% of the variance in general leadership ef-
fectiveness but only 3.5% of the variance in cross-border leadership effectiveness.
Thus, EQ has special relevance to leadership effectiveness in domestic contexts
but not to leadership effectiveness in cross-border contexts.
Finally, CQ was positively related to cross-border leadership effectiveness
(β = .24, p < .05) but not to general leadership effectiveness (β = −.11, n.s.), after
accounting for the controls. Relative weight analysis showed that CQ predicted
24.7% of the variance in cross-border leadership effectiveness and only 4.7% of
the variance in general leadership effectiveness. Thus, results demonstrate the
unique importance of CQ to cross-border leadership effectiveness.
Results also show that previous international experience predicted both
general (β = .30, p < .01) and cross-border leadership effectiveness (β = .35,
Table 1. Means, Standard Deviations, and Correlations

Variable M SD 1 2 3 4 5 6 7 8 9 10 11 12
1. General leadership effectiveness a 5.13 0.66 (.91)
2. Cross-border leadership effectiveness a 4.41 0.70 .56∗∗ (.93)
3. General intelligence b 22.06 5.69 .23∗∗ .14 –
4. Emotional intelligence c 4.82 0.62 .26∗∗ .15 .23∗∗ (.76)
5. Cultural intelligence c 5.01 0.71 .17 .33∗∗ .15 .62∗∗ (.89)
6. Agreeableness 4.38 0.64 .01 .04 .00 .11 .06 (.62)
7. Conscientiousness 4.77 0.56 −.06 .02 .02 −.05 −.08 .02 (.77)
8. Emotional stability 4.53 0.63 .01 .01 .13 .16 −.06 .29∗∗ .18∗ (.66)
9. Extraversion 4.52 0.61 .07 .09 .10 .17 .15 .20∗ .06 .18∗ (.77)
10. Openness to experience 4.08 0.65 .06 .14 −.06 .09 .20∗ .02 .09 −.03 .37∗∗ (.80)
11. Age (in years) 29.07 3.96 −.08 .11 −.21∗ .02 .09 .14 −.13 .03 −.19∗ .10 –
12. Leadership experience (in years) 6.44 4.79 −.13 −.04 −.28∗∗ −.03 .01 .10 .15 .04 −.10 .12 .55∗∗ –
13. Prior international experience 1.45 0.91 .23∗∗ .38∗∗ −.20∗ .01 .25∗∗ .09 −.02 −.21∗ .00 .09 .11 .06

Note. N = 126. a Observer report. b Performance based. c Self-report.
∗p < .05, ∗∗p < .01.
Table 2. Hierarchical Regression Results (N = 126)
General leadership Cross-border leadership
effectiveness effectiveness
Step 1 Step 2 RW Step 1 Step 2 RW
Age (in years) −.06 −.05 2.3% .17 .16 5.6%
Leadership experience (in years) −.11 −.04 4.0% −.16 −.11 2.4%
Prior international experience .25∗∗ .30∗∗ 32.9% .38∗∗∗ .35∗∗∗ 48.1%
Agreeableness −.02 −.03 0.3% −.04 −.04 0.2%
Conscientiousness −.07 −.06 1.8% .02 .02 0.1%
Emotional stability .07 .01 0.7% .07 .07 0.9%
Extraversion .03 .00 0.7% .07 .03 1.3%
Openness to experience .05 .06 1.4% .08 .06 3.6%
General intelligence .23∗ 25.5% .18∗ 9.5%
Emotional intelligence .27∗ 25.7% −.07 3.5%
Cultural intelligence −.11 4.7% .24∗ 24.7%
F 1.32 2.39∗∗ 3.24∗∗ 3.61∗∗∗
(8,117) (11,114) (8,117) (11,114)
ΔF 1.32 4.89∗∗ 3.24∗∗ 3.94∗∗
(8,117) (3,114) (8,117) (3,114)
R² .08 .19 .18 .26
ΔR² .08 .11 .18 .08
Adjusted R² .02 .11 .13 .19
Note. RW = relative weights in percentage of R² explained. ∗p < .05, ∗∗p < .01, ∗∗∗p < .001.
p < .001). Surprisingly, previous leadership experience did not predict general
leadership effectiveness (β = −.04, n.s.) or cross-border leadership effectiveness
(β = −.11, n.s.) in our study. While this result is inconsistent with earlier research
that has demonstrated experience can be an important predictor of leadership suc-
cess (Fiedler, 2002), it is also consistent with recent theoretical arguments that
experience may not necessarily translate into effectiveness (Ng, Van Dyne, &
Ang, 2009).
Discussion
This study responds to a recent call for research on the unique aspects of
global leadership and the competencies that predict global leadership effective-
ness (Avolio et al., 2009). As hypothesized, results of our rigorous multisource
research design show differences in predictors of general leadership effectiveness
compared to cross-border leadership effectiveness. Cross-border leaders must
work simultaneously with systems, processes, and people from multiple cultures.
Thus, cultural intelligence—the capability of functioning effectively in multicul-
tural contexts (Earley & Ang, 2003)—is a critical competency of effective global
leaders.
Theoretical Implications
Our findings have important theoretical implications. First, as Chiu, Gries,
Torelli, and Cheng (2011) point out, the outcomes of globalization are uncertain.
Some academics predict a multicultural global village and others expect clashes
between civilizations. As the articles in this issue attest, contextual and psycholog-
ical factors influence the extent to which intercultural contact activates exclusion-
ary or integrative reactions. For example, Morris, Mor, and Mok (2011) highlight
the adaptive value and creative benefits of developing a cosmopolitan identity.
Our findings complement this perspective by emphasizing the importance of cul-
tural intelligence for leadership effectiveness—especially in high-stakes global
encounters, such as cross-border military assignments. In addition, our study of-
fers another perspective because we emphasize the value of theory and research
on the competencies of global leaders that help them perform in global contexts,
rather than focusing on psychological reactions to globalization. Focusing on
competencies suggests exciting opportunities for future research on the dynamic
interaction between globalization and global leaders.
A second set of theoretical implications is based on the context-specific rela-
tionships demonstrated in this study. Specifically, results suggest that EQ and CQ
are complementary because EQ predicted general but not cross-border leadership
while CQ predicted cross-border but not general leadership effectiveness. This
contrasting pattern reinforces the assertion that domestic leader skillsets do not
necessarily generalize to global leader skillsets (Avolio et al., 2009; Caligiuri,
2006). Hence, EQ and CQ are related but distinct forms of social intelligence
(Moon, 2010), and each has context-specific relevance to different aspects of global
leadership effectiveness. Thus, researchers should match types of intelligences to
specifics of the situation to maximize predictive validity of effectiveness.
Practical Implications
Our findings also have practical implications for the selection and develop-
ment of global leaders. First, the significant relationship between general intelli-
gence and both forms of leader effectiveness reinforces the utility of intelligence
as a selection tool for identifying leadership potential. In addition, the incre-
mental validity of emotional and cultural intelligence as predictors of leadership
effectiveness, over and above previous experience, personality, and general intel-
ligence, confirms predictions that social intelligences also contribute to leadership
effectiveness (Riggio, 2002). Accordingly, managers should consider multiple
forms of intelligence when assessing leadership potential, especially when work
roles include responsibility for coordinating complex social interactions.
Given the differential predictive validity of EQ and CQ relative to the two
types of leadership effectiveness in our study, applying the notion of context sim-
ilarity and matching types of intelligence with the leadership context should help
organizations enhance their understanding of what predicts global leader effec-
tiveness. This finding should also help organizations understand why leaders who
are effective in domestic contexts may not be effective in cross-border contexts.
These insights should help organizations tailor leadership development opportuni-
ties to the competency requirements of the situation. When leaders work primarily
in domestic settings, organizations should place more emphasis on developing
within-culture capabilities, such as EQ. In contrast, when leaders work exten-
sively in international or cross-border settings, organizations should emphasize
development of cross-cultural capabilities, such as CQ (Ng, Tan, & Ang, 2011).
Limitations and Future Research
Despite the strength of our multisource design and support for our predictions,
this study has limitations that should help guide future research. First, our cross-
sectional design prevents inferences about the causal direction of relationships.
Thus, we recommend longitudinal field research that assesses capabilities and
leadership effectiveness at multiple points in time.
Second, our study was conducted in a military context and all participants
were male. Thus, we recommend caution in generalizing our findings to other
settings until research can assess whether relationships can be replicated in other
contexts. To address this need, we recommend future research on different types of
intelligences and different aspects of leadership effectiveness in other vocational
settings and different cultures (Gelfand, Erez, & Aycan, 2007).
Third, …
THE RELATIONSHIPS AMONG STERNBERG’S TRIARCHIC
ABILITIES, GARDNER’S MULTIPLE INTELLIGENCES, AND
ACADEMIC ACHIEVEMENT
BIRSEN EKINCI
Marmara University
In this study I investigated the relationships among Sternberg’s Triarchic Abilities (STA),
Gardner’s multiple intelligences, and the academic achievement of children attending
primary schools in Istanbul, Turkey. Participants were 172 children (93 boys and 81 girls)
aged between 11 and 12 years. STA Test (STAT) total scores were significantly and positively
related to linguistic, logical-mathematical, and intrapersonal test scores. Analytical ability
scores were significantly positively related to only logical-mathematical test scores, practical
ability scores were only related to intrapersonal test scores, and the STAT subsections were
significantly related to each other. After removing the effect of multiple intelligences, the
partial correlations between mathematics, social science, and foreign language course grades
and creative, practical, analytical, and total STAT scores, were found to be significant for
creative scores and total STAT scores, but nonsignificant for practical scores and analytical
STAT scores.
Keywords: Sternberg’s Triarchic Abilities Test, multiple intelligences, academic achievement,
children, intelligence.
Since 1980 there has been increasing interest in the role of intelligence in
learning and its impact on student achievement. Similarly to education theorists,
many researchers on intelligence have been conducting studies to apply theories
about intelligence, to education in general and, in particular, to the instructional
context of the classroom (Castejón, Gilar, & Perez, 2008). The main difference
between contemporary and older approaches to the role of intelligence is that,
SOCIAL BEHAVIOR AND PERSONALITY, 2014, 42(4), 625-634
© Society for Personality Research
http://dx.doi.org/10.2224/sbp.2014.42.4.625
Birsen Ekinci, Atatürk Education Faculty, Marmara University.
This study was supported by the Marmara University, Scientific Research Projects Center, research
number EGT-D-110913-0387.
Correspondence concerning this article should be addressed to: Birsen Ekinci, Atatürk Education
Faculty, Department of Primary Education, Marmara University, Göztepe Campus, 34722 Kadiköy,
Istanbul, Turkey. Email: [email protected]
in earlier conceptualizations, intelligence was described as involving one factor
of general mental ability that encompasses the common variance among all the
contributing factors. The existence of this general intelligence factor was originally
hypothesized by Spearman in 1927 and labeled as “g” (see Jensen, 1998). It was
hypothesized that this g factor exists over and above the various abilities that
make up intelligence, including verbal, spatial visualization, numerical reasoning,
mechanical reasoning, and memory (Carroll, 1993). However, according to
contemporary theories, intelligence must be regarded as existing in various forms
and the levels of intelligence can be improved through education. The most
widely accepted comparative theories of intelligences in recent literature are
Gardner’s (1993) multiple intelligences theory and Sternberg’s (1985) triarchic
theory of intelligence. Researchers have reported significant differences between
student outcomes for classroom instruction conducted following the principles
of multiple intelligences, and student outcomes under traditionally designed
courses of instruction in science (Özdermir, Güneysu, & Tekkaya, 2006), reading
(Al-Balhan, 2006), and mathematics (Douglas, Burton, & Reese-Durham, 2008).
Gardner (1993) developed a theory of multiple intelligences that comprises
seven distinct areas of skills that each person possesses to different degrees.
Linguistic intelligence (LI) is the capacity to use words effectively, either orally
or in writing. Logical-mathematical intelligence (LMI) is the capacity to use
numbers effectively and to reason well. Spatial intelligence (SI) is the ability to
perceive the visual-spatial world accurately and to interpret these perceptions.
Bodily-kinesthetic intelligence (KI) involves expertise in using one’s body to
express ideas and feelings. Musical intelligence (MI) is the capacity to perceive,
discriminate, and express musical forms. Interpersonal intelligence (INPI) is the
ability to perceive, and make distinctions in, the moods, intentions, motivations,
and feelings of other people. Intrapersonal intelligence (INTI) is self-knowledge
and the ability to act adaptively on the basis of that knowledge. Naturalist
intelligence (NI) is expertise in the recognition and classification of the numerous
species – the flora and fauna – of a person’s environment (Armstrong, 2009).
Researchers have addressed the relationship between multiple intelligences
and metrics of different abilities, and of various psychological constructs. Reid,
Romanoff, Algozzine, and Udall (2000) showed that SI, LI, and LMI were
related to scores in a test to measure the nonverbal abilities of pattern completion,
reasoning by analogy, serial reasoning, and spatial visualization, among a group
of handicapped and nonhandicapped children aged between 5 and 17 years.
Furthermore, the effects of multiple intelligences-based teaching strategies on
students’ academic achievement have been studied extensively (Al-Balhan,
2006; Douglas et al., 2008; Greenhawk, 1997; Mettetal, Jordan, & Harper,
1997; Özdermir et al., 2006). In addition, some researchers have investigated
the relationship between multiple intelligences and academic achievement
(McMahon, Rose, & Parks, 2004; Snyder, 1999). McMahon and colleagues
found that, compared with other students, fourth-grade students with higher
scores on LMI were more likely to demonstrate reading comprehension scores
at, or above, grade level. In a similar study, Snyder reported a positive correlation
between high school students’ grade point averages and KI. In the same study
results showed that there was a positive correlation between the total score for
the Metropolitan Achievement Test-Reading developed by the Psychological
Corporation of San Antonio, Texas, USA and the categories of LMI and LI.
Sternberg developed the second well-known intelligence theory. According
to Sternberg (1999a, 1999b), individuals show their intelligence when they
apply the information-processing components of intelligence to cope with
relatively novel tasks and situations. Within this approach to intelligence,
Sternberg (1985) proposed the triarchic theory of intelligence, according to
which there are three different, but interrelated, aspects of intellect: (a) analytic
intelligence, (b) creative intelligence, and (c) practical intelligence. Individuals
highly skilled in analytical intelligence are adept at analytical thinking, which
involves applying the components of thinking to abstract, and often academic,
problems. Individuals who have a high degree of creative intelligence are
skilled at discovering, creating, and inventing ideas and products. People who
have a high level of practical intelligence are good at using, implementing,
and applying ideas and products. Sternberg (1997) developed an instrument,
the Sternberg Triarchic Abilities Test (STAT), to evaluate triarchically based
intelligence. In this instrument each aspect of intelligence is tested through
three modes of presentation of problems: verbal, quantitative, and figural. A
number of previous researchers have established the construct validity of the
STAT (Sternberg, Castejón, Prieto, Hautamäki, & Grigorenko, 2001; Sternberg,
Ferrari, Clinkenbeard, & Grigorenko, 1996). Although Sternberg did not intend
the STAT to be a measure of general intelligence, as assessed by conventional
intelligence tests, in related literature (Brody, 2003) there are contradictory
results and opinions on this issue. Sternberg (2000a, 2000b) has claimed that the
STAT is independent of measures of general intelligence and a more accurate
predictor of academic achievement. However, Gottfredson (2002) pointed out
that the data obtained to support this claim are sparse and suggested that the
data collected by Sternberg et al. (1996) support the conclusion that the STAT is
related to other measures of intelligence and may, in fact, be a measure of general
intelligence. The triarchic abilities are related to scores on other intelligence tests
(e.g., Concept Mastery Test, Watson-Glaser Critical Thinking Appraisal, Cattell
Culture-Fair Test of g; Sternberg et al., 1996). However, Brody (2003) suggested
that although these correlations are substantial, it is likely that they underestimate
general intelligence because they were obtained from a sample of high school
students who were predominately categorized as gifted, as determined by IQ
scores, and these students were, therefore, likely to record a restricted range of
scores on the tests.
In the present study I hypothesized that both multiple intelligences total
scores and STAT total scores would be predictors of academic achievement.
Specifically, I hypothesized that the LI and LMI, and the analytical STAT, would
be predictors of student success in the subject areas of mathematics, science,
social science, and foreign-language learning.
Method
Participants
Participants were 174 randomly selected fifth- and sixth-grade students (81
girls and 93 boys) attending primary school in Istanbul, Turkey. Students’ ages
ranged from 11 to 12 years old.
Instruments
The students completed the Turkish version of Gardner’s Multiple Intelligences
Inventory (MII; Saban, 2002) to assess participants’ preferred intelligence within
one of the eight categories: LI, LMI, SI, MI, KI, INPI, INTI, and NI. The possible
score for the MII ranges from 0 to 80. The individual category in which a student
has the highest score is considered to be the type of intelligence in which that
student is most skilled. The overall Cronbach’s alpha reliability coefficient in this
study was .96, denoting high reliability; .89 for LI; .83 for LMI; .89 for SI; .88
for MI; .78 for KI; .85 for INPI; .85 for INTI; and .84 for NI.
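Cronbach's alpha, the reliability index reported here, is computed from the item variances and the variance of the total score. The following Python sketch illustrates the calculation on simulated item responses; the MII items themselves are not reproduced:

```python
# Minimal sketch of a Cronbach's alpha computation of the kind reported for the MII.
# Item responses are simulated; the inventory's actual items are not reproduced here.
import numpy as np

rng = np.random.default_rng(4)
n_students, n_items = 174, 10              # e.g., the items of one MII category
true_score = rng.normal(size=(n_students, 1))
items = true_score + rng.normal(scale=0.8, size=(n_students, n_items))

k = n_items
item_variances = items.var(axis=0, ddof=1)
total_variance = items.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(round(alpha, 2))
```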
The second instrument that I used in this study was Sternberg’s Triarchic
Abilities Test (STAT). The test comprises 81 items divided across three
subsections designed to measure analytical, creative, and practical abilities. I
translated this test into Turkish using the back-translation technique. In order
to ensure that the back-translation retained the meaning of the original form, I
conducted validity and reliability checks. The Turkish and the English versions
of the test were given to 80 bilingual Turkish- and English-speaking students
to complete within two weeks. Analyses of scores for the Turkish and English
versions of test completed by these students yielded high correlation values (.85
for analytical, .79 for practical, and .81 for creative subsections). The overall
alpha reliability coefficient of this test was .89, and for the subsections it was .80
for analytical, .77 for practical, and .78 for creative.
Procedure
The students completed the instruments during class time and in their
classrooms. There was no time limit for completion. Each test session lasted
approximately 60 minutes. The parents of the participating children gave
permission for the researcher to access the students’ grade point average for
mathematics, science, social science, and foreign language courses at the end of
the year during which the study was conducted. Each participant received a pen
and pencil as a thank-you gift for his/her participation in this study.
Data Analysis
The data were analyzed using SPSS version 15 to conduct correlation analysis
and multiple regression analysis.
Results
As shown in Table 1, the children’s STAT total scores (M = 35.34, SD = 9.09)
were significantly and positively related to LI (M = 28.98; SD = 7.59), LMI (M =
30.12, SD = 6.87), and INTI (M = 29.10, SD = 7.15) scores (p < .01). Analytical
subsection STAT scores (M = 13.76, SD = 3.96) were significantly related to LM
intelligence scores (p < .01). STAT practical subsection scores (M = 10.37, SD =
3.06) were significantly correlated only with INTI scores (p < .01).
Table 1. Relationships Among STAT Total Scores, Analytical, Practical, and Creative Ability
Scores, and Multiple Intelligences Scores
LI LMI SI MI KI INPI INTI NI
Analytical .303 .413** -.057 .093 .036 .021 .281 -.102
Practical .274 .268 .003 .113 .041 .095 .434** -.109
Creative .291 .540** -.062 .103 .004 -.049 .361* -.098
Total .351* .506** -.051 .123 .031 .019 .425** -.124
Note. ** p < .01, * p < .05. LI = linguistic intelligence, LMI = logical-mathematical intelligence,
SI = spatial intelligence, MI = musical intelligence, KI = bodily-kinesthetic intelligence, INPI =
interpersonal intelligence, INTI = intrapersonal intelligence, NI = naturalist intelligence.
Mathematics course grades (M = 3.78; SD = 1.20) were significantly related
to the STAT total (p < .001) and to the STAT analytical (p < .001), practical
(p < .01), and creative (p < .01) subsections. Similarly, social science (M = 3.78,
SD = 1.10) and science course grades (M = 3.51, SD = 1.40) were significantly
related to the STAT total (p < .01) and to the STAT analytical (p < .01) and creative
(p < .01) subsections. However, foreign language course grades (M = 3.57, SD
= 1.16) were significantly related to all of the subsection scores of the STAT
(p < .001; see Table 2).
Table 2. Relationships Among STAT Total Scores, Analytical, Practical, and Creative Sub-
section Scores, and Academic Success
Mathematics Science Social science Foreign language
Analytical .536* .395** .304** .454*
Practical .461** .264 .269 .451*
Creative .491* .378** .307** .442*
Total .588* .415** .347** .527*
Note. * p < .001, ** p < .01.
Mathematics grades of the participants were significantly related to LI (p <
.01), LMI (p < .01), INPI (p < .05), and INTI (p < .01) scores. Similarly, students’
course grades for science were significantly related to LI (p < .05), LMI (p <
.01), and INTI (p < .05) scores; students’ social science course grades were
significantly related to LI (p < .05), LMI (p < .01), and INTI (p < .05) scores;
and students’ course grades for foreign languages were significantly related to LI
(p < .01), LMI (p < .01) and INTI (p < .01) scores (see Table 3).
Table 3. Relationships Between Multiple Intelligences Scores and Academic Success
LI LMI SI MI KI INPI INTI NI
Mathematics .458** .695** .080 .174 .285 .356* .522** .140
Science .340* .575** .007 .070 .239 .312 .379* .085
Social science .359* .598** .125 .118 .217 .319 .356* .139
Foreign language .484** .718** .211 .201 .260 .316 .495** .227
Note. ** p < .01, * p < .05. LI = linguistic intelligence, LMI = logical-mathematical intelligence,
SI = spatial intelligence, MI = musical intelligence, KI = bodily-kinesthetic intelligence, INPI =
interpersonal intelligence, INTI = intrapersonal intelligence, NI = naturalist intelligence.
Multiple regression analyses were conducted in which the variance caused by
the MII was removed, and partial correlations were computed between course
grades and children’s STAT total and subsection scores. Separate analyses were
conducted for each subject area using first the STAT subsections and then using
just the STAT total scores. Analyses regarding mathematics course grades yielded
significant partial correlations for the creative subsection score (Pr = .44, p <
.01) and for the total STAT score (Pr = .62, p < .01), but the partial correlations
were not significant for the analytical (Pr = .14) and practical (Pr = .05) STAT
scores. Similarly, the regression analyses predicting students’ science course
grades yielded significant partial correlations for STAT total scores (Pr = .53,
p < .01) and for the creative subsection score (Pr = .42, p < .01), but not for
the analytical (Pr = .14) or practical (Pr = .06) STAT scores. Additionally, when
I performed the same analyses of social science course grades these yielded
significant partial correlations with STAT total scores (Pr = .54, p < .01) and
creative subsection scores (Pr = .34, p < .05) but not with analytical (Pr = .19) or
practical (Pr = .04) STAT scores. Finally, analyses yielded the same pattern for
foreign language course grades and STAT total and subsection scores. Regression
analyses yielded significant partial correlations for practical subsection scores
(Pr = .41, p < .02) and for total STAT scores (Pr = .61, p < .01). Thus, the total
STAT scores and creative subsection scores significantly predicted academic
achievement in mathematics, science, social science, and foreign language
courses, independent of multiple intelligences scores; however, the analytical
and practical subsection scores did not. Correspondingly, the partial correlations
between course grade (for mathematics, social science, science, and foreign
language) and the MII subsection scores, with the variation caused by the STAT
removed, were significant only for LMI (Pr = .70, p < .01) scores. This finding
indicates that, independent of the STAT, only LMI scores predicted achievement
in any subject area.
Discussion
The results in this study showed that STAT total scores were significantly
related to LI, LMI, and INTI scores. Analytical subsection STAT scores were
significantly related to LMI scores. Practical STAT subsection scores were
significantly correlated only with INTI scores. These results are based on the
partial correlations between multiple intelligences and STAT scores. However, I
limited the scope of this study to the students’ own preferences in regard to their
multiple intelligences. In future studies students’ intelligence types should be
assessed together with the performances of students on related intelligences for
different age groups and different subject areas. In the present study mathematics
course grades were significantly related to STAT total scores and to scores for the
STAT analytical, practical, and creative abilities subsections. Similarly, science,
social science, and foreign language course grades were significantly related to
the LI, LM, and INTI scores of the participants.
Results of multiple regression analyses indicated that total STAT scores
and creative ability scores significantly predicted academic achievement in
mathematics, social science, science, and foreign language learning, independent
of multiple intelligences scores; however, the analytical and practical ability
scores did not. These results are consistent with those reported by Sternberg et
al. (2001), who found that total STAT and creative ability scores significantly
predicted academic achievement. However, contrary to the findings reported
by Sternberg et al., in my study the analytical and practical ability scores did
not relate significantly to academic achievement. On the other hand, Koke and
Vernon (2003) reported that total STAT scores and only practical ability scores
predicted psychology course midterm grades of university students. All these
results might indicate that there may be cultural differences within the dominant
cognitive abilities represented in the national education systems of various
countries.
My results in this study also revealed that the partial correlation between
course grades for all of the subject areas and each of the MII subsection scores,
with the variation caused by the STAT removed, was significant for only the LMI
score. This indicates that, independent of the STAT, only LMI scores predicted
achievement in any subject area. It should also be noted that in this study the
students’ multiple intelligences scores were based on their own preferences for
the items representing various kinds of intelligences. In other words, the multiple
intelligences scores did not indicate the actual performance of the children in
each type of intelligence. I believe that it would be of value for future researchers
to test how well the STAT would predict academic achievement for scores on a
test in which students’ multiple intelligences scores were each taken into account
separately. The relationship between other tests and STAT scores could also be
examined with more heterogeneous sample groups.
References
Al-Balhan, E. M. (2006). Multiple intelligence styles in relation to improved academic performance
in Kuwaiti middle school reading. Digest of Middle East Studies, 15, 18-34. http://doi.org/
cd8zdh
Armstrong, T. (2009). Multiple intelligences in the classroom. Alexandria, VA: ASCD.
Brody, N. (2003). Construct validation of the Sternberg Triarchic Abilities Test: Comment and
reanalysis. Intelligence, 31, 319-329. http://doi.org/ffgmzb
Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. New York:
Cambridge University Press.
Castejón, J. L., Gilar, R., & Perez, N. (2008). From “g factor” to multiple intelligences: Theoretical
foundations and implications for classroom practice. In E. P. Velliotis (Ed.), Classroom culture
and dynamics (pp. 101-127). New York: Nova Science.
Douglas, O., Burton, K. S., & Reese-Durham, N. R. (2008). The effects of the multiple intelligence
teaching strategy on the academic achievement of eighth grade math students. Journal of
Instructional Psychology, 35, 182-187.
Gardner, H. (1993). Frames of mind: The theory of multiple intelligences. New York: Basic.
Gottfredson, L. S. (2002). g: Highly general and highly practical. In R. J. Sternberg & E. L.
Grigorenko (Eds.), The general intelligence factor: How general is it? (pp. 331-380). Mahwah,
NJ: Erlbaum.
Greenhawk, J. (1997). Multiple intelligences meet standards. Educational Leadership, 55, 62-64.
Jensen, A. R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger/Greenwood.
Koke, L. C., & Vernon, P. A. (2003). The Sternberg Triarchic Abilities Test (STAT) as a measure
of academic achievement and general intelligence. Personality and Individual Differences, 35,
1803-1807. http://doi.org/fmfpqb
McMahon, S. D., Rose, D., & Parks, M. (2004). Multiple intelligences and reading achievement:
An examination of the Teele Inventory of Multiple Intelligences. The Journal of Experimental
Education, 73, 41-52. http://doi.org/bwptfs
Mettetal, G., Jordan, C., & Harper, S. (1997). Attitude toward a multiple intelligences curriculum.
Journal of Educational Research, 91, 115-122. http://doi.org/dmsgds
Özdemir, P., Güneysu, S., & Tekkaya, C. (2006). Enhancing learning through multiple intelligences.
Journal of Biological Education, 40, 74-78. http://doi.org/fn2x6h
Reid, C., Romanoff, B., Algozzine, B., & Udall, A. (2000). An evaluation of alternative screening
procedures. Journal for the Education of the Gifted, 23, 378-396.
Saban, A. (2002). Öğrenme ve öğretme [Learning and teaching: New theories and approaches].
Ankara: Nobel.
Sternberg, R. J. (1985). Implicit theories of intelligence, creativity, and wisdom. Journal of
Personality and Social Psychology, 49, 607-627. http://doi.org/cstvmp
Sternberg, R. J. (1993). The Sternberg Triarchic Abilities Test. Unpublished manuscript.
Sternberg, R. J. (1997). The concept of intelligence and its role in lifelong learning and success.
American Psychologist, 52, 1030-1037. http://doi.org/dzxj2p
Sternberg, R. J. (1999a). Intelligence as developing expertise. Contemporary Educational Psychology,
24, 359-375. http://doi.org/dzvjsj
Sternberg, R. J. (1999b). The theory of successful intelligence. Review of General Psychology, 3,
292-316. http://doi.org/cqrkxh
Sternberg, R. J. (2000). The concept of intelligence. In R. J. Sternberg (Ed.), Handbook of intelligence
(pp. 3-13). New York: Cambridge University Press.
Sternberg, R. J. (2000). Practical intelligence in everyday life. New York: Cambridge University
Press.
Sternberg, R. J., Castejón, J. L., Prieto, M. D., Hautamäki, J., & Grigorenko, E. L. (2001).
Confirmatory factor analysis of the Sternberg Triarchic Abilities Test in three international
samples: An empirical test of the triarchic theory of intelligence. European Journal of
Psychological Assessment, 17, 1-16. http://doi.org/cn7tjp
Sternberg, R. J., Ferrari, M., Clinkenbeard, P. R., & Grigorenko, E. L. (1996). Identification,
instruction, and assessment of gifted children: A construct validation of a triarchic model. Gifted
Child Quarterly, 40, 129-137. http://doi.org/d3rf9w
Snyder, R. F. (1999). The relationship between learning styles/multiple intelligences and academic
achievement of high school students. High School Journal, 83, 11-20.
Journal of Clinical Child and Adolescent Psychology, 2005, Vol. 34, No. 3, 506-522
Copyright © 2005 by Lawrence Erlbaum Associates, Inc.
Evidence-Based Assessment of Learning Disabilities
in Children and Adolescents
Jack M. Fletcher
Department of Pediatrics and the Center for Academic and Reading Skills,
University of Texas Health Science Center at Houston
David J. Francis
Department of Psychology and the Texas Institute for Measurement, Evaluation and Statistics,
University of Houston
Robin D. Morris
Department of Psychology, Georgia State University
G. Reid Lyon
Child Development and Behavior Branch, National Institute of Child Health and Human Development
The reliability and validity of 4 approaches to the assessment of children and adoles-
cents with learning disabilities (LD) are reviewed, including models based on (a) ap-
titude-achievement discrepancies, (b) low achievement, (c) intra-individual differ-
ences, and (d) response to intervention (RTI). We identify serious psychometric
problems that affect the reliability of models based on aptitude-achievement discrep-
ancies and low achievement. There are also significant validity problems for models
based on aptitude-achievement discrepancies and intra-individual differences. Mod-
els that incorporate RTI have considerable potential for addressing both the reliabil-
ity and validity issues but cannot represent the sole criterion for LD identification. We
suggest that models incorporating both low achievement and RTI concepts have the
strongest evidence base and the most direct relation to treatment. The assessment of
children for LD must reflect a stronger underlying classification that takes into ac-
count relations with other childhood disorders as well as the reliability and validity of
the underlying classification and resultant assessment and identification system. The
implications of this type of model for clinical assessments of children for whom LD is
a concern are discussed.
Assessment methods for identifying children and
adolescents with learning disabilities (LD) are mul-
tiple, varied, and the subject of heated debates among
practitioners. Those debates involve issues that extend
beyond the value of specific tests, often reflecting dif-
ferent views of how LD is best identified. These views
reflect variations in the definition of LD and, therefore,
variations in what measures are selected to opera-
tionalize the definition (Fletcher, Foorman, et al.,
2002). Any focus on the "best tests" leads to a hopeless
morass of confusion in an area such as LD that has not
successfully addressed the classification and definition
issues that lead to identification of who does and who
does not possess characteristics of LD.

Author note: Grants from the National Institute of Child Health and Human Development, P50 21888, Center for Learning and Attention Disorders, and National Science Foundation 9979968, Early Reading Development: A Cognitive Neuroscience Approach, supported this article. We gratefully acknowledge contributions of Rita Taylor to preparation of this article. Requests for reprints should be sent to Jack M. Fletcher, Department of Pediatrics, University of Texas Health Science Center at Houston, 7000 Fannin Street, UCT 2478, Houston, TX 77030. E-mail: [email protected]

Definitions
always reflect an implicit classification indicating how
different constructs are measured and used to identify
members of the class in terms of similarities and differ-
ences relative to other entities that are not considered
members of the class (Morris & Fletcher, 1988). For
LD, children who are members of this class are his-
torically differentiated from children who have other
achievement-related difficulties, such as mental retar-
dation, sensory disorders, emotional or behavioral dis-
turbances, and environmental causes of underachieve-
ment, including economic disadvantage, minority
language status, and inadequate instruction (Fletcher,
Francis, Rourke, Shaywitz, & Shaywitz, 1993; Lyon,
Fletcher, & Barnes, 2003). If the classification is valid,
children with LD may share characteristics that are
similar with other groups of underachievers, but they
should also differ in ways that can be measured and
that can serve to define and operationalize the class of
children and adolescents with LD.
In this article, we consider evidence-based ap-
proaches to the assessment of LD in the context of differ-
ent approaches to the classification and identification of
LD. We argue that the measurement systems that are
used to identify children and adolescents with LD are in-
separable from the classifications from which the identi-
fication criteria evolve. Moreover, all measurement sys-
tems are imperfect attempts to measure a construct (LD)
that operates as a latent variable that is unknowable in-
dependently of how it is measured and therefore of how
LD is classified. The construct of LD is imperfectly
measured simply because the measurement tools them-
selves are not error free (Francis et al., 2005). Different
approaches to classification and definition capitalize on
this error of measurement in ways that reduce or in-
crease the reliability of the classification itself. Simi-
larly, evaluating similarities and differences among
groups of students who are identified as LD and not LD
is a test of the validity of the underlying classification, so
long as the variables used to assess this form of validity
are not the same as those used for identification (Morris
& Fletcher, 1988). As with any form of validity, ade-
quate reliability is essential. Classifications can be reli-
able and still lack validity. The converse is not true; they
cannot be valid and lack reliability. A valid classifica-
tion of LD predicts important characteristics of the
group. Consistent with the spirit of this special section,
the most important characteristic is whether the classifi-
cation is meaningfully related to intervention. For LD, a
classification should also predict a variety of differences
on cognitive skills, behavioral attributes, and achieve-
ment variables not used to form the classification,
developmental course, response to intervention (RTI),
neurobiological variables, or prognosis (Fletcher, Lyon,
et al., 2002).
To address these issues, we consider the reliability
and validity of four approaches to the classification and
assessment of LD: (a) IQ discrepancy and other forms
of aptitude-achievement discrepancy, (b) low achieve-
ment, (c) intra-individual differences, and (d) models
incorporating RTI and some form of curriculum-based
measurement. We consider how each classification re-
flects the historically prominent concept of "unex-
pected underachievement" as the key construct in LD
assessment (Lyon et al., 2001), that is, what many early
observers characterized as a group of children unable
to master academic skills despite the absence of known
causes of poor achievement (sensory disorder, mental
retardation, emotional disturbances, economic disad-
vantages, inadequate instruction). From this perspec-
tive, a valid classification and measurement system for
LD must identify a unique group of underachievers
that is clearly differentiated from groups with other
forms of underachievement.
Defining LD
Historically, definition and classification issues
have haunted the field of LD. As reviewed in Lyon et
al. (2001), most early conceptualizations viewed LD
simply as a form of "unexpected" underachievement.
The primary approach to assessment involved the iden-
tification of intra-individual variability as a marker for
the unexpectedness of LD, along with the exclusion of
other causes of underachievement that would be ex-
pected to produce underachievement. This type of defi-
nition was explicitly coded into U.S. federal statutes
when LD was identified as an eligibility category for
special education in Public Law 94-142 in 1975; es-
sentially the same definition is part of current U.S. fed-
eral statutes in the Individuals with Disabilities Educa-
tion Act (1997).
The U.S. statutory definition of LD is essentially a
set of concepts that in itself is difficult to operation-
alize. In 1977, recommendations for operationalizing
the federal definition of LD were provided to states af-
ter passage of Public Law 94-142 to help identify chil-
dren in this category of special education (U. S. Office
of Education, 1977). In these regulations, LD was
defined as a heterogeneous group of seven disorders
(oral language, listening comprehension, basic read-
ing, reading comprehension, math calculations, math
reasoning, written language) with a common marker of
intra-individual variability represented by a discrep-
ancy between IQ and achievement (i.e., unexpected
underachievement). Unexpectedness was also indi-
cated by maintaining the exclusionary criteria present
in the statutory definition that presumably lead to ex-
pected underachievement. Other parts of the regula-
tions emphasize the need to ensure that the child's edu-
cational program provided adequate opportunity to
learn. No recommendations were made concerning the
assessment of psychological processes, most likely be-
cause it was not clear that reliable methods existed
for assessing processing skills and because the field
was not clear on what processes should be assessed
(Reschly, Hosp, & Smied, 2003).
This approach to definition is now widely imple-
mented with substantial variability across schools, dis-
tricts, and states in which students are served in special
education as LD (MacMillan & Siperstein, 2002; Mer-
cer, Jordan, Allsop, & Mercer, 1996; Reschly et al.,
2003). It is also the basis for assessments of LD outside
of schools. Consider, for example, the definition of read-
ing disorders in the Diagnostic and Statistical Manual
of Mental Disorders (4th ed.; American Psychiatric As-
sociation, 1994), which indicates that the student must
perform below levels expected for age and IQ, and spec-
ifies only sensory disorders as exclusionary:
A. Reading achievement, as measured by individ-
ually administered standardized tests of read-
ing accuracy or comprehension, is substan-
tially below that expected given the person's
chronological age, measured intelligence, and
age-appropriate education.
B. The disturbance in Criterion A significantly in-
terferes with academic achievement or activi-
ties of daily living that require reading skills.
C. If a sensory deficit is present, the reading diffi-
culties are in excess of those usually associated
with it.
The International Classification of Diseases-10 has
a similar definition. It differs largely in being more spe-
cific in requiring use of a regression-adjusted discrep-
ancy, specifying cut points (achievement two standard
errors below IQ) for identifying a child with LD, and
expanding the range of exclusions.
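As a concrete illustration of a regression-adjusted discrepancy of the kind just described, the short sketch below predicts achievement from IQ using an assumed IQ-achievement correlation and flags a case whose observed achievement falls two standard errors of estimate below the prediction. The correlation and the scores are illustrative assumptions, not values taken from either classification system.

    # Both tests expressed as standard scores (mean 100, SD 15).
    r_xy = 0.60                                        # assumed IQ-achievement correlation
    iq, achievement = 115, 82                          # hypothetical case

    predicted = 100 + r_xy * (iq - 100)                # regression-based expectation
    se_est = 15 * (1 - r_xy ** 2) ** 0.5               # standard error of estimate
    discrepancy = (achievement - predicted) / se_est   # discrepancy in standard-error units

    meets_cut = discrepancy <= -2.0                    # two-standard-error criterion
    print(round(predicted, 1), round(discrepancy, 2), meets_cut)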
Although these definitions are used in what are of-
ten disparate realms of practice, they lead to similar ap-
proaches to the identification of children and adoles-
cents as LD. Across these realms, children commonly
receive IQ and achievement tests. The IQ test is com-
monly interpreted as an aptitude measure or index
against which achievement is compared. Different
achievement tests are used because LD may affect
achievement in reading, math, or written language. The
heterogeneity is recognized explicitly in the U.S. statu-
tory and regulatory definitions of LD (Individuals With
Disabihties Education Act, 1997) and in the psychi-
atric classifications by the provision of separate defini-
tions for each academic domain. However, it is still
essentially the same definition applied in different do-
mains. In many settings, this basic assessment is sup-
plemented with tests of processing skills derived from
multiple perspectives (neuropsychology, information
processing, and theories of LD). The approach boils
down to administration of a battery of tests to identify
LD, presumably with treatment implications.
Underlying Classification Hypotheses
Implicit in all these definitions are slight variations
on a classification model of individuals with LD as
those who show a measurable discrepancy in some but
not all domains of skill development and who are not
identified into another subgroup of poor achievers. In
some instances, the discrepancy is quantified with two
tests in an aptitude-achievement model epitomized by
the IQ-discrepancy approach in the U.S. federal regu-
latory definition and the psychiatric classifications of
the Diagnostic and Statistical Manual of Mental Dis-
orders (4th ed.; American Psychiatric Association,
1994) and the International Classification of Dis-
eases-10. Here the classification model implicitly stip-
ulates that those who meet an IQ-discrepancy
inclusionary criterion are different in meaningful ways
from those who are underachievers and do not meet the
discrepancy criteria or criteria for one of the exclu-
sionary conditions. Some have argued that this model
lacks validity and propose that LD is synonymous with
underachievement, so that it should be identified solely
by achievement tests (Siegel, 1992), often with some
exclusionary criteria to help ensure that the achieve-
ment problem is unexpected. Thus, the contrast is re-
ally between a two-test aptitude-achievement discrep-
ancy and a one-test chronological age-achievement
discrepancy with achievement low relative to age-
based (or grade-based) expectations. If processing
measures are added, the model becomes a multitest
discrepancy model. Identification of a child as LD in
all three of these models is typically based on assess-
ment at a single point in time, so we refer to them as
"status" models. Finally, RTI models emphasize the
"adequate opportunity to learn" exclusionary criterion
by assessing the child's response to different instruc-
tional efforts over time with frequent brief assess-
ments, that is, a "change" model. The child who is LD
becomes one who demonstrates intractability in learn-
ing characteristics by not responding adequately to in-
struction that is effective with most other students.
Dimensional Nature of LD
Each of these four models can be evaluated for reli-
ability and validity. Unexpected underachievement, a
concept critically important to the validity of the under-
lying construct of LD, can also be examined. The reli-
ability issues are similar across the first three models
and stem from the dimensional nature of LD. Most pop-
ulation-based studies have shown that reading and math
skills are normally distributed (Jorm, Share, Matthews,
& Matthews, 1986; Lewis, Hitch, & Walker, 1994;
Rodgers, 1983; Shalev, Auerbach, Manor, & Gross-
Tsur, 2000; Shaywitz, Escobar, Shaywitz, Fletcher, &
Makuch, 1992; Silva, McGee, & Williams, 1985).
These findings are buttressed by behavioral genetic
studies, which are not consistent with the presence of
qualitatively different characteristics associated with
the heritability of reading and math disorders (Fisher &
DeFries, 2002; Gilger, 2002). As dimensional traits that
exist on a continuum, there would be no expectation of
natural cut points that differentiate individuals with LD
from those who are underachievers but not identified as
LD (Shaywitz et al., 1992).
The unobservable nature of LD makes two-test
and one-test discrepancy models unreliable in ways
that are psychometrically predictable but not in ways
that simply equate LD with poor achievement (Fran-
cis et al., 2005; Stuebing et al., 2002). The problem is
that the measurement approach is based on a static
assessment model that possesses insufficient informa-
tion about the underlying construct to allow for reli-
able classifications of individuals along what is es-
sentially an unobservable dimension. If LD was a
manifest concept that was directly observable in the
behavior of affected individuals, or if there were nat-
ural discontinuities that represented a qualitative
breakpoint in the distribution of achievement skills or
the cognitive skills on which achievement depends,
this problem would be less of an obstacle. However,
like achievement or intelligence, LD is a latent con-
struct that must be inferred from the pattern of perfor-
mance on directly observable operationalizations of
other latent constructs (namely, test scores that index
constructs like reading achievement, phonological
awareness, aptitude, and so on). The more informa-
tion available to support the inference of LD, the
more reliable (and valid) that inference becomes, thus
supporting the fine-grained distinctions necessitated
by two-test and one-test discrepancy models. To the
extent that the latent construct, LD, is categorical, by
which we mean that the construct indexes different
classes of learners (i.e., children who learn differ-
ently) as opposed to simply different levels of
achievement, then systems of identification that rely
on one measurable variable lack sufficient informa-
tion to identify the latent classes and assign individu-
als to those classes without placing additional,
untestable, and unsupportable constraints on the sys-
tem. It is simply not possible to use a single mean
and standard deviation and to estimate separate
means and standard deviations for two (or more)
unobservable latent classes of individuals and deter-
mine the percentage of individuals falling into each
class, let alone to classify specific individuals into
those classes. Without constraints, such as specifying
the magnitude of differences in the means of the la-
tent classes, the ratio of standard deviations, and the
odds of membership in the two (or more) classes, the
system is under-identified, which simply means that
there are many different solutions that cannot be dis-
tinguished from one another.
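A small simulation makes the under-identification point concrete: two quite different two-class structures can produce observed distributions with essentially the same mean and standard deviation, so those two summary statistics alone cannot recover the latent classes. The class means, proportions, and sample size below are arbitrary assumptions chosen only for illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100_000

    # Structure 1: 90% typical learners, 10% in a latent class centered at 88.
    a = np.concatenate([rng.normal(100, 15, int(0.9 * n)),
                        rng.normal(88, 15, int(0.1 * n))])

    # Structure 2: 70% typical learners, 30% in a latent class centered at 93.7.
    b = np.concatenate([rng.normal(101, 15, int(0.7 * n)),
                        rng.normal(93.7, 15, int(0.3 * n))])

    # The observed means and standard deviations are nearly identical, so a
    # single mean and SD cannot tell which latent structure generated them.
    print(a.mean(), a.std(), b.mean(), b.std())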
When the system is under-identified, the only solu-
tion is to expand the measurement system to increase the
number of observed relations, which in one sense is
what intra-individual difference models attempt by add-
ing assessments of processing skills. Other criteria are
necessary because it is impossible to uniquely identify a
distinct subgroup of underachieving individuals consis-
tent with the construct of LD when identification is
based on a single assessment at a single time point.
Adding external criteria, such as an aptitude measure or
multiple assessments of processing skills, increases the
dimensionality of the measurement system and makes
latent classification more feasible, even when the other
criteria are themselves imperfect. But the main issues
for one-test, two-test, and multitest identification mod-
els involve the reliability of the underlying classifica-
tions and whether they identify a unique subgroup of un-
derachievers. In the next section, we examine variations
in reliability and validity for each of these models, fo-
cusing on the importance of reliability, as the validity of
the classifications can be no stronger than their reliability.
Models Based on Two-Test
Discrepancies
Although the IQ-discrepancy model is the most
widely utilized approach to identifying LD, there are
many different ways to operationalize the model. For
example, some implementations are based on a com-
posite IQ score, whereas others utilize either a verbal
or nonverbal IQ score. Other approaches drop IQ as the
aptitude measure and use a measure such as listening
comprehension. In the validity section, we discuss
each of these approaches. The reliability issues are
similar for each example of an aptitude-achievement
discrepancy.
Reliability
Specific reliability problems for two-test discrep-
ancy models pertain to any comparison of two corre-
lated assessments that involve the determination of a
child's performance relative to a cut point on a continu-
ous distribution. Discrepancy involves the calculation
of a difference score (D) to estimate the true difference
(Δ) between two latent constructs. Thus, discussions
about discrepancy must distinguish between problems
with the manifest (i.e., observed) difference (D) as an
index of the true difference (Δ) but also must consider
whether the true difference (Δ) reflects the construct of
interest. Problems with the reliability of D based on
differences between two tests are well known, albeit
not in the LD context (Bereiter, 1967). However, there
is nothing that fundamentally limits the applicability of
this research to LD if we are willing to accept a notion
of Δ as a marker for LD. There are major problems
with this assumption that are reviewed in Francis et al.
(2005). The most significant is regression to the mean.
On average, regression to the mean indicates that
scores that are above the mean will be lower when the
test is repeated or when a second correlated test is used
to compute D. In this example, individuals who have
IQ scores above the mean will obtain achievement test
scores that, on average, will be lower than the IQ test
score because the achievement score will move toward
the mean. The opposite is true for individuals with IQ
scores below the mean. This leads to the paradox of
children with achievement scores that exceed IQ, or the
identification of low-achieving, higher IQ children
with achievement above the average range as LD.
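A brief simulation illustrates the regression effect described above; the correlation value is an assumption chosen for demonstration rather than a figure from the article. Because IQ and achievement correlate imperfectly, children selected for above-average IQ have achievement scores that fall, on average, well below their IQ scores even when nothing is wrong with their learning.

    import numpy as np

    rng = np.random.default_rng(2)
    n, r = 50_000, 0.6                                  # assumed IQ-achievement correlation

    z_iq = rng.normal(size=n)
    z_ach = r * z_iq + np.sqrt(1 - r ** 2) * rng.normal(size=n)

    iq = 100 + 15 * z_iq                                # rescale to the IQ metric
    ach = 100 + 15 * z_ach

    high_iq = iq > 115
    # Mean achievement of the high-IQ group sits closer to 100 than to their
    # mean IQ, so simple IQ-achievement differences over-identify them as LD.
    print(iq[high_iq].mean(), ach[high_iq].mean())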
Although adjusting for the correlation of IQ and
achievement helps correct for regression effects (Rey-
nolds, 1984-1985), unreliability also stems from the
attempt to assess a person's standing relative to a cut
point on a continuous distribution. As discussed in the
following section on low achievement models, this
problem makes identification with a single test—even
one with small amounts of measurement error—poten-
tially unreliable, a problem for any status model.
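The difference-score problem flagged above (Bereiter, 1967) can also be expressed with the classical test theory formula for the reliability of a difference between two equally scaled measures. The reliabilities and correlation below are illustrative values, not figures reported by Francis et al. (2005).

    def difference_score_reliability(r_xx, r_yy, r_xy):
        # Classical reliability of D = X - Y for tests with equal variances,
        # reliabilities r_xx and r_yy, and intercorrelation r_xy.
        return (0.5 * (r_xx + r_yy) - r_xy) / (1 - r_xy)

    # Two reliable tests (.90 each) correlating .60 yield a difference score
    # with reliability of only .75; the higher the correlation, the worse it gets.
    print(difference_score_reliability(0.90, 0.90, 0.60))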
None of this discussion addresses the validity ques-
tion concerning Δ. Specifically, does Δ embody LD as
we would want to conceptualize it (e.g., as unexpected
underachievement), or is Δ merely a convenient con-
ceptualization of LD because it is a conceptualization
that leads directly to easily implemented, operational
definitions, however flawed they might be?
Validity
The validity of the IQ-discrepancy model has been
extensively studied. Two independent meta-analyses
have shown that effect sizes on measures of achieve-
ment and cognitive functions are in the negligible to
small range (at best) for the comparison of groups
formed on the basis of discrepancies between IQ and
reading achievement versus poor readers without an IQ
discrepancy (Hoskyn & Swanson, 2000; Stuebing et
al., 2002), findings similar to studies not included in
these meta-analyses (Stanovich & Siegel, 1994). Other
validity studies have not found that discrepant and
nondiscrepant poor readers differ in long-term prog-
nosis (Francis, Shaywitz, Stuebing, Shaywitz, & Flet-
cher, 1996; Silva et al., 1985), response to instruction
(Fletcher, Lyon, et al., 2002; Jimenez et al., 2003;
Stage, Abbott, Jenkins, & Berninger, 2003; Vellutino,
Scanlon, & Jaccard, 2003), or neuroimaging correlates
(Lyon et al., 2003; but also see Shaywitz et al., 2003,
which shows differences in groups varying in IQ but
not IQ discrepancy). Studies of genetic variability
show negligible to small differences related to IQ-dis-
crepancy models that may reflect regression to the
mean (Pennington, Gilger, Olson, & DeFries, 1992;
Wadsworth, Olson, Pennington, & DeFries, 2000).
Similar empirical evidence has been reported for LD in
math and language (Fletcher, Lyon, et al., 2002;
Mazzocco & Myers, 2003). This is not surprising given
that the problems are inherent in the underlying
psychometric model and have little to do with the spe-
cific measures involved in the model except to the ex-
tent that specific test reliabilities and intertest correla-
tions enter into the equations.
Despite the evidence of weak validity for the practice
of differentiating discrepant and nondiscrepant stu-
dents, alternatives based on discrepancy models con-
tinue to be proposed, and psychologists outside of
schools commonly implement this flawed model. How-
ever, given the reliability problems inherent in IQ dis-
crepancy models, it is not surprising that these other at-
tempts to operationalize aptitude-achievement
discrepancy have not met with success. In the Stuebing
et al. (2002) meta-analysis, 32 of the 46 major studies
had a clearly defined aptitude measure. Of these studies,
19 used Full Scale IQ, 8 used Verbal IQ, 4 used Perfor-
mance IQ, and 1 study used a discrepancy of listening
comprehension and reading comprehension. Not sur-
prisingly, these different discrepancy models did not
yield results that were different from those when a com-
posite IQ measure was utilized. Neither Fletcher et al.
(1994) nor Aaron, Kuchta, and Grapenthin (1988) were
able to demonstrate major differences between discrep-
ant and low achievement groups formed on the basis of
listening comprehension and reading comprehension.
The differences in these models involve slight
changes in who is identified as discrepant or low
achieving depending on the cut point and the correla-
tion of the aptitude and achievement measures. The
changes simply reflect fluctuations around the cut
point where children are most similar. It is not surpris-
ing that effect sizes comparing poor achievers with and
without IQ discrepancies are uniformly low across
these different models. Current practices based on this
approach to identification of LD epitomized by the
federal regulatory definition and psychiatric classifica-
tions are fundamentally flawed.
One-Test (Low Achievement) Models
Reliability
The measurement problems that emerge when a
specific cut point is used for identification purposes af-
fect any psychometric approach to LD identification.
These problems are more significant when the test
score is not criterion referenced, or when the score dis-
tributions have been smoothed to create a normal uni-
variate distribution. To reiterate, the presence of a natu-
ral breakpoint in the score distribution, typically
observed in multimodal distributions, would make it
simple to validate cut points. But natural breaks are not
usually apparent in achievement distributions because
reading and math achievement distributions are nor-
mal. Thus, LD is essentially a dimensional trait, or a
variation on normal development.
Regardless of normality, measurement error attends
any psychometric procedure and affects cut points in a
normal distribution (Shepard, 1980). Because of mea-
surement error, any cut point set on the observed distri-
bution will lead to instability in the identification of
class members because observed test scores will fluc-
tuate around the cut point with repeated testing or use
of an alternative measure of the same construct (e.g.,
two reading tests). This fluctuation is not just a prob-
lem of correlated tests or simply a matter of setting
better cut scores or developing better tests. Rather, no
single observed test score can capture perfectly a stu-
dent's ability on an imperfectly measured latent vari-
able. The fluctuation in identifications will vary across
different tests, depending in part on the measurement
error. In both real and simulated data sets, fluctuations
in up to 35% of cases are found when a single test is
used to identify a cut point. Similar problems are ap-
parent if a two-test discrepancy model is used (Francis
et al., 2005; Shaywitz et al., 1992).
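A toy simulation of parallel forms shows how readily cases near an arbitrary cut point change status on retesting; the reliability and cut point below are assumptions for illustration and are not taken from the studies cited.

    import numpy as np

    rng = np.random.default_rng(3)
    n, reliability = 100_000, 0.85                      # assumed parallel-forms reliability

    true_score = rng.normal(size=n)
    err_sd = np.sqrt(1 / reliability - 1)               # error SD giving that reliability
    form_a = true_score + err_sd * rng.normal(size=n)
    form_b = true_score + err_sd * rng.normal(size=n)

    cut = np.percentile(form_a, 25)                     # e.g., a 25th-percentile cut point
    below_a, below_b = form_a < cut, form_b < cut
    flipped = (below_a != below_b).mean()
    print(f"{flipped:.1%} of examinees change classification across forms")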
This problem is less of an issue for research, which
rarely hinges on the identification of individual chil-
dren. Thus, it does not have great impact on the validity
of a low achievement classification because, on aver-
age, children around the cut point who may be fluctuat-
ing in and out of the class of interest with repeated test-
ing are not very different. However, the problems for
an individual child who is being considered for special
education placement or a psychiatric diagnosis are ob-
vious. A positive identification in either example often
carries a poor prognosis.
Validity
Models based on the use of achievement markers
can be shown to have a great deal of validity (see
Fletcher, Lyon, et al., 2002; Fletcher, Morris, & Lyon,
2003; Siegel, 1992). In this respect, if groups are
formed such that the participants do not meet criteria
for mental retardation and have achievement scores
that are below the 25th percentile, a variety of compari-
sons show that subgroups of underachievers emerge
that can be validly differentiated on external variables
and help demonstrate the viability of the construct of
LD. For example, if children with reading and math
disabilities identified in this manner are compared to
typical achievers, it is possible to show that these three
groups display different cognitive correlates. In addi-
tion, neurobiological studies show that these groups
differ both in the neural correlates of reading and math
performance as well as the heritability of reading and
math disorders (Lyon et al., 2003). These achievement
subgroups, which by definition include children who
meet either low achievement or IQ-discrepancy crite-
ria, even differ in RTI, providing strong evidence for
"aptitude by treatment" interactions; math interven-
tions provided for children with reading problems are
demonstrably ineffective, and vice versa.
Despite this evidence for validity, concerns emerge
about definitions based solely on achievement cut
points. Simply utilizing a low achievement definition,
even when different exclusionary criteria are applied,
does not operationalize the true meaning of unexpected
underachievement. Although such an approach to
identification is deceptively simple, …
Neuron
Article
Fractionating Human Intelligence
Adam Hampshire,1,* Roger R. Highfield,2 Beth L. Parkin,1 and Adrian M. Owen1
1The Brain and Mind Institute, The Natural Sciences Centre, Department of Psychology, The University of Western Ontario,
London ON, N6A 5B7, Canada
2Science Museum, Exhibition Road, London SW72DD, UK
*Correspondence: [email protected]
http://dx.doi.org/10.1016/j.neuron.2012.06.022
SUMMARY
What makes one person more intellectually able
than another? Can the entire distribution of human
intelligence be accounted for by just one general
factor? Is intelligence supported by a single neural
system? Here, we provide a perspective on human
intelligence that takes into account how general
abilities or ‘‘factors’’ reflect the functional organiza-
tion of the brain. By comparing factor models of
individual differences in performance with factor
models of brain functional organization, we demon-
strate that different components of intelligence
have their analogs in distinct brain networks. Using
simulations based on neuroimaging data, we show
that the higher-order factor ‘‘g’’ is accounted for
by cognitive tasks corecruiting multiple networks.
Finally, we confirm the independence of these com-
ponents of intelligence by dissociating them using
questionnaire variables. We propose that intelli-
gence is an emergent property of anatomically
distinct cognitive systems, each of which has its
own capacity.
INTRODUCTION
Few topics in psychology are as old or as controversial as
the study of human intelligence. In 1904, Charles Spearman
famously observed that performance was correlated across
a spectrum of seemingly unrelated tasks (Spearman, 1904).
He proposed that a dominant general factor ‘‘g’’ accounts for
correlations in performance between all cognitive tasks, with
residual differences across tasks reflecting task-specific fac-
tors. More controversially, on the basis of subsequent attempts
to measure ‘‘g’’ using tests that generate an intelligence quotient
(IQ), it has been suggested that population variables including
gender (Irwing and Lynn, 2005; Lynn, 1999), class (Burt, 1959,
1961; McManus, 2004), and race (Rushton and Jensen, 2005)
correlate with ‘‘g’’ and, by extension, with one’s genetically pre-
determined potential. It remains unclear, however, whether
population differences in intelligence test scores are driven by
heritable factors or by other correlated demographic variables
such as socioeconomic status, education level, and motivation
(Gould, 1981; Horn and Cattell, 1966). More relevantly, it is
questionable whether they relate to a unitary intelligence factor,
as opposed to a bias in testing paradigms toward particular
components of a more complex intelligence construct (Gould,
1981; Horn and Cattell, 1966; Mackintosh, 1998). Indeed, over
the past 100 years, there has been much debate over whether
general intelligence is unitary or composed of multiple factors
(Carroll, 1993; Cattell, 1949; Cattell and Horn, 1978; Johnson
and Bouchard, 2005). This debate is driven by the observation
that test measures tend to form distinctive clusters. When
combined with the intractability of developing tests that mea-
sure individual cognitive processes, it is likely that a more
complex set of factors contribute to correlations in performance
(Carroll, 1993).
Defining the biological basis of these factors remains a
challenge, however, due in part to the limitations of behavioral
factor analyses. More specifically, behavioral factor analyses
do not provide an unambiguous model of the underlying cogni-
tive architecture, as the factors themselves are inaccessible,
being measured indirectly by estimating linear components
from correlations between the performance measures of dif-
ferent tests. Thus, for a given set of behavioral correlations, there
are many factor solutions of varying degrees of complexity, all
of which are equally able to account for the data. This ambiguity
is typically resolved by selecting a simple and interpretable
factor solution. However, interpretability does not necessarily
equate to biological reality. Furthermore, the accuracy of any
factor model depends on the collection of a large number of pop-
ulation measures. Consequently, the classical approach to intel-
ligence testing is hampered by the logistical requirements of pen
and paper testing. It would appear, therefore, that the classical
approach to behavioral factor analysis is near the limit of its
resolution.
Neuroimaging has the potential to provide additional con-
straint to behavioral factor models by leveraging the spatial
segregation of functional brain networks. For example, if one
homogeneous system supports all intelligence processes, then
a common network of brain regions should be recruited when-
ever difficulty increases across all cognitive tasks, regardless
of the exact stimulus, response, or cognitive process that is
manipulated. Conversely, if intelligence is supported by multiple
specialized systems, anatomically distinct brain networks
should be recruited when tasks that load on distinct intelligence
factors are undertaken. On the surface, neuroimaging results
accord well with the former account. Thus, a common set of
frontal and parietal brain regions is rendered when peak activa-
tion coordinates from a broad range of tasks that parametrically
modulate difficulty are smoothed and averaged (Duncan and
Owen, 2000). The same set of multiple demand (MD) regions is
activated during tasks that load on ‘‘g’’ (Duncan, 2005; Jung
and Haier, 2007), while the level of activation within frontoparietal
cortex correlates with individual differences in IQ score (Gray
et al., 2003). Critically, after brain damage, the size of the lesion
within, but not outside of, MD cortex is correlated with the esti-
mated drop in IQ (Woolgar et al., 2010). However, these results
should not necessarily be equated with a proof that intelligence
is unitary. More specifically, if intelligence is formed from multiple
cognitive systems and one looks for brain responses during
tasks that weigh most heavily on the ‘‘g’’ factor, one will most
likely corecruit all of those functionally distinct systems. Similarly,
by rendering brain activation based on many task demands,
one will have the statistical power to render the networks
that are most commonly recruited, even if they are not always
corecruited. Indeed, there is mounting evidence demonstrating
that different MD regions respond when distinct cognitive
demands are manipulated (Corbetta and Shulman, 2002;
D’Esposito et al., 1999; Hampshire and Owen, 2006; Hampshire
et al., 2008, 2011; Koechlin et al., 2003; Owen et al., 1996; Pet-
rides, 2005). However, such a vast array of highly specific func-
tional dissociations have been proposed in the neuroimaging
literature as a whole that they often lack credibility, as they fail
to account for the broader involvement of the same brain regions
in other aspects of cognition (Duncan and Owen, 2000; Hamp-
shire et al., 2010). The question remains, therefore, whether intel-
ligence is supported by one or multiple systems, and if the latter
is the case, which cognitive processes those systems can most
broadly be described as supporting. Furthermore, even if
multiple functionally distinct brain networks contribute to intelli-
gence, it is unknown whether the capacities of those networks
are independent or are related to the same set of diffuse biolog-
ical factors that modulate general neural efficiency. It is unclear,
therefore, whether the pattern of individual differences in intelli-
gence reflects the functional organization of the brain.
Here, we address the question of whether human intelligence
is best conceived of as an emergent property of functionally
distinct brain networks using factor analyses of brain imag-
ing, behavioral, and simulated data. First, we break MD cortex
down into its constituent functional networks by factor
analyzing regional activation levels during the performance of
12 challenging cognitive tasks. Then, we build a model, based
on the extent to which the different functional networks are
recruited during the performance of those 12 tasks, and deter-
mine how well that model accounts for cross-task correlations
in performance in a large (n = 44,600) population sample.
Factor solutions, generated from brain imaging and behavioral
data, are compared directly, to answer the question of whether
the same set of cognitive entities is evident in the functional
organization of the brain and in individual differences in perfor-
mance. Simulations, based on the imaging data, are used to
determine the extent to which correlations between first-order
behavioral components are predicted by cognitive tasks re-
cruiting multiple functional brain networks, and the extent to
which those correlations may be accounted for by a spatially
diffuse general factor. Finally, we examine whether the behav-
ioral components of intelligence show a degree of indepen-
dence, as evidenced by dissociable correlations with the types
of questionnaire variable that ‘‘g’’ has historically been associ-
ated with.
RESULTS AND DISCUSSION
Identifying Functional Networks within MD Cortex
Sixteen healthy young participants undertook the cognitive
battery in the MRI scanner. The cognitive battery consisted of
12 tasks, which, based on well-established paradigms from
the neuropsychology literature, measured a range of the types
of planning, reasoning, attentional, and working memory skills
that are considered akin to general intelligence (see Supple-
mental Experimental Procedures available online). The activation
level of each voxel within MD cortex was calculated separately
for each task relative to a resting baseline using general linear
modeling (see Supplemental Experimental Procedures) and the
resultant values were averaged across participants to remove
between-subject variability in activation—for example, due to
individual differences in regional signal intensity.
The question of how many functionally distinct networks were
apparent within MD cortex was addressed using exploratory
factor analysis. Voxels within MD cortex (Figure 1A) were
transformed into 12 vectors, one for each task, and these were
examined using principal components analysis (PCA), a factor
analysis technique that extracts orthogonal linear components
from the 12-by-12 matrix of task-task bivariate correlations.
The results revealed two ‘‘significant’’ principal components,
each of which explained more variability in brain activation than
was contributed by any one task. These components accounted
for ~90% of the total variance in task-related activation across
MD cortex (Table S1). After orthogonal rotation with the Varimax
algorithm, the strengths of the task-component loadings were
highly variable and easily comprehensible (Table 1 and Figure 1B).
Specifically, all of the tasks in which information had to be actively
maintained in short-term memory, for example, spatial working
memory, digit span, and visuospatial working memory, loaded
heavily on one component (MDwm). Conversely, all of the tasks
in which information had to be transformed in mind according
to logical rules, for example, deductive reasoning, grammatical
reasoning, spatial rotations, and color-word remapping, loaded
heavily on the other component (MDr). When factor scores
were generated at each voxel using regression and projected
back onto the brain, two clearly defined functional networks
were rendered (Figure 1D). Thus, the insula/frontal operculum
(IFO), the superior frontal sulcus (SFS), and the ventral portion
of the anterior cingulate cortex/ presupplementary motor area
(ACC/preSMA) had greater MDwm component scores, whereas
the inferior frontal sulcus (IFS), inferior parietal cortex (IPC), and
the dorsal portion of the ACC/preSMA had greater MDr compo-
nent scores. When the PCA was rerun with spherical regions of
interest (ROIs) centered on each MD subregion, with radii that
varied from 10 to 25 mm in 5 mm steps and excluding voxels
that were on average deactivated, the task loadings correlated
with those from the MD mask at r > 0.95 for both components
and at all radii. Thus, the PCA solution was robust against varia-
tions in the extent of the ROIs. When data from the whole brain
were analyzed using the same method, three significant compo-
nents were generated, the first two of which correlated with those
from the MD cortex analysis (MDr r = 0.76, MDwm r = 0.83),
demonstrating that these were the most prominent active-state
networks in the brain.

Figure 1. Factor Analyzing Functional Brain Imaging Data from within Multiple Demand Cortex
(A) The MD cortex ROIs.
(B) PCA of the average activation patterns within MD cortex for each task (x axis reports task-component loading).
(C) PCA with each individual's data included as separate columns (error bars report SEM).
(D) Component scores from the analysis of MD task-related activations averaged across individuals. Voxels that loaded more heavily on the MDwm component are displayed in red. Voxels that loaded more heavily on the MDr network are displayed in blue.
(E) T contrasts of component scores against zero from the PCA with individual data concatenated into 12 columns (FDR corrected at p < 0.05 for all MD voxels).

The factor solution was also reliable at the individual subject level. Rerunning the same PCA on each
individual’s data generated solutions with two significant compo-
nents in 13/16 cases. There was one three-component solution
and two four-component solutions. Rerunning the two-compo-
nent PCA with each individual’s data set included as 12 separate
columns (an approach that did not constrain the same task to
load on the same component across participants) demonstrated
that the pattern of task-component loadings was also highly reli-
able at the individual subject level (Figure 1C). In order to test the
reliability of the functional networks across participants, the data
were concatenated instead of averaged into 12 columns (an
approach that does not constrain the same voxels to load on
the same components across individuals), and component
scores were estimated at each voxel and
projected back into two sets of 16 brain
maps. When t contrasts were calculated
against zero at the group level, the same
MDwm and MDr functional networks
were rendered (Figure 1E).
While the PCA works well to identify the
number of significant components, a
potential weakness for this method is
that the unrotated task-component load-
ings are liable to be formed from mixtures
of the underlying factors and are heavily
biased toward the component that is ex-
tracted first. This weakness necessitates
the application of rotation to the task-
component matrix; however, rotation is
not perfect, as it identifies the task-
component loadings that fit an arbitrary
set of criteria designed to generate the
simplest and most interpretable solution.
To deal with this potential issue, the task-
functional network loadings were recalcu-
lated using independent component anal-
ysis (ICA), an analysis technique that
exploits the more powerful properties of
statistical independence to extract the sources from mixed
signals. Here, we used ICA to extract two spatially distinct func-
tional brain networks using gradient ascent toward maximum
entropy (code adapted from Stone and Porrill, 1999). The resultant
components were broadly similar, although not identical, to those
from the PCA (Table 1). More specifically, all tasks loaded posi-
tively on both independent brain networks but to highly varied
extents, with the short-term memory tasks loading heavily on
one component and the tasks that involved transforming informa-
tion according to logical rules loading heavily on the other. Based
on these results, it is reasonable to conclude that MD cortex is
formed from at least two functional networks, with all 12 cognitive
tasks recruiting both networks but to highly variable extents.
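A schematic version of the voxel-level analysis described above might look like the sketch below. It assumes a voxels-by-tasks activation matrix and extracts principal components from the task-task correlation matrix with plain NumPy; the authors' actual pipeline (including varimax rotation and the ICA step) is only approximated, and the placeholder data are random.

    import numpy as np

    rng = np.random.default_rng(4)
    activation = rng.normal(size=(2275, 12))            # placeholder voxels-by-tasks matrix

    # Correlate tasks across voxels, then extract orthogonal components.
    task_corr = np.corrcoef(activation, rowvar=False)   # 12 x 12 task-task correlations
    eigvals, eigvecs = np.linalg.eigh(task_corr)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Keep components explaining more variance than any single task contributes
    # (eigenvalue > 1 on a correlation matrix), mirroring the criterion above.
    n_keep = int((eigvals > 1).sum())
    loadings = eigvecs[:, :n_keep] * np.sqrt(eigvals[:n_keep])
    print(n_keep, loadings.shape)                       # varimax rotation would follow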
Table 1. PCA and ICA of Activation Levels in 2,275 MD Voxels during the Performance of 12 Cognitive Tasks

                                 PCA             ICA
                               MDr   MDwm      MDr   MDwm
Self-ordered search            0.38  0.69      1.45  3.26
Visuospatial working memory    0.27  0.84      1.24  2.68
Spatial span                   0.17  0.86      0.51  2.23
Digit span                     0.28  0.76      0.76  2.20
Paired associates              0.56  0.62      1.90  1.97
Spatial planning               0.58  0.50      2.43  2.74
Feature match                  0.68  0.49      2.00  0.88
Interlocking polygons          0.74  0.31      2.11  0.61
Verbal reasoning               0.78  0.15      2.62  0.60
Spatial rotation               0.75  0.44      2.86  1.88
Color-word remapping           0.69  0.42      3.07  0.95
Deductive reasoning            0.90  0.18      3.98  0.19
PCA/ICA correlation: MDr r = 0.92; MDwm r = 0.81
Table 2. Task-Component Loadings from the PCA of Internet
Data with Orthogonal Rotation
1 (STM) 2 (Reasoning) 3 (Verbal)
Spatial span 0.69 0.22
Visuospatial working memory 0.69 0.21
Self-ordered search 0.62 0.16 0.16
Paired associates 0.58 0.25
Spatial planning 0.41 0.45
Spatial rotation 0.14 0.66
Feature match 0.15 0.57 0.22
Interlocking polygons 0.54 0.3
Deductive reasoning 0.19 0.52 -0.14
Digit span 0.26 -0.2 0.71
Verbal reasoning 0.33 0.66
Color-word remapping 0.22 0.35 0.51
The Relationship between the Functional Organization
of MD Cortex and Individual Differences in Intelligence:
Permutation Modeling
A critical question is whether the loadings of the tasks on the
MDwm and MDr functional brain networks form a good predictor
of the pattern of cross-task correlations in performance
observed in the general population. That is, does the same set
of cognitive entities underlie the large-scale functional organiza-
tion of the brain and individual differences in performance? It is
important to note that factor analyses typically require many
measures. In the case of the spatial factor analyses reported
above, measures were taken from 2,275 spatially distinct ‘‘vox-
els’’ within MD cortex. In the case of the behavioral analyses,
we used scores from ~110,000 participants who logged in to
undertake Internet-optimized variants of the same 12 tasks. Of
these, ~60,000 completed all 12 tasks and a post-task question-
naire. After case-wise removal of extreme outliers, null values,
nonsense questionnaire responses, and exclusion of partici-
pants above the age of 70 and below the age of 12, exactly
44,600 data sets, each composed of 12 standardized task
scores, were included in the analysis (see Experimental
Procedures).
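The case-wise exclusions described in this paragraph amount to straightforward data filtering; a hedged sketch is given below, assuming a pandas data frame with an age column and twelve task-score columns (the column names, file name, and outlier threshold are hypothetical).

    import pandas as pd

    df = pd.read_csv("internet_cohort.csv")                       # placeholder file name
    task_cols = [c for c in df.columns if c.startswith("task_")]  # assumed naming scheme

    df = df.dropna(subset=task_cols)                              # remove null task scores
    df = df[(df["age"] >= 12) & (df["age"] <= 70)]                # age window noted above
    z = (df[task_cols] - df[task_cols].mean()) / df[task_cols].std()
    df = df[(z.abs() < 4).all(axis=1)]                            # drop extreme outliers (4 SD assumed)
    print(len(df), "participants retained")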
The loadings of the tasks on the MDwm and MDr networks
from the ICA were formed into two vectors. These were re-
gressed onto each individual’s set of 12 standardized task
scores with no constant term. When each individual’s MDwm
and MDr beta weights (representing component scores) were
estimated in this manner, they centered close to zero, showed no
positive correlation (MDwm mean beta = 0.05 ± 1.78; MDr
mean beta = 0.11 ± 2.92; MDwm-MDr correlation r = -0.20),
and, importantly, accounted for 34.3% of the total variance in
performance scores. For comparison, the first two principal
components of the behavioral data accounted for 36.6% of the
variance. Thus, the model based on the brain imaging data
captured close to the maximum amount of variance that could
be accounted for by the two best-fitting orthogonal linear
components. The average test-retest reliability of the 12 tasks,
collected in an earlier Internet cohort (Table S2), was 68%.
Consequently, the imaging ICA model predicted >50% of the
reliable variance in performance. The statistical significance of
this fit was tested against 1,000 permutations, in which the
MDwm and MDr vectors were randomly rearranged both within
and across vector prior to regression. The original vectors
formed a better fit than the permuted vectors in 100% of cases,
demonstrating that the brain imaging model was a significant
predictor of the performance data relative to models with the
same fine-grained values and the same level of complexity.
Two further sets of permutation tests were carried out in which
one vector was held constant and the other randomly permuted
1,000 times. When the MDwm vector was permuted, the original
vectors formed a better fit in 100% of cases. When the MDr
vector was permuted, the original vectors formed a better fit in
99.3% of cases. Thus, both the MDwm and the MDr vectors
were significant predictors of individual differences in behavioral
performance.
The Relationship between the Functional Organization
of MD Cortex and Individual Differences in Intelligence:
Similarity of Factor Solutions
Exploratory factor analysis was carried out on the behavioral
data using PCA. There were three significant behavioral compo-
nents that each accounted for more variance than was contrib-
uted by any one test (Table S3) and that together accounted
for 45% of the total variance. After orthogonal rotation with the
Varimax algorithm, the first two components showed a marked
similarity to the loadings of the tasks on the MDwm and MDr
networks (Table 2). Thus, the first component (STM) included
all of the tasks in which information was held actively on line in
short-term memory, whereas the second component (reasoning)
included all of the tasks in which information was transformed in
mind according to logical rules. Correlation analyses between
the task to functional brain network loadings and the task to
behavioral component loadings confirmed that the two
approaches generated broadly similar solutions (STM-MDwm
r = 0.79, p < 0.001; reasoning-MDr r = 0.64, p < 0.05).

Figure 2. Localizing the Functional-Anatomical Correlates of the Verbal Component
When task-component loadings for the verbal factor from the behavioral analysis were standardized and used as a predictor of activation within the whole brain, a left lateralized network was rendered, including the left inferior frontal gyrus, and temporal lobe regions bilaterally (p < 0.05 FDR corrected for the whole brain mass).

The third
behavioral component was readily interpretable and easily
comprehensible, accounting for a substantial proportion of the
variance in the three tasks that used verbal stimuli (Table 2),
these being digit span, verbal reasoning, and color-word remap-
ping. A relevant question regards why there was no third network
in the analysis of the MD cortex activation data. One possibility
was that a spatial equivalent of the verbal component did exist
in MD cortex but that it accounted for less variance than was
contributed by any one task in the imaging analysis. Extracting
three-component PCA and ICA solutions from the imaging
data did not generate an equivalent verbal component, a result
that is unsurprising, as a defining characteristic of MD cortex is
its insensitivity to stimulus category (Duncan and Owen, 2000).
A more plausible explanation was that the third behavioral
component had a neural basis in category-sensitive brain
regions outside of MD cortex. In line with this view, the task-
factor loadings from the third behavioral component correlated
closely with those from the additional third component extracted
from the PCA of all active voxels within the brain (r = 0.82,
p < 0.001). In order to identify brain regions that formed a likely
analog of the verbal component, the task-component loadings
were standardized so that they had unit deviation and zero
mean and were used to predict activation unconstrained within
the whole brain mass (see Experimental Procedures). Regions
including the left inferior frontal gyrus and the bilateral temporal
lobes were significantly more active during the performance of
tasks that weighed on the verbal component (Figure 2). This
set of brain regions had little overlap with MD cortex, an obser-
vation that was formalized using t tests on the mean beta weights
from within each of the anatomically distinct MD cortex ROIs.
This liberal approach demonstrated that none of the MD ROIs
were significantly more active for tasks that loaded on the verbal
component (p > 0.05, uncorrected and one tailed).
Determining the Likely Neural Basis of Higher-Order
Components
Based on this evidence, it is reasonable to infer that the
behavioral factors that underlie correlations in an individual’s
performance on tasks of the type typically
considered akin to intelligence have
a basis in the functioning of multiple brain
networks. This observation allows novel
insights to be derived regarding the likely
basis of higher-order components. More
specifically, in classical intelligence
testing, first-order components gener-
ated by factor analyzing the correlations between task scores
are invariably correlated positively if allowed to rotate into their
optimal oblique orientations. A common approach is to under-
take a second-order factor analysis of the correlations between
the obliquely orientated first-order components. The resultant
second-order component is often denoted as ‘‘g.’’ This
approach is particularly useful when tasks load heavily on
multiple components, as it can simplify the task to first-order
component weightings, making the factor solution more readily
interpretable. A complication for this approach, however, is
that the underlying source of this second-order component is
ambiguous. More specifically, while correlations between
first-order components from the PCA may arise because the
underlying factors are themselves correlated (for example, if
the capacities of the MDwm and MDr networks were influenced
by some diffuse factor like conduction speed or plasticity),
they will also be correlated if there is ‘‘task mixing,’’ that is,
if tasks tend to weigh on multiple independent factors. In
behavioral factor analysis, these accounts are effectively indis-
tinguishable as the components or latent variables cannot be
measured directly. Here, we have an objective measure of the
extent to which the tasks are mixed, as we know, based on the
functional neuroimaging data, the extent to which the tasks
recruit spatially separated functional networks relative to rest.
Consequently, it is possible to subdivide ‘‘g’’ into the proportion
that is predicted by the mixing of tasks on multiple functional
brain networks and the proportion that may be explained by
other diffuse factors (Figure 3).
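The second-order procedure described in this paragraph (obliquely rotated first-order components whose positive intercorrelations are themselves factor analyzed to yield "g") can be sketched as follows. This is a schematic illustration under assumed component intercorrelations, with the first principal component of the component correlation matrix standing in for the second-order factor.

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder oblique first-order component scores (individuals x 3 components:
# STM, reasoning, verbal); real scores would come from an oblique factor rotation.
factor_scores = rng.multivariate_normal(
    mean=np.zeros(3),
    cov=[[1.0, 0.4, 0.3],
         [0.4, 1.0, 0.35],
         [0.3, 0.35, 1.0]],
    size=10_000,
)

# Correlations among the first-order components.
corr = np.corrcoef(factor_scores, rowvar=False)

# Second-order factor analysis in its simplest form: the first principal
# component of the component intercorrelation matrix plays the role of "g".
eigvals, eigvecs = np.linalg.eigh(corr)
g_loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])      # loadings of the three components on "g"
variance_explained = eigvals[-1] / eigvals.sum()

print("loadings on g:", np.round(np.abs(g_loadings), 2))
print(f"proportion of component variance explained by g: {variance_explained:.2f}")
```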
Two simulated data sets were generated; one based on the
loadings of the tasks on the MDwm and MDr functional networks
(2F) and the other including task activation levels for the verbal
network (3F). Each of the 44,600 simulated ‘‘individuals’’ was
assigned a set of either two (2F) or three (3F) factor scores using
a random Gaussian generator. Thus, the underlying factor
scores represented normally distributed individual differences
and were assumed to be completely independent in the simula-
tions. The 12 task scores were assigned for each individual
by multiplying the task-functional network loadings from the
ICA of the neuroimaging data by the corresponding, randomly
generated, factor score and summating the resultant values. The scores were then standardized for each task, and noise was added as the product of randomly generated Gaussian noise, the test-retest reliabilities (Table S2), and a noise level constant. A series of iterative …

Figure 3. Determining Whether Cross-Component Correlations in the Behavioral Factor Analysis Are Accounted for by the Tasks Recruiting Multiple Independent Functional Brain Networks
A cognitive task can measure a combination of noise, task-specific components, and components that are general, contributing to the performance of multiple tasks. In the current study, there were three first-order components: reasoning, short-term memory (STM), and verbal processing. In classical intelligence testing, the first-order components are invariably correlated positively when allowed to rotate into oblique orientations. A factor analysis of these correlations may be undertaken to estimate a second-order component, and this is generally denoted as "g." "g" may be generated from distinct sources: task mixing, the tendency for tasks to corecruit multiple systems, and diffuse factors that contribute to the capacities of all of those systems. When simulations were built based on the brain imaging data, the correlations between the first-order components from the behavioral study were entirely accounted for by tasks corecruiting multiple functional networks.
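The simulation just described (independent Gaussian factor scores per simulated individual, task scores formed from the imaging-derived loadings, and noise scaled by test-retest reliability and a noise constant) translates almost directly into code. The sketch below uses placeholder loadings and reliabilities rather than the article's Table S2 values, so it only illustrates the logic of the 2F simulation.

```python
import numpy as np

rng = np.random.default_rng(3)
n_individuals, n_tasks, n_factors = 44_600, 12, 2        # 2F simulation; 3F would add the verbal network

# Placeholder task-to-network loadings (standing in for the ICA-derived loadings)
# and placeholder test-retest reliabilities (standing in for Table S2 values).
task_network_loadings = rng.uniform(0.2, 0.8, size=(n_tasks, n_factors))
reliabilities = rng.uniform(0.6, 0.9, size=n_tasks)
noise_level = 1.0

# Independent, normally distributed factor scores for each simulated individual.
factor_scores = rng.standard_normal((n_individuals, n_factors))

# Task scores: loadings multiplied by factor scores, summed over networks, then standardized.
task_scores = factor_scores @ task_network_loadings.T
task_scores = (task_scores - task_scores.mean(0)) / task_scores.std(0)

# Add noise as the product of Gaussian noise, the reliabilities, and a noise constant,
# following the procedure described in the text.
task_scores += rng.standard_normal((n_individuals, n_tasks)) * reliabilities * noise_level

# Any correlation between first-order components recovered from these scores can only
# reflect task mixing, because the underlying factors were generated independently.
corr = np.corrcoef(task_scores, rowvar=False)
print("mean off-diagonal task correlation:", round(corr[np.triu_indices(n_tasks, 1)].mean(), 3))
```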
Whose IQ Is It?—Assessor Bias Variance in High-Stakes
Psychological Assessment
Paul A. McDermott
University of Pennsylvania
Marley W. Watkins
Baylor University
Anna M. Rhoad
University of Pennsylvania
Assessor bias variance exists for a psychological measure when some appreciable portion of the score
variation that is assumed to reflect examinees’ individual differences (i.e., the relevant phenomena in
most psychological assessments) instead reflects differences among the examiners who perform the
assessment. Ordinary test reliability estimates and standard errors of measurement do not inherently
encompass assessor bias variance. This article reports on the application of multilevel linear modeling to
examine the presence and extent of assessor bias in the administration of the Wechsler Intelligence Scale
for Children—Fourth Edition (WISC–IV) for a sample of 2,783 children evaluated by 448 regional
school psychologists for high-stakes special education classification purposes. It was found that nearly
all WISC–IV scores conveyed significant and nontrivial amounts of variation that had nothing to do with
children’s actual individual differences and that the Full Scale IQ and Verbal Comprehension Index
scores evidenced quite substantial assessor bias. Implications are explored.
Keywords: measurement bias, assessment, assessor variance, WISC–IV
The Wechsler scales are among the most popular and re-
spected intelligence tests worldwide (Groth-Marnat, 2009). The
many scores extracted from a given Wechsler test administra-
tion have purported utility for a multitude of applications. For
example, as pertains to the contemporary version for school-age
children (the Wechsler Intelligence Scale for Children—Fourth
Edition [WISC–IV]; Wechsler, 2003), the publisher recom-
mends that resultant scores be used to (a) assess general intel-
lectual functioning; (b) assess performance in each major do-
main of cognitive ability; (c) discover strengths and weaknesses
in each domain of cognitive ability; (d) interpret clinically
meaningful score patterns associated with diagnostic groups;
(e) interpret the scatter of subtests both diagnostically and
prescriptively; (f) suggest classroom modifications and teacher
accommodations; (g) analyze score profiles from both an inter-
individual and intraindividual perspective; and (h) statistically
contrast and then interpret differences between pairs of com-
ponent scores and between individual scores and subsets of
multiple scores (Prifitera, Saklofske, & Weiss, 2008; Wechsler,
2003; Weiss, Saklofske, Prifitera, & Holdnack, 2006).
The publisher and other writers offer interpretations for the
unique underlying construct meaning (as distinguished from the
actual nominal labels) for every WISC–IV composite score, sub-
score, and many combinations thereof (Flanagan & Kaufman,
2009; Groth-Marnat, 2009; Mascolo, 2009). Moreover, the
Wechsler Full Scale IQ (FSIQ) is routinely used to differentially
classify mental disability (Bergeron, Floyd, & Shands, 2008;
Spruill, Oakland, & Harrison, 2005) and giftedness (McClain &
Pfeiffer, 2012), to discover appreciable discrepancies between
expected and observed school achievement as related to learning
disabilities (Ahearn, 2009; Kozey & Siegel, 2008), and to exclude
ability problems as an etiological alternative in the identification of
noncognitive disorders (emotional disturbance, communication
disabilities, etc.; Kamphaus, Worrell, & Harrison, 2005).
As Kane (2013) has reminded test publishers and users, “the
validity of a proposed interpretation or use depends on how well
the evidence supports the claims being made” and “more-
ambitious claims require more support than less-ambitious claims”
(p. 1). At the most fundamental level, the legitimacy of every claim
is entirely dependent on the accuracy of test scores in reflecting
individual differences. Such accuracy is traditionally assessed
through measures of content sampling error (internal consistency
estimates) and temporal sampling error (test–retest stability esti-
mates; Allen & Yen, 2001; Wasserman & Bracken, 2013). These
estimates are commonplace in test manuals, as incorporated in a
standard error of measurement index. It is sometimes assumed that
such indexes fully represent the major threats to test score inter-
pretation and use, but they do not (Hanna, Bradley, & Holen, 1981;
Oakland, Lee, & Axelrad, 1975; Thorndike & Thorndike-Christ,
2010; Viswanathan, 2005). Tests administered individually by
psychologists or other specialists (in contrast to paper-and-pencil
test administrations) are highly vulnerable to error sources beyond
content and time sampling. For example, substantial portions of
error variance in scores are rooted in the systematic and erratic
errors of those who administer and score the tests (Terman, 1918).
This is referred to as assessor bias (Hoyt & Kerns, 1999; Rauden-
bush & Sadoff, 2008).
Assessor bias is manifest where, for example, a psychologist
will tend to drift from the standardized protocol for test adminis-
tration (altering or ignoring stopping rules or verbal prompts,
mishandling presentation of items and materials, etc.) and errone-
ously scoring test responses (failure to query ambiguous answers,
giving too much or too little credit for performance, erring on time
limits, etc.). Sometimes these errors appear sporadically and are
limited to a given testing session, whereas other errors will tend to
reside more systematically with given psychologists and general-
ize over a more pervasive mode of unconventional, error-bound,
testing practice. Administration and scoring biases, most espe-
cially pervasive types, undermine the purpose of testing. Their
corrupting effects are exponentially more serious when testing
purposes are high stakes, and there is abundant evidence that such
biases will operate to distort major score interpretations, to change
results of clinical trials, and to alter clinical diagnoses and special
education classifications (Allard, Butler, Faust, & Shea, 1995;
Allard & Faust, 2000; Franklin, Stillman, Burpeau, & Sabers,
1982; Mrazik, Janzen, Dombrowski, Barford, & Krawchuk, 2012;
Schafer, De Santi, & Schneider, 2011).
Recently, Waterman, McDermott, Fantuzzo, and Gadsden
(2012) demonstrated research designs to estimate the amount of
systematic assessor bias variance carried by cognitive ability
scores in early childhood. Well-trained assessors applying individ-
ually administered tests were randomly assigned to child examin-
ees, whereafter each assessor tested numerous children. Conven-
tional test-score internal consistency, stability, and generalizability
were first supported (McDermott et al., 2009), and thereafter
hierarchical linear modeling (HLM) was used to partition score
variance into that part conveying children’s actual individual
differences (the relevant target phenomena in any high-stakes
psychological assessment) and that part conveying assessor bias
(also known as assessor variance; Waterman et al., 2012). The
technique was repeated for other high-stakes assessments in
elementary school and on multiple occasions, each application
revealing whether assessor variance was relatively trivial or
substantial.
This article reports on the application of the Waterman et al.
(2012) technique to WISC–IV assessments by regional school
psychologists over a period of years. The sample comprises child
examinees who were actually undergoing assessment for high-
stakes special education classification and related clinical pur-
poses. Whereas the study was designed to investigate the presence
and extent of assessor bias variance, it was not designed to pin-
point the exact causes of that bias. Rather, multilevel procedures
are used to narrow the scope of probable primary causes and
ancillary empirical analyses, and interpretations are used to shed
light on the most likely sources of WISC–IV score bias.
Method
Participants
Two large southwestern public school districts were recruited for
this study by university research personnel, as regulated by Institutional Review Board (IRB) and respective school district confidentiality and
procedural policies. School District 1 had an enrollment of 32,500
students and included 31 elementary, eight middle, and six high
schools. Ethnic composition for the 2009–2010 academic year was
67.2% Caucasian, 23.8% Hispanic, 4.0% African American, 3.9%
Asian, and 1.1% Native American. District 2 served 26,000 students
in 2009–2010, with 16 elementary schools, three kindergarten
through eighth-grade schools, six middle schools, five high schools,
and one alternative school. Caucasian students comprised 83.1% of
enrollments, Hispanic 10.5%, Asian 2.9%, African American 1.7%,
and other ethnic minorities 1.8%.
Eight trained school psychology doctoral students examined ap-
proximately 7,500 student special education files and retrieved perti-
nent information from all special education files spanning the years
2003–2010, during which psychologists had administered the WISC–
IV. Although some special education files contained multiple periodic
WISC–IV assessments, only those data pertaining to the first (or only)
WISC–IV assessment for a given child were applied for this study;
this was used as a measure to enhance comparability of assessment
conditions and to avert sources of within-child temporal variance.
Information was collected for a total of 2,783 children assessed for the
first time via WISC–IV, that information having been provided by
448 psychologists over the study years, with 2,044 assessments col-
lected through District 1 files and 739 through District 2 files. The assessments
ranged from 1 to 86 per psychologist (M = 6.5, SD = 13.2).
Characteristics of the examining psychologists were not available
through school district files, nor was such information necessary for
the statistical separation of WISC–IV score variance attributable to
psychologists versus children.
Sample constituency for the 2,783 first-time assessments included
66.0% male children, 78.3% Caucasian, 13.0% Hispanic, 5.4% Afri-
can American, and 3.3% other less represented ethnic minorities.
Ages ranged from 6 to 16 years (M = 10.3 years, SD = 2.5), where
English was the home language for 95.0% of children (Spanish the
largest exception at 3.8%) and English was the primary language for
96.7% of children (Spanish the largest exception at 2.3%).
Whereas all children were undergoing special education assess-
ment for the first time using the WISC–IV, 15.7% of those children
had undergone prior psychological assessments not involving the
WISC–IV (periodic assessments were obligatory under state policy).
All assessments were deemed as high stakes, with a primary diagnosis
of learning disability rendered for 57.6% of children, emotional dis-
turbance for 11.6%, attention-deficit/hyperactivity disorder for 8.0%,
intellectual disability for 2.6%, 12.1% with other diagnoses, and 8.0%
receiving no diagnosis. Secondary diagnoses included 10.3% of chil-
dren with speech impairments and 3.7% with learning disabilities.
Instrumentation
The WISC–IV features 10 core and five supplemental subtests,
each with an age-blocked population mean of 10 and standard
deviation of 3. The core subtests are used to form four factor
indexes, where the Verbal Comprehension Index (VCI) is based on
the Similarities, Vocabulary, and Comprehension subtests; the
Perceptual Reasoning Index is based on Block Design, Matrix
Reasoning, and Picture Concepts subtests; the Working Memory
Index (WMI) on the Digit Span and Letter–Number Sequencing
subtests; and the Processing Speed Index (PSI) on the Coding and
Symbol Search subtests. The FSIQ is also formed from the 10 core
subtests. The factor indexes and FSIQ each retain an age-blocked
population mean of 100 and standard deviation of 15. The supple-
mental subtests were not included in this study because their
infrequent application precluded requisite statistical power for
multilevel analyses.
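For readers tabulating scores pulled from case files, the composite structure just described can be captured in a small lookup. This is only an organizational sketch of the structure stated above; actual index and FSIQ standard scores come from the publisher's norm tables, not from any formula shown here.

```python
# Core-subtest composition of the WISC-IV factor indexes, as described above.
WISC_IV_CORE_SUBTESTS = {
    "Verbal Comprehension Index": ["Similarities", "Vocabulary", "Comprehension"],
    "Perceptual Reasoning Index": ["Block Design", "Matrix Reasoning", "Picture Concepts"],
    "Working Memory Index": ["Digit Span", "Letter–Number Sequencing"],
    "Processing Speed Index": ["Coding", "Symbol Search"],
}

# The Full Scale IQ is formed from all 10 core subtests.
FSIQ_SUBTESTS = [s for subtests in WISC_IV_CORE_SUBTESTS.values() for s in subtests]

# Subtests have an age-blocked mean of 10 (SD = 3); indexes and FSIQ a mean of 100 (SD = 15).
SUBTEST_MEAN, SUBTEST_SD = 10, 3
COMPOSITE_MEAN, COMPOSITE_SD = 100, 15

assert len(FSIQ_SUBTESTS) == 10
```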
Analyses
The eight school psychology doctoral students examined each
special education case file and collected WISC–IV scores, assess-
ment date, child demographics, consequent psychological diagno-
ses, and identity of the examining psychologist. Following IRB
and school district requirements, the identity of participating chil-
dren and psychologists was concealed before data were released to
the researchers. Because test protocols were not accessible, nor
had standardized observations of test sessions been conducted, it
was not possible to determine whether specific scoring errors were
present, nor to associate psychologists with specific error types.
Rather, test score variability was analyzed via multilevel linear
modeling as conducted through SAS PROC MIXED (SAS Insti-
tute, 2011).
As a preliminary step to identify the source(s) of appreciable
score nesting, a three-level unconditional one-way random effects
HLM model was tested for the FSIQ score and each respective
factor index and subtest score, where Level 1 modeled score
variance between children within psychologists, Level 2 modeled
score variance between psychologists within school districts, and
Level 3 modeled variance between school districts. This series of
analyses sought to determine whether sufficient score variation
existed between psychologists and whether this was related to
school district affiliation. A second series of multilevel models
examined the prospect that because all data had been filtered
through a process involving eight different doctoral students, per-
haps score variation was affected by the data collection mechanism
as distinguished from the psychologists who produced the data.
Here, an unconditional cross-classified model was constructed for
FSIQ and each factor index and subtest score, with score variance
dually nested within doctoral student data collectors and examin-
ing psychologists.
Setting aside alternative hypotheses regarding influence of data
collectors and school districts, each IQ measure was examined
through a two-level unconditional HLM model in which Level 1
represented variation between children within examining psychol-
ogists and Level 2 variation between psychologists. The intraclass
correlation was derived from the random coefficient for intercepts
associated with each model and thereafter converted to a percent-
age of score variation between psychologists and between children
within psychologists.
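The two-level unconditional model and the intraclass correlation derived from it can be sketched with a mixed model. The authors used SAS PROC MIXED; the statsmodels call below is only an analogous illustration, and the column names (`fsiq`, `psychologist_id`) and the simulated data are assumptions for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder data: each row is one child's score plus the examining psychologist's ID.
rng = np.random.default_rng(4)
n_psych, children_per_psych = 100, 10
psych_effect = rng.normal(0, 6, n_psych)                  # between-psychologist variation
rows = [
    {"psychologist_id": j, "fsiq": 100 + psych_effect[j] + rng.normal(0, 14)}
    for j in range(n_psych)
    for _ in range(children_per_psych)
]
data = pd.DataFrame(rows)

# Two-level unconditional (intercept-only) model: children nested within psychologists.
model = smf.mixedlm("fsiq ~ 1", data, groups=data["psychologist_id"])
result = model.fit(reml=True)

# Intraclass correlation: between-psychologist variance over total variance,
# expressed as the percentage of score variance attributable to assessors.
var_between = result.cov_re.iloc[0, 0]
var_within = result.scale
icc = var_between / (var_between + var_within)
print(f"assessor (between-psychologist) variance: {100 * icc:.1f}%")
```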
Because psychologists were not assigned randomly to assess
given children (assignment will normally vary as a function of
random events, but also as related to which psychologists may
more often be affiliated with certain child age cohorts, schools,
educational levels, etc.), it seemed reasonable to hypothesize that
such nonrandom assignment would potentially result in some
systematic characterization of those students assessed by given
psychologists. Thus, any systematic patterns of assignments by
child demographics could somehow homogenize IQ score varia-
tion within psychologists. To ameliorate this potential, each two-
level unconditional model was augmented by addition of covari-
ates including child age, sex, ethnicity (minority vs. Caucasian),
child primary language (English as a secondary language vs.
English as a primary language), and their interactions. The binary
covariates were transformed to reflect the percentage of children
manifesting a given demographic characteristic as associated with
each psychologist, and all the covariates were grand-mean recen-
tered to capture (and control) differences between psychologists
(Hofmann & Gavin, 1998). Covariates were added systematically
to the model for each IQ score so as to minimize Akaike’s
information criterion (AIC; as recommended by Burnham & An-
derson, 2004), and only statistically significant effects were per-
mitted to remain in final models (although nonsignificant main
effects were permitted to remain in the presence of their significant
interactions). Whereas final models were tested under restricted
maximum-likelihood estimation, and are so reported, the overall
statistical consequence of the covariate augmentation for each
model was tested through likelihood ratio deviance tests contrast-
ing each respective unconditional and final conditional model
under full maximum-likelihood estimation (per Littell, Milliken,
Stroup, Wolfinger, & Schabenberger, 2006). In essence, the con-
ditional models operated to correct estimates of between-
psychologists variance (obtained through the initial unconditional
models) for the prospect that some of that variance was influenced
by the nonrandom assignment of psychologists to children.
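The conditional step (grand-mean-recentered psychologist-level covariates added to the unconditional model, with the improvement tested by a likelihood-ratio deviance test under full maximum likelihood) can be illustrated in the same framework. This is a hedged sketch rather than the authors' SAS code; the covariate names are assumptions, and the AIC-guided covariate selection loop is omitted.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

def fit_ml(formula, data):
    """Fit a two-level random-intercept model under full maximum likelihood."""
    return smf.mixedlm(formula, data, groups=data["psychologist_id"]).fit(reml=False)

def grand_mean_center(series):
    """Center a covariate at the grand mean."""
    return series - series.mean()

# `data` is assumed to hold one row per child, with psychologist-level covariates
# already aggregated (e.g., mean child age and percentage of male children per
# psychologist). Placeholder construction for the sake of a runnable example:
rng = np.random.default_rng(5)
data = pd.DataFrame({
    "psychologist_id": np.repeat(np.arange(100), 10),
    "fsiq": rng.normal(100, 15, 1000),
    "mean_age": rng.normal(10.3, 1.0, 1000),
    "pct_male": rng.uniform(0, 1, 1000),
})
for col in ("mean_age", "pct_male"):
    data[col] = grand_mean_center(data[col])

unconditional = fit_ml("fsiq ~ 1", data)
conditional = fit_ml("fsiq ~ mean_age + pct_male + mean_age:pct_male", data)

# Likelihood-ratio deviance test contrasting the two models (df = number of added fixed effects).
deviance = 2 * (conditional.llf - unconditional.llf)
p_value = stats.chi2.sf(deviance, df=3)
print(f"deviance = {deviance:.2f}, p = {p_value:.4f}")
```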
Results
A preliminary unconditional HLM model was applied for FSIQ
and each respective factor index and subtest score, where children
were nested within psychologists and psychologists within school
districts. The coefficient for random intercepts of children nested
within psychologists was statistically significant for almost all
models, but the coefficient for psychologists nested within districts
was nonsignificant for every model. Similarly, a preliminary mul-
tilevel model for each IQ score measure cross-classified children
nested within data collectors as well as psychologists. No model
produced a statistically significant effect for collectors, whereas
most models evinced a significant effect for psychologists. There-
fore, school district and data collection effects were deemed in-
consequential, and subsequent HLM models tested a random in-
tercept for nesting within psychologists only.
For each IQ score, two-level, unconditional and conditional
HLM models were constructed, initially testing the presence of
psychologist assessor variance and thereafter controlling for dif-
ferences in child age, sex, ethnicity, language status, and their
interactions. Table 1 reports the statistical significance of the
assessor variance effect for each IQ score and the estimated
percentage of variance associated exclusively with psychologists
versus children’s individual differences. The last column indicates
the statistical significance of the improvement of the conditional
model (controlling for child demographics) over the unconditional
model for each IQ measure. Where these values are nonsignificant,
understanding is enhanced by interpreting percentages associated
with the unconditional model, and where values are significant,
interpretation is enhanced by percentages associated with the con-
ditional model. Following this logic, percentages preferred for
interpretation are boldfaced.
The conditional models (which control for child demographics)
make a difference for FSIQ, VCI (especially its Similarities sub-
test), WMI, and PSI (especially its Coding subtest) scores. This
suggests at least that the nonrandom assignment of school psy-
chologists to children may result in imbalanced distributions of
children by their age, sex, ethnicity, and language status. This in
itself is not problematic and likely reflects the realities of requisite
quasi-systematic case assignment within school districts. Thus,
psychologists will be assigned partly on the basis of their famil-
iarity with given schools, levels of expertise with age cohorts,
travel convenience, and school district administrative divisions—
all factors that would tend to produce demographic differences
across case loads. The conditional models accommodate for that
prospect. At the same time, it should be recognized that the control
mechanisms in the conditional models are also probably overly
conservative because they will inadvertently control for assessor
bias arising as a function of children’s demographic characteristics
(race, sex, etc.) unrelated to case assignment methods.
Considering the major focus of the study (identification of that
portion of IQ score variation that without mitigation has nothing to
do with children’s actual individual differences), the FSIQ and all
four factor index scores convey significant and nontrivial
(viz. ≥5%) assessor bias. More troubling, bias for FSIQ (12.5%) and VCI (10.0%) is substantial (≥10%). Within VCI, the Vocab-
ulary subtest (14.3% bias variance) and Comprehension subtest
(10.7% bias variance) are the primary culprits, each conveying
substantial bias. Further problematic, under PSI, the Symbol
Search subtest is laden with substantial bias variance (12.7%).
On the positive side, the Matrix Reasoning subtest involves no
statistically significant bias (2.8%). Additionally, the Coding sub-
test, although retaining a statistically significant amount of asses-
sor variance, essentially yields a trivial (<5%) amount of such variance (4.4%). (Note that the 5% criterion for deeming hierarchical cluster variance as practically inconsequential comports with the convention recommended by Snijders & Bosker, 1999, and Waterman et al., 2012.)
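The conventions applied in this paragraph (roughly, less than 5% of variance as practically trivial, at least 5% as nontrivial, and 10% or more as substantial, provided the assessor effect is statistically significant) can be expressed as a small helper for reading Table 1. The function below is only a reading aid based on those stated thresholds.

```python
def classify_assessor_variance(pct_between_psychologists, significant=True):
    """Label assessor bias variance using the 5% / 10% conventions discussed in the text."""
    if not significant or pct_between_psychologists < 5.0:
        return "trivial"
    if pct_between_psychologists >= 10.0:
        return "substantial"
    return "nontrivial"

# Preferred estimates for three of the scores reported in Table 1.
print(classify_assessor_variance(12.5))                     # Full Scale IQ -> substantial
print(classify_assessor_variance(10.0))                     # Verbal Comprehension Index -> substantial
print(classify_assessor_variance(2.8, significant=False))   # Matrix Reasoning -> trivial
```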
Table 1
Percentages of Score Variance Associated With Examiner Psychologists Versus Children's Individual Differences on the Wechsler Intelligence Scale for Children—Fourth Edition

                                          Unconditional models (a)              Conditional models (b)                Difference between
                                          % variance        % variance          % variance        % variance          unconditional and
IQ score                           N      between           between             between           between             conditional
                                          psychologists     children            psychologists     children            models (p) (c)
Full Scale IQ                   2,722     16.2***           83.8                12.5***           87.5                .0049
Verbal Comprehension Index      2,783     14.0***           86.0                10.0***           90.0                <.0001
  Similarities                  2,551     10.6***           89.4                 7.4***           92.6                .0069
  Vocabulary                    2,538     14.3***           85.7                10.4***           89.6                ns
  Comprehension                 2,524     10.7***           89.3                 9.9***           90.1                ns
Perceptual Reasoning Index      2,783      7.1**            92.9                 5.7**            94.3                ns
  Block Design                  2,544      5.3**            94.7                 3.8*             96.2                ns
  Matrix Reasoning              2,520      2.8              97.2                 2.4              97.6                ns
  Picture Concepts              2,540      5.4*             94.6                 4.9*             95.1                ns
Working Memory Index            2,782      9.8***           90.2                 8.3***           91.7                .002
  Digit Span                    2,548      7.8***           92.2                 7.5***           92.5                ns
  Letter–Number Sequencing      2,486      5.2*             94.8                 4.2*             95.8                ns
Processing Speed Index          2,778     12.6***           87.4                 7.6***           92.4                <.0001
  Coding                        2,528      9.2***           90.8                 4.4*             95.6                <.0001
  Symbol Search                 2,521     12.7***           87.3                 9.9***           90.1                ns

(a) Entries for percentage of variance between psychologists equal ICC × 100 as derived in hierarchical linear modeling. Percentages of variance between children equal (1 − ICC) × 100. Boldface entries are regarded optimal for interpretation purposes (in contrast to entries under the alternative conditional model, which do not represent significant improvement). Model specification is Y_ij = γ00 + u0j + r_ij, where i indexes children within psychologists and j indexes psychologists. Significance tests indicate statistical significance of the random coefficient for psychologists, where p values ≥ .01 are considered nonsignificant. ICC = intraclass correlation coefficient.
(b) Entries for percentage of variance between psychologists equal residual ICC × 100 as derived in hierarchical linear modeling, incorporating statistically significant fixed effects for child age, sex, ethnicity, language status, and their interactions. Percentages of variance between children equal (1 − residual ICC) × 100. Boldface entries are regarded optimal for interpretation purposes (in contrast to entries under the alternative unconditional model). Model specification is Y_ij = γ00 + γ01(MeanAge_j) + γ02(MeanPercentMale_j) + γ03(MeanPercentMinority_j) + γ04(MeanPercentESL_j) + γ05(MeanAge_j)(MeanPercentMale_j) + ... + r_ij, where i indexes children within psychologists, j indexes psychologists, and nonsignificant terms are dropped from models. Significance tests indicate statistical significance of the residualized random coefficient for psychologists, where p values ≥ .01 are considered nonsignificant.
(c) Values are based on tests of the deviance between −2 log likelihood estimates for respective unconditional and conditional models under full maximum-likelihood estimation. ps ≥ .01 are considered nonsignificant (ns).
* p < .01. ** p < .001. *** p < .0001.

Discussion
The degree of assessor bias variance conveyed by FSIQ and VCI scores effectively vitiates the usefulness of those measures for differential diagnosis and classification, particularly in the vicinity of the critical cut points ordinarily applied for decision making. That is, to the extent that decisions on mental deficiency and intellectual giftedness will depend on discovery of FSIQs ≤ 70 or ≥ 130, respectively, or that ability-achievement discrepancies (whether based on regression modeling or not) will depend on accurate measurement of the FSIQ, those decisions cannot be
rendered with reasonable confidence because the IQ measures
reflect substantial proportions of score variation emblematic of
differences among examining psychologists rather than among
children. The folly of basing decisions in part or in whole on such
IQ measures is accentuated where the evidence (for intellectual
disability, etc.) is anything but incontrovertible because the FSIQ
score is markedly above or below the cut point or the ability-
achievement discrepancy is so immense as to leave virtually no
doubt that real and substantial disparity exists (see also Franklin et
al., 1982; Gresham, 2009; Lee, Reynolds, & Willson, 2003;
Mrazik et al., 2012; Reynolds & Milam, 2012, on the matter of
high-stakes decisions following IQ test administration and scoring
errors).
This study is limited by virtue of its dependence on a regional
rather than a more representative national sample. Indeed, future
research should explore the broader generalization of assessor bias
effects. From one perspective, it would seem ideal if psychologists
could be randomly assigned to children because that process would
equitably disperse the myriad elements of variance that can neither
be known nor controlled. From another perspective, random as-
signment is probably infeasible because, to the extent that partic-
ipant children and their families and schools are expecting psy-
chological services from those practitioners who have the best
relationships with given schools or school personnel or expertise
with certain levels of child development, the reactivity associated
with random assignment for high-stakes assessments could do
harm or be perceived as doing harm.
Unfortunately, test protocols were inaccessible, and there were
no standardized test session observations. Thus, it was not possible
to …