DEBATING ABILITY TESTING - Psychology
See Chapters 5 and 6 of the attached textbook and the attached required articles, and view the IQ: A History of Deceit video. The link to the video is:
https://fod.infobase.com/OnDemandEmbed.aspx?Token=52818&aid=18596&Plt=FOD&loid=0&w=640&h=480&ref
Present at least two viewpoints debating professional approaches to assessment used in psychology for the assigned age group: adults age 61 and older. In addition to the required readings attached, research a minimum of one peer-reviewed article on ability testing research as it pertains to adults age 61 and older.
Briefly compare and discuss at least two theories of intelligence and the most up-to-date versions of two intelligence tests related to those theories.
Analyze challenges related to testing adults age 61 and older, and describe any special ethical and sociocultural issues that must be considered.
Analyze and provide evidence from research on the validity of the tests you selected that supports or opposes using those specific intelligence tests with your assigned population.
Present the pros and cons of individual versus group intelligence testing.
Summarize the implications of labeling and mislabeling adults age 61 and older as a result of testing and assessment.
Required Resources
Text
Gregory, R. J. (2014). Psychological testing: History, principles, and applications (7th ed.). Boston, MA: Pearson.
Chapter 5: Theories and Individual Tests of Intelligence and Achievement
Chapter 6: Group Tests and Controversies in Ability Testing
Articles
Ekinci, B. (2014). The Relationship among Sternberg's triarchic abilities, Gardner's multiple intelligences, and academic achievement. Social Behavior & Personality, 42(4), 625-633. doi: 10.2224/sbp.2014.42.4.625
The full-text version of this article can be accessed through the EBSCOhost database in the University of Arizona Global Campus Library. The author presents a discussion of the relationships among Sternberg’s triarchic abilities (STA), Gardner’s multiple intelligences, and the academic achievement of children attending primary schools. The article serves as an example of an empirical investigation of theoretical intellectual constructs.
Fletcher, J. M., Francis, D. J., Morris, R. D., & Lyon, G. R. (2005). Evidence-based assessment of learning disabilities in children and adolescents. Journal of Clinical Child and Adolescent Psychology, 34(3), 506-522. Retrieved from the EBSCOhost database.
The authors of the article review the reliability and validity of four approaches to the assessment of children and adolescents with learning disabilities.
Hampshire, A., Highfield, R. R., Parkin, B. L., & Owen, A. M. (2012). Fractionating human intelligence. Neuron, 76(6), 1225–1237. doi: 10.1016/j.neuron.2012.06.022
The full-text version of this article can be accessed through the ProQuest database in the University of Arizona Global Campus Library. The authors compare factor models of individual differences in performance with factor models of brain functional organization to demonstrate that different components of intelligence have analogs in distinct brain networks.
Healthwise Staff. (2014). Mental health assessment. Retrieved from http://www.webmd.com/mental-health/mental-health-assessment
This online article presents information on the purposes of mental health assessments and what examinees and family members may expect during mental health assessment visits.
McDermott, P. A., Watkins, M. W., & Rhoad, A. M. (2014). Whose IQ is it?—Assessor bias variance in high-stakes psychological assessment. Psychological Assessment, 26(1), 207-214. doi: 10.1037/a0034832
The full-text version of this article can be accessed through the EBSCOhost database in the University of Arizona Global Campus Library. Assessor bias occurs when a significant portion of the examinee’s test score actually reflects differences among the examiners who perform the assessment. The authors examine the extent of assessor bias in the administration of the Wechsler Intelligence Scale for Children—Fourth Edition (WISC–IV) and explore the implications of this phenomenon.
Rockstuhl, T., Seiler, S., Ang, S., Van Dyne, L., & Annen, H. (2011). Beyond general intelligence (IQ) and emotional intelligence (EQ): The role of cultural intelligence (CQ) on cross-border leadership effectiveness in a globalized world. Journal of Social Issues, 67(4), 825-840. Retrieved from the EBSCOhost database.
This article represents a contemporary, real-world application of intellectual testing. The authors discuss the implications of the research on the relationship among general intelligence (IQ), emotional intelligence (EQ), cultural intelligence (CQ), and cross-border leadership effectiveness.
Multimedia
de Rossier, L. (Producer), & Boutinard-Rouelle, P. (Director). (2011). IQ: A history of deceit [Video file]. Retrieved from https://fod.infobase.com/OnDemandEmbed.aspx?Token=52818&aid=18596&Plt=FOD&loid=0&w=640&h=480&ref
The full version of this video is available through the Films on Demand database in the University of Arizona Global Campus Library. This program reviews the history of intelligence assessment.
5.1 DEFINITIONS OF 
INTELLIGENCE 
Before we discuss definitions of intelligence, we 
need to clarify the nature of definition itself. 
Sternberg (1986) makes a distinction between 
operational and “real” definitions that is 
important in this context. An operational 
definition defines a concept in terms of the way 
it is measured. Boring (1923) carried this 
viewpoint to its extreme when he defined 
intelligence as “what the tests test.” Believe it or 
not, this was a serious proposal, designed 
largely to short-circuit rampant and divisive 
disagreements about the definition of 
intelligence. 
Operational definitions of intelligence suffer 
from two dangerous shortcomings (Sternberg, 
1986). First, they are circular. Intelligence tests 
were invented to measure intelligence, not to 
define it. The test designers never intended for 
their instruments to define intelligence. Second, 
operational definitions block further progress in 
understanding the nature of intelligence, 
because they foreclose discussion on the 
adequacy of theories of intelligence. 
This second problem—the potentially stultifying 
effects of relying on operational definitions of 
intelligence—casts doubt on the common 
practice of affirming the concurrent validity of 
new tests by correlating them with old tests. If 
established tests serve as the principal criterion 
against which new tests are assessed, then the 
new tests will be viewed as valid only to the 
extent that they correlate with the old ones. 
Such a conservative practice drastically curtails 
innovation. The operational definition of 
intelligence does not allow for the possibility 
that new tests or conceptions of intelligence 
may be superior to the existing ones. 
We must conclude, then, that operational 
definitions of intelligence leave much to be 
desired. In contrast, a real definition is one that 
seeks to tell us the true nature of the thing being 
defined (Robinson, 1950; Sternberg, 1986). 
Perhaps the most common way—but by no 
means the only way—of producing real 
definitions of intelligence is to ask experts in the 
field to define it. 
Expert Definitions of Intelligence 
Intelligence has been given many real 
definitions by prominent researchers in the field. 
In the following, we list several examples, 
paraphrased slightly for editorial consistency. 
The reader will note that many of these 
definitions appeared in an early but still 
influential symposium, “Intelligence and Its 
Measurement,” published in the Journal of 
Educational Psychology (Thorndike, 1921). 
Other definitions stem from a modern update of 
this early symposium, What Is Intelligence?, 
edited by Sternberg and Detterman (1986). 
Intelligence has been defined as the following: 
•  Spearman (1904, 1923): a general ability 
that involves mainly the eduction of 
relations and correlates. 
•  Binet and Simon (1905): the ability to judge 
well, to understand well, to reason well. 
•  Terman (1916): the capacity to form 
concepts and to grasp their significance. 
•  Pintner (1921): the ability of the individual 
to adapt adequately to relatively new 
situations in life. 
•  Thorndike (1921): the power of good 
responses from the point of view of truth or 
fact. 
•  Thurstone (1921): the capacity to inhibit 
instinctive adjustments, flexibly imagine 
different responses, and realize modified 
instinctive adjustments into overt behavior. 
•  Wechsler (1939): The aggregate or global 
capacity of the individual to act 
purposefully, to think rationally, and to deal 
effectively with the environment. 
•  Humphreys (1971): the entire repertoire of 
acquired skills, knowledge, learning sets, 
and generalization tendencies considered 
intellectual in nature that are available at any 
one period of time. 
•  Piaget (1972): a generic term to indicate the 
superior forms of organization or 
equilibrium of cognitive structuring used for 
adaptation to the physical and social 
environment. 
•  Sternberg (1985a, 1986): the mental 
capacity to automatize information 
processing and to emit contextually 
appropriate behavior in response to novelty; 
intelligence also includes metacomponents, 
performance components, and knowledge-
acquisition components (discussed later). 
•  Eysenck (1986): error-free transmission of 
information through the cortex. 
•  Gardner (1986): the ability or skill to solve 
problems or to fashion products that are 
valued within one or more cultural settings. 
•  Ceci (1994): multiple innate abilities that 
serve as a range of possibilities; these 
abilities develop (or fail to develop, or 
develop and later atrophy) depending upon 
motivation and exposure to relevant 
educational experiences. 
•  Sattler (2001): intelligent behavior reflects 
the survival skills of the species, beyond 
those associated with basic physiological 
processes. 
The preceding list of definitions is 
representative although definitely not 
exhaustive. For one thing, the list is exclusively 
Western and omits several cross-cultural 
conceptions of intelligence. Eastern conceptions 
of intelligence, for example, emphasize 
benevolence, humility, freedom from 
conventional standards of judgment, and doing 
what is right as essential to intelligence. Many 
African conceptions of intelligence place heavy 
emphasis on social aspects of intelligence such 
as maintaining harmonious and stable 
intergroup relations (Sternberg & Kaufman, 
1998). The reader can consult Bracken and 
Fagan (1990), Sternberg (1994), and Sternberg 
and Detterman (1986) for additional ideas. 
Certainly, this sampling of views is sufficient to 
demonstrate that there appear to be as many 
definitions of intelligence as there are experts 
willing to define it! 
In spite of this diversity of viewpoints, two 
themes recur again and again in expert 
definitions of intelligence. Broadly speaking, 
the experts tend to agree that intelligence is (1) 
the capacity to learn from experience and (2) the 
capacity to adapt to one’s environment. That 
learning and adaptation are both crucial to 
intelligence stands out with poignancy in certain 
cases of mental disability in which persons fail 
to possess one or the other capacity in sufficient 
degree (Case Exhibit 5.1). 
CASE EXHIBIT 5.1 
Learning and Adaptation as Core 
Functions of Intelligence 
Persons with mental disability often 
demonstrate the importance of experiential 
learning and environmental adaptation as key 
ingredients of intelligence. Consider the case 
history of a 61-year-old newspaper vendor with 
moderate mental retardation well known to local 
mental health specialists. He was an interesting 
if not eccentric gentleman who stored canned 
goods in his freezer and cursed at welfare 
workers who stopped by to see how he was 
doing. In spite of his need for financial support 
from a state agency, he was fiercely independent 
and managed his own household with minimal 
supervision from case workers. Thus, in some 
respects he maintained a tenuous adaptation to 
his environment. To earn much-needed extra 
income, he sold a local 25-cent newspaper from 
a streetside newsstand. He recognized that a 
quarter was proper payment and had learned to 
give three quarters in change for a dollar bill. He 
refused all other forms of payment, an 
arrangement that his customers could accept. 
But one day the price of the newspaper was 
increased to 35 cents, and the newspaper vendor 
was forced to deal with nickels and dimes as 
well as quarters and dollar bills. The amount of 
learning required by this slight shift in 
environmental demands exceeded his 
intellectual abilities, and, sadly, he was soon out 
of business. His failed efforts highlight the 
essential ingredients of intelligence: learning 
from experience and adaptation to the 
environment. 
How well do intelligence tests capture the 
experts’ view that intelligence consists of 
learning from experience and adaptation to the 
environment? The reader should keep this 
question in mind as we proceed to review major 
intelligence tests in the topics that follow. 
Certainly, there is cause for concern: Very few 
contemporary intelligence tests appear to 
require the examinee to learn something new or 
to adapt to a new situation as part and parcel of 
the examination process. At best, prominent 
modern tests provide indirect measures of the 
capacities to learn and adapt. How well they 
capture these dimensions is an empirical 
question that must be demonstrated through 
validational research. 
Layperson and Expert Conceptions of 
Intelligence 
Another approach to understanding a construct 
is to study its popular meaning. This method is 
more scientific than it may appear. Words have a 
common meaning to the extent that they help 
provide an effective portrayal of everyday 
transactions. If laypersons can agree on its 
meaning, a construct such as intelligence is in 
some sense “real” and, therefore, potentially 
useful. Thus, asking persons on the street, 
“What does intelligence mean to you?” has 
much to recommend it. 
Sternberg, Conway, Ketron, and Bernstein 
(1981) conducted a series of studies to 
investigate conceptions of intelligence held by 
American adults. In the first study, people in a 
train station, entering a supermarket, and 
studying in a college library were asked to list 
behaviors characteristic of different kinds of 
intelligence. In a second study—the only one 
discussed here—both laypersons and experts 
(mainly academic psychologists) rated the 
importance of these behaviors to their concept 
of an “ideally intelligent” person. 
The behaviors central to expert and lay 
conceptions of intelligence turned out to be very 
similar, although not identical. In order of 
importance, experts saw verbal intelligence, 
problem-solving ability, and practical 
intelligence as crucial to intelligence. 
Laypersons regarded practical problem-solving 
ability, verbal ability, and social competence to 
be the key ingredients in intelligence. Of course, 
opinions were not unanimous; these conceptions 
represent the consensus view of each group. In 
their conception of intelligence, experts place 
more emphasis on verbal ability than problem 
solving, whereas laypersons reverse these 
priorities. Nonetheless, experts and laypersons 
alike consider verbal ability and problem 
solving to be essential aspects of intelligence. 
As the reader will see, most intelligence tests 
also accent these two competencies. 
Prototypical examples would be vocabulary 
(verbal ability) and block design (problem 
solving) from the Wechsler scales, discussed 
later. We see then that everyday conceptions of 
intelligence are, in part, mirrored quite faithfully 
by the content of modern intelligence tests. 
Some disagreement between experts and 
laypersons is also evident. Experts consider 
practical intelligence (sizing up situations, 
determining how to achieve goals, awareness 
and interest in the world) an essential 
constituent of intelligence, whereas laypersons 
identify social competence (accepting others for 
what they are, admitting mistakes, punctuality, 
and interest in the world) as a third component. 
Yet, these two nominations do share one 
property in common: Contemporary tests 
generally make no attempt to measure either 
practical intelligence or social competence. 
Partly, this reflects the psychometric difficulties 
encountered in devising test items relevant to 
these content areas. However, the more 
influential reason intelligence tests do not 
measure practical intelligence or social 
competence is inertia: Test developers have 
blindly accepted historically incomplete 
conceptions of intelligence. Until recently, the 
development of intelligence testing has been a 
conservative affair, little changed since the days 
of Binet and the Army Alpha and Beta tests for 
World War I recruits. There are some signs that 
testing practices may soon evolve, however, 
with the development of innovative instruments. 
For example, Sternberg and colleagues have 
proposed innovative tests based on his model of 
intelligence. Another interesting instrument 
based on a new model of intelligence is the 
Everyday Problem Solving Inventory (Cornelius 
& Caspi, 1987). In this test, examinees must 
indicate their typical response to everyday 
problems such as failing to bring money, 
checkbook, or credit card when taking a friend 
to lunch. 
Many theorists in the field of intelligence have 
relied on factor analysis for the derivation or 
validation of their theories. In fact, it is not an 
overstatement to say that perhaps the majority 
of the theories in this area have been impacted 
by the statistical tools of factor analysis, which 
provide ways to partition intelligence into its 
subcomponents. One of the most compelling 
theories of intelligence, the Cattell-Horn-Carroll 
theory reviewed later, would not exist without 
factor analysis. Thus, before summarizing 
theories, we provide a brief review of this 
essential statistical tool. 
5.2 A PRIMER OF FACTOR 
ANALYSIS 
Broadly speaking, there are two forms of factor 
analysis: confirmatory and exploratory. In 
confirmatory factor analysis, the purpose is to 
confirm that test scores and variables fit a 
certain pattern predicted by a theory. For 
example, if the theory underlying a certain 
intelligence test prescribed that the subtests 
belong to three factors (e.g., verbal, 
performance, and attention factors), then a 
confirmatory factor analysis could be 
undertaken to evaluate the accuracy of this 
prediction. Confirmatory factor analysis is 
essential to the validation of many ability tests. 
The central purpose of exploratory factor 
analysis is to summarize the interrelationships 
among a large number of variables in a concise 
and accurate manner as an aid in 
conceptualization (Gorsuch, 1983). For 
instance, factor analysis may help a researcher 
discover that a battery of 20 tests represents 
only four underlying variables, called factors. 
The smaller set of derived factors can be used to 
represent the essential constructs that underlie 
the complete group of variables. 
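To make this concrete, the following sketch shows how an exploratory factor analysis might be run in Python on a battery of test scores. The data, the number of tests, and the choice of five factors are placeholders, and scikit-learn is only one of several packages that could be used.

```python
# Minimal exploratory factor analysis sketch (placeholder data).
# Assumes scikit-learn is installed; the score matrix below stands in
# for a real examinee-by-test data set.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(42)
scores = rng.normal(size=(145, 24))   # 145 examinees x 24 ability tests (placeholder)

fa = FactorAnalysis(n_components=5)   # ask for five underlying factors
fa.fit(scores)

loadings = fa.components_.T           # rows = tests, columns = factors
print(loadings.shape)                 # (24, 5)
# With real ability-test scores, the columns of `loadings` would resemble
# the factor columns discussed later in this chapter.
```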
Perhaps a simple analogy will clarify the nature 
of factors and their relationship to the variables 
or tests from which they are derived. Consider 
the track-and-field decathlon, a mixture of 10 
diverse events including sprints, hurdles, pole 
vault, shot put, and distance races, among 
others. In conceptualizing the capability of the 
individual decathlete, we do not think 
exclusively in terms of the participant’s skill in 
specific events. Instead, we think in terms of 
more basic attributes such as speed, strength, 
coordination, and endurance, each of which is 
reflected to a different extent in the individual 
events. For example, the pole vault requires 
speed and coordination, while hurdle events 
demand coordination and endurance. These 
inferred attributes are analogous to the 
underlying factors of factor analysis. Just as the 
results from the 10 events of a decathlon may 
boil down to a small number of underlying 
factors (e.g., speed, strength, coordination, and 
endurance), so too may the results from a 
battery of 10 or 20 ability tests reflect the 
operation of a small number of basic cognitive 
attributes (e.g., verbal skill, visualization, 
calculation, and attention, to cite a hypothetical 
list). This example illustrates the goal of factor 
analysis: to help produce a parsimonious 
description of large, complex data sets. 
We will illustrate the essential concepts of factor 
analysis by pursuing a classic example 
concerned with the number and kind of factors 
that best describe student abilities. Holzinger 
and Swineford (1939) gave 24 ability-related 
psychological tests to 145 junior high school 
students from Forest Park, Illinois. The factor 
analysis described later was based on methods 
outlined in Kinnear and Gray (1997). 
It should be intuitively obvious to the reader that 
any large battery of ability tests will reflect a 
smaller number of basic, underlying abilities 
(factors). Consider the 24 tests depicted in Table 
5.1. Surely some of these tests measure common 
underlying abilities. For example, we would 
expect Sentence Completion, Word 
Classification, and Word Meaning (variables 7, 
8, and 9) to assess a factor of general language 
ability of some kind. In like manner, other 
groups of tests seem likely to measure common 
underlying abilities—but how many abilities or 
factors? And what is the nature of these 
underlying abilities? Factor analysis is the ideal 
tool for answering these questions. We follow 
the factor analysis of the Holzinger and 
Swineford (1939) data from beginning to end. 
TABLE 5.1 The 24 Ability Tests Used by 
Holzinger and Swineford (1939) 
1.Visual Perception 
2.Cubes 
3.Paper Form Board 
4.Flags 
5.General Information 
6.Paragraph Comprehension 
7.Sentence Completion 
8.Word Classification 
9.Word Meaning 
10.Add Digits 
11.Code (Perceptual Speed) 
12.Count Groups of Dots 
13.Straight and Curved Capitals 
14.Word Recognition 
15.Number Recognition 
16.Figure Recognition 
17.Object-Number 
18.Number-Figure 
19.Figure-Word 
20.Deduction 
21.Numerical Puzzles 
22.Problem Reasoning 
23.Series Completion 
24.Arithmetic Problems
The Correlation Matrix 
The beginning point for every factor analysis is 
the correlation matrix, a complete table of 
intercorrelations among all the variables. The 
correlations between the 24 ability variables 
discussed here can be found in Table 5.2. The 
reader will notice that variables 7, 8, and 9 do, 
indeed, intercorrelate quite strongly 
(correlations of .62, .69, and .53), as we 
suspected earlier. This pattern of 
intercorrelations is presumptive evidence that 
these variables measure something in common; 
that is, it appears that these tests reflect a 
common underlying factor. However, this kind 
of intuitive factor analysis based on a visual 
inspection of the correlation matrix is hopelessly 
limited; there are just too many intercorrelations 
for the viewer to discern the underlying patterns 
for all the variables. Here is where factor 
analysis can be helpful. Although we cannot 
elucidate the mechanics of the procedure, factor 
analysis relies on modern high-speed computers 
to search the correlation matrix according to 
objective statistical rules and determine the 
smallest number of factors needed to account 
for the observed pattern of intercorrelations. The 
analysis also produces the factor matrix, a table 
showing the extent to which each test loads on 
(correlates with) each of the derived factors, as 
discussed in the following section. 
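As an illustration of this first step, the correlation matrix can be computed directly from an examinees-by-tests score matrix. The data below are placeholders; with the actual Holzinger and Swineford scores, the entries would match those shown in Table 5.2.

```python
# Computing the correlation matrix that serves as input to a factor
# analysis (placeholder data).
import numpy as np

rng = np.random.default_rng(7)
scores = rng.normal(size=(145, 24))    # 145 examinees x 24 tests (placeholder)

R = np.corrcoef(scores, rowvar=False)  # 24 x 24 matrix of intercorrelations
# Inspect the correlations among tests 7, 8, and 9 (0-indexed columns 6-8):
print(R[6:9, 6:9].round(2))
```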
The Factor Matrix and Factor Loadings 
The factor matrix consists of a table of 
correlations called factor loadings. The factor 
loadings (which can take on values from −1.00 
to +1.00) indicate the weighting of each variable 
on each factor. For example, the factor matrix in 
Table 5.3 shows that five factors (labeled I, II, 
III, IV, and V) were derived from the analysis. 
Note that the first variable, Series Completion, 
has a strong positive loading of .71 on factor I, 
indicating that this test is a reasonably good 
index of factor I. Note also that Series 
Completion has a modest negative loading of 
−.11 on factor II, indicating that, to a slight 
extent, it measures the opposite of this factor; 
that is, high scores on Series Completion tend to 
signify low scores on factor II, and vice versa. 
TABLE 5.2 The Correlation Matrix for 24 Ability Variables 

     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 2  32
 3  40 32
 4  47 23 31
 5  32 29 25 23
 6  34 23 27 33 62
 7  30 16 22 34 66 72
 8  33 17 38 39 58 53 62
 9  33 20 18 33 72 71 69 53
10  12 06 08 10 31 20 25 29 17
11  31 15 09 11 34 35 23 30 28 48
12  31 15 14 16 22 10 18 27 11 59 43
13  49 24 32 33 34 31 35 40 28 41 54 51
14  13 10 18 07 28 29 24 25 26 17 35 13 20
15  24 13 07 13 23 25 17 18 25 15 24 17 14 37
16  41 27 26 32 19 29 18 30 24 12 31 12 28 41 33
17  18 01 18 19 21 27 23 26 27 29 36 28 19 34 35 32
18  37 26 21 25 26 17 16 25 21 32 35 35 32 21 33 34 45
19  27 11 31 14 19 25 23 27 27 19 29 11 26 21 19 26 32 36
20  37 29 30 34 40 44 45 43 45 17 20 25 24 30 27 39 26 30 17
21  37 31 17 35 32 26 31 36 27 41 40 36 43 18 23 35 17 36 33 41
22  41 23 25 38 44 39 40 36 48 16 30 19 28 24 25 28 27 32 34 46 37
23  47 35 38 34 44 43 41 50 50 26 25 35 38 24 26 36 29 27 30 51 45 50
24  28 21 20 25 42 43 44 39 42 53 41 41 36 30 17 26 33 41 37 37 45 38 43

Note: Decimals omitted. 
Source: Reprinted with permission from Holzinger, K., & Harman, H. (1941). Factor analysis: A synthesis of factorial methods. Chicago: University of Chicago Press. Copyright © 1941 The University of Chicago Press. 
The factors may seem quite mysterious, but in 
reality they are conceptually quite simple. A 
factor is nothing more than a weighted linear 
sum of the variables; that is, each factor is a 
precise statistical combination of the tests used 
in the analysis. In a sense, a factor is produced 
by “adding in” carefully determined portions of 
some tests and perhaps “subtracting out” 
fractions of other tests. What makes the factors 
special is the elegant analytical methods used to 
derive them. Several different methods exist. 
These methods differ in subtle ways beyond the 
scope of this text; the reader can gather a sense 
of the differences by examining names of 
procedures: principal components factors, 
principal axis factors, method of unweighted 
least squares, maximum-likelihood method, 
image factoring, and alpha factoring 
(Tabachnick & Fidell, 1989). Most of the 
methods yield highly similar results. 
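The idea of a factor as a weighted linear sum can be written out directly. In the sketch below, the weights and the scores are invented placeholders; the extraction methods just named choose the weights according to their own statistical criteria.

```python
# A factor score as a weighted linear sum of standardized test scores
# (weights and data are invented for illustration).
import numpy as np

rng = np.random.default_rng(3)
scores = rng.normal(loc=50, scale=10, size=(145, 24))     # raw scores (placeholder)

z = (scores - scores.mean(axis=0)) / scores.std(axis=0)   # standardize each test
weights = rng.normal(scale=0.2, size=24)                   # per-test weights (placeholder)

factor_scores = z @ weights                                # one factor score per examinee
print(factor_scores.shape)                                 # (145,)
```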
The factor loadings depicted in Table 5.3 are 
nothing more than correlation coefficients 
between variables and factors. These 
correlations can be interpreted as showing the 
weight or loading of each factor on each 
variable. For example, variable 9, the test of 
Word Meaning, has a very strong loading (.69) 
on factor I, modest negative loadings (−.45 and 
−.29) on factors II and III, and negligible 
loadings (.08 and .00) on factors IV and V. 
TABLE 5.3 The Principal Axes Factor Analysis for 24 Variables 

                                      Factors
                                  I      II     III     IV      V
23 Series Completion            0.71  -0.11   0.14   0.11   0.07
 8 Word Classification          0.70  -0.24  -0.15  -0.11  -0.13
 5 General Information          0.70  -0.32  -0.34  -0.04   0.08
 9 Word Meaning                 0.69  -0.45  -0.29   0.08   0.00
 6 Paragraph Comprehension      0.69  -0.42  -0.26   0.08  -0.01
 7 Sentence Completion          0.68  -0.42  -0.36  -0.05  -0.05
24 Arithmetic Problems          0.67   0.20  -0.23  -0.04  -0.11
20 Deduction                    0.64  -0.19   0.13   0.06   0.28
22 Problem Reasoning            0.64  -0.15   0.11   0.05  -0.04
21 Numerical Puzzles            0.62   0.24   0.10  -0.21   0.16
13 Straight and Curved Capitals 0.62   0.28   0.02  -0.36  -0.07
 1 Visual Perception            0.62  -0.01   0.42  -0.21  -0.01
11 Code (Perceptual Speed)      0.57   0.44  -0.20   0.04   0.01
18 Number-Figure                0.55   0.39   0.20   0.15  -0.11
16 Figure Recognition           0.53   0.08   0.40   0.31   0.19
 4 Flags                        0.51  -0.18   0.32  -0.23  -0.02
17 Object-Number                0.49   0.27  -0.03   0.47  -0.24
 2 Cubes                        0.40  -0.08   0.39  -0.23   0.34
12 Count Groups of Dots         0.48   0.55  -0.14  -0.33   0.11
10 Add Digits                   0.47   0.55  -0.45  -0.19   0.07
 3 Paper Form Board             0.44  -0.19   0.48  -0.12  -0.36
14 Word Recognition             0.45   0.09  -0.03   0.55   0.16
15 Number Recognition           0.42   0.14   0.10   0.52   0.31
19 Figure-Word                  0.47   0.14   0.13   0.20  -0.61

Geometric Representation of Factor 
Loadings 
It is customary to represent the first two or three 
factors as reference axes in two- or three-
dimensional space. Within this framework the 
factor loadings for each variable can be plotted 
for examination. In our example, five factors 
were discovered, too many for simple 
visualization. Nonetheless, we can illustrate the 
value of geometric representation by 
oversimplifying somewhat and depicting just 
the first two factors (Figure 5.1). In this graph, 
each of the 24 tests has been plotted against the 
two factors that correspond to axes I and II. The 
reader will notice that the factor loadings on the 
first factor (I) are uniformly positive, whereas 
the factor loadings on the second factor (II) 
consist of a mixture of positive and negative values. 
 
FIGURE 5.1 Geometric Representation of 
the First Two Factors from 24 Ability Tests 
The Rotated Factor Matrix 
An important point in this context is that the 
position of the reference axes is arbitrary. There 
is nothing to prevent the researcher from 
rotating the axes so that they produce a more 
sensible fit with the factor loadings. For 
example, the reader will notice in Figure 5.1 that 
tests 6, 7, and 9 (all language tests) cluster 
together. It would certainly clarify the 
interpretation of factor I if it were to be 
redirected near the center of this cluster (Figure 
5.2). This manipulation would also bring factor 
II alongside interpretable tests 10, 11, and 12 
(all number tests). 
Although rotation can be conducted manually 
by visual inspection, it is more typical for 
researchers to rely on one or more objective 
statistical criteria to produce the final rotated 
factor matrix. Thurstone’s (1947) criteria of 
positive manifold and simple structure are 
commonly applied. In a rotation to positive 
manifold, the computer program seeks to 
eliminate as many of the negative factor 
loadings as possible. Negative factor loadings 
make little sense in ability testing, because they 
imply that high scores on a factor are correlated 
with poor test performance. In a rotation to 
simple structure, the computer program seeks 
to simplify the factor loadings so that each test 
has significant loadings on as few factors as 
possible. The goal of both criteria is to produce 
a rotated factor matrix that is as straightforward 
and unambiguous as possible. 
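For readers curious about what such a rotation involves computationally, the following is a bare-bones varimax routine written from the standard published algorithm; it is a generic sketch, not the specific program used in any particular study. It repeatedly rotates an orthogonal loading matrix so as to maximize the variance of the squared loadings.

```python
# Bare-bones varimax rotation of an orthogonal factor-loading matrix.
# "loadings" is a (tests x factors) array from an unrotated solution.
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient of the varimax criterion; its SVD gives the next rotation.
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3
                          - (gamma / p) * rotated @ np.diag((rotated ** 2).sum(axis=0)))
        )
        rotation = u @ vt
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):
            break                      # improvement has become negligible
        criterion = new_criterion
    return loadings @ rotation
```

Applied to an unrotated solution such as Table 5.3, a routine of this kind (or the varimax option in any standard statistical package) produces a rotated matrix of the type shown in Table 5.4.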
 
FIGURE 5.2 Geometric Representation of 
the First Two Rotated Factors from 24 
Ability Tests 
The rotated factor matrix for this problem is 
shown in Table 5.4. The particular method of 
rotation used here is called varimax rotation. 
Varimax should not be used if the theoretical 
expectation suggests that a general factor may 
occur. Should we expect a general factor in the 
analysis of ability tests? The answer is as much 
a matter of faith as of science. One researcher 
may conclude that a general factor is likely and, 
therefore, pursue a different type of rotation. A 
second researcher may be comfortable with a 
Thurstonian viewpoint and seek multiple ability 
factors using a varimax rotation. We will 
explore this issue in more detail later, but it is 
worth pointing out here that a researcher 
encounters many choice points in the process of 
conducting a factor analysis. It is not surprising, 
then, that different researchers may reach 
different conclusions from factor analysis, even 
when they are analyzing the same data set. 
The Interpretation of Factors 
Table 5.4 indicates that five factors underlie the 
intercorrelations of the 24 ability tests. But what 
shall we call these factors? The reader may find 
the answer to this question disquieting, because 
at this juncture we leave the realm of cold, 
objective statistics and enter the arena of 
judgment, insight, and presumption. In order to 
interpret or name a factor, the researcher must 
make a reasoned judgment about the common 
processes and abilities shared by the tests with 
strong loadings on that factor. For example, in 
Table 5.4 it appears that factor I is verbal ability, 
because the variables with high loadings stress 
verbal skill (e.g., Sentence Completion loads 
.86, Word Meaning loads .84, and Paragraph 
Comprehension loads .81). The variables with 
low loadings also help sharpen the meaning of 
factor I. For example, factor I is not related to 
numerical skill (Numerical Puzzles loads .18) or 
spatial skill (Paper Form Board loads .16). 
Using a similar form of inference, it appears that 
factor II is mainly numerical ability (Add Digits 
loads .85, Count Groups of Dots loads .80). 
Factor III is less certain but appears to be a 
visual-perceptual capacity, and factor IV 
appears to be a measure of recognition. We 
would need to analyze the single test on factor V 
(Figure-Word) to surmise the meaning of this 
factor. 
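Although naming a factor is ultimately a judgment call, the starting point is usually mechanical: list, for each factor, the tests whose rotated loadings exceed some cutoff (a value near .50 is common). The sketch below illustrates that step; the cutoff and the function name are arbitrary choices for illustration.

```python
# Listing the tests that load strongly on each rotated factor.
# "rotated" is a (tests x factors) loading matrix; "names" labels the tests.
import numpy as np

def strong_loadings(rotated, names, cutoff=0.50):
    markers = {}
    for j in range(rotated.shape[1]):
        markers[f"Factor {j + 1}"] = [
            (names[i], round(float(rotated[i, j]), 2))
            for i in np.argsort(-rotated[:, j])   # sort by loading, descending
            if rotated[i, j] >= cutoff
        ]
    return markers
```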
TABLE 5.4 The Rotated Varimax Factor Matrix for 24 Ability Variables 

                                      Factors
                                  I      II     III     IV      V
 7 Sentence Completion          0.86   0.15   0.13   0.03   0.07
 9 Word Meaning                 0.84   0.06   0.15   0.18   0.08
 6 Paragraph Comprehension      0.81   0.07   0.16   0.18   0.10
 5 General Information          0.79   0.22   0.16   0.12  -0.02
 8 Word Classification          0.65   0.22   0.28   0.03   0.21
22 Problem Reasoning            0.43   0.12   0.38   0.23   0.22
10 Add Digits                   0.18   0.85  -0.10   0.09  -0.01
12 Count Groups of Dots         0.02   0.80   0.20   0.03   0.00
11 Code (Perceptual Speed)      0.18   0.64   0.05   0.30   0.17
13 Straight and Curved Capitals 0.19   0.60   0.40  -0.05   0.18
24 Arithmetic Problems          0.41   0.54   0.12   0.16   0.24
21 Numerical Puzzles            0.18   0.52   0.45   0.16   0.02
18 Number-Figure                0.00   0.40   0.28   0.38   0.36
 1 Visual Perception            0.17   0.21   0.69   0.10   0.20
 2 Cubes                        0.09   0.09   0.65   0.12  -0.18
 4 Flags                        0.26   0.07   0.60  -0.01   0.15
 3 Paper Form Board             0.16  -0.09   0.57  -0.05   0.49
23 Series Completion            0.42   0.24   0.52   0.18   0.11
20 Deduction                    0.43   0.11   0.47   0.35  -0.07
15 Number Recognition           0.11   0.09   0.12   0.74  -0.02
14 Word Recognition             0.23   0.10   0.00   0.69   0.10
16 Figure Recognition           0.07   0.07   0.46   0.59   0.14
17 Object-Number                0.15   0.25  -0.06   0.52   0.49
19 Figure-Word                  0.16   0.16   0.11   0.14   0.77

Note: Boldfaced entries signify subtests loading strongly on each factor. 

These results illustrate a major use of factor analysis, namely, the identification of a small number of marker tests from a large test battery. Rather than using a cumbersome battery of 24 tests, a researcher could gain nearly the same information by carefully selecting several tests with strong loadings on the five factors. For example, the first factor is well represented by test 7, Sentence Completion (.86) and test 9, Word Meaning (.84); the second factor is reflected in …
CHAPTER 6 
Group Tests and 
Controversies in Ability 
Testing 
TOPIC 6A Group Tests of Ability and 
Related Concepts 
6.1 Nature, Promise, and Pitfalls of Group Tests 
6.2 Group Tests of Ability 
6.3 Multiple Aptitude Test Batteries 
6.4 Predicting College Performance 
6.5 Postgraduate Selection Tests 
6.6 Educational Achievement Tests 
The practical success of early intelligence 
scales such as the 1905 Binet-Simon test 
motivated psychologists and educators to 
develop instruments that could be administered 
simultaneously to large numbers of examinees. 
Test developers were quick to realize that group 
tests allowed for the efficient evaluation of 
dozens or hundreds of examinees at the same 
time. As reviewed in an earlier chapter, one of 
the first uses of group tests was for screening 
and assignment of military personnel during 
World War I. The need to quickly test thousands 
of Army recruits inspired psychologists in the 
United States, led by Robert M. Yerkes, to make 
rapid advances in psychometrics and test 
development (Yerkes, 1921). Many new 
applications followed immediately—in 
education, industry, and other fields. In Topic 
6A, Group Tests of Ability and Related 
Concepts, we introduce the reader to the varied 
applications of group tests and also review a 
sampling of typical instruments. In addition, we 
explore a key question raised by the 
consequential nature of these tests—can 
examinees boost their scores significantly by 
taking targeted test preparation courses? This is 
but one of many unexpected issues raised by the 
widespread use of group tests. In Topic 6B, Test 
Bias and Other Controversies, we continue a 
reflective theme by looking into test bias and 
other contentious issues in testing. 
6.1 NATURE, PROMISE, AND 
PITFALLS OF GROUP TESTS 
Group tests serve many purposes, but the vast 
majority can be assigned to one of three types: 
ability, aptitude, or achievement tests. In the real 
world, the distinction among these kinds of tests 
often is quite fuzzy (Gregory, 1994a). These 
instruments differ mainly in their functions and 
applications, less so in actual test content. In 
brief, ability tests typically sample a broad 
assortment of proficiencies in order to estimate 
current intellectual level. This information 
might be used for screening or placement 
purposes, for example, to determine the need for 
individual testing or to establish eligibility for a 
gifted and talented program. In contrast, 
aptitude tests usually measure a few 
homogeneous segments of ability and are 
designed to predict future performance. 
Predictive validity is foundational to aptitude 
tests, and often they are used for institutional 
selection purposes. Finally, achievement tests 
assess current skill attainment in relation to the 
goals of school and training programs. They are 
designed to mirror educational objectives in 
reading, writing, math, and other subject areas. 
Although often used to identify educational 
attainment of students, they also function to 
evaluate the adequacy of school educational 
programs. 
Whatever their application, group tests differ 
from individual tests in five ways: 
• Multiple-choice versus open-ended format 
• Objective machine scoring versus examiner 
scoring 
• Group versus individualized administration 
• Applications in screening versus remedial 
planning 
• Huge versus merely large standardization 
samples 
These differences allow for great speed and cost 
efficiency in group testing, but a price is paid 
for these advantages. 
Although the early psychometric pioneers 
embraced group testing wholeheartedly, they 
recognized fully the nature of their Faustian 
bargain: Psychologists had traded the soul of the 
individual examinee in return for the benefits of 
mass testing. Whipple (1910) summed up the 
advantages of group testing but also pointed to 
the potential perils: 
Most mental tests may be administered either to 
individuals or to groups. Both methods 
have advantages and disadvantages. The 
group method has, of course, the particular 
merit of economy of time; a class of 50 or 
100 children may take a test in less than a 
fiftieth or a hundredth of the time needed to 
administer the same test individually. 
Again, in certain comparative studies, e.g., 
of the effects of a week’s vacation upon the 
mental efficiency of school children, it 
becomes imperative that all S’s should take 
the tests at the same time. On the other 
hand, there are almost sure to be some S’s 
in every group that, for one reason or 
another, fail to follow instructions or to 
execute the test to the best of their ability. 
The individual method allows E to detect 
these cases, and in general, by the exercise 
of personal supervision, to gain, as noted 
above, valuable information concerning S’s 
attitude toward the test. 
In sum, group testing poses two interrelated 
risks: (1) some examinees will score far below 
their true ability, owing to motivational 
problems or difficulty following directions and 
(2) invalid scores will not be recognized as 
such, with undesirable consequences for these 
atypical examinees. There is really no simple 
way to entirely avoid these risks, which are part 
of the trade-off for the efficiency of group 
testing. However, it is possible to minimize the 
potentially negative consequences if examiners 
scrutinize very low scores with skepticism and 
recommend individual testing for these cases. 
We turn now to an analysis of group tests in a 
variety of settings, including cognitive tests for 
schools and clinics, placement tests for career 
and military evaluation, and aptitude tests for 
college and postgraduate selection. 
6.2 GROUP TESTS OF 
ABILITY 
Multidimensional Aptitude Battery-II 
(MAB-II) 
The Multidimensional Aptitude Battery-II 
(MAB-II; Jackson, 1998) is a recent group 
intelligence test designed to be a paper-and-
pencil equivalent of the WAIS-R. As the reader 
will recall, the WAIS-R is a highly respected 
instrument (now replaced by the WAIS-III), in 
its time the most widely used of the available 
adult intelligence tests. Kaufman (1983) noted 
that the WAIS-R was “the criterion of adult 
intelligence, and no other instrument even 
comes close.” However, a highly trained 
professional needs about 1½ hours just to 
administer the Wechsler adult test to a single 
person. Because professional time is at a 
premium, a complete Wechsler intelligence 
assessment—including administration, scoring, 
and report writing—easily can cost hundreds of 
dollars. Many examiners have long suspected 
that an appropriate group test, with the attendant 
advantages of objective scoring and 
computerized narrative report, could provide an 
equally valid and much less expensive 
alternative to individual testing for most 
persons. 
The MAB-II was designed to produce subtests 
and factors parallel to the WAIS-R but 
employing a multiple-choice format capable of 
being computer scored. The apparent goal in 
designing this test was to produce an instrument 
that could be administered to dozens or 
hundreds of persons by one examiner (and 
perhaps a few proctors) with minimal training. 
In addition, the MAB-II was designed to yield 
IQ scores with psychometric properties similar 
to those found on the WAIS-R. Appropriate for 
examinees from ages 16 to 74, the MAB-II 
yields 10 subtest scores, as well as Verbal, 
Performance, and Full Scale IQs. 
Although it consists of original test items, the 
MAB-II is mainly a sophisticated subtest-by-
subtest clone of the WAIS-R. The 10 subtests 
are listed as follows: 

Verbal            Performance
Information       Digit Symbol
Comprehension     Picture Completion
Arithmetic        Spatial
Similarities      Picture Arrangement
Vocabulary        Object Assembly
The reader will notice that Digit Span from the 
WAIS-R is not included on the MAB-II. The 
reason for this omission is largely practical: 
There would be no simple way to present a 
Digit-Span-like subtest in paper-and-pencil 
format. In any case, the omission is not serious. 
Digit Span has the lowest correlation with 
overall WAIS-R IQ, and it is widely recognized 
that this subtest makes a minimal contribution to 
the measurement of general intelligence. 
The only significant deviation from the WAIS-R 
is the replacement of Block Design with a 
Spatial subtest on the MAB-II. In the Spatial 
subtest, examinees must mentally perform 
spatial rotations of figures and select one of five 
possible rotations presented as their answer 
(Figure 6.1). Only mental rotations are involved 
(although “flipped-over” versions of the original 
stimulus are included as distractor items). The 
advanced items are very complex and 
demanding. 
The items within each of the 10 MAB-II 
subtests are arranged in order of increasing 
difficulty, beginning with questions and 
problems that most adolescents and adults find 
quite simple and proceeding upward to items 
that are so difficult that very few persons get 
them correct. There is no penalty for guessing 
and examinees are encouraged to respond to 
every item within the time limit. Unlike the 
WAIS-R in which the verbal subtests are 
untimed power measures, every MAB-II subtest 
incorporates elements of both power and speed: 
Examinees are allowed only seven minutes to 
work on each subtest. Including instructions, the 
Verbal and Performance portions of the MAB-II 
each take about 50 minutes to administer. 
The MAB-II is a relatively minor revision of the 
MAB, and the technical features of the two 
versions are nearly identical. A great deal of 
psychometric information is available for the 
original version, which we report here. With 
regard to reliability, the results are generally 
quite impressive. For example, in one study of 
over 500 adolescents ranging in age from 16 to 
20, the internal consistency reliability of Verbal, 
Performance, and Full Scale IQs was in the high 
.90s. Test–retest data for this instrument also 
excel. In a study of 52 young psychiatric 
patients, the individual subtests showed 
reliabilities that ranged from .83 to .97 (median 
of .90) for the Verbal scale and from .87 to .94 
(median of .91) for the Performance scale 
(Jackson, 1984). These results compare quite 
favorably with the psychometric standards 
reported for the WAIS-R. 
Factor analyses of the MAB-II are broadly 
supportive of the construct validity of this 
instrument and its predecessor (Lee, Wallbrown, 
& Blaha, 1990). Most recently, Gignac (2006) 
examined the factor structure of the MAB-II 
using a series of confirmatory factor analyses 
with data on 3,121 individuals reported in 
Jackson (1998). The best fit to the data was 
provided by a nested model consisting of a first-
order general factor, a first-order Verbal 
Intelligence factor, and a first-order 
Performance Intelligence factor. The one caveat 
of this study was that Arithmetic did not load 
specifically on the Verbal Intelligence factor 
independent of its contribution to the general 
factor. 
 
FIGURE 6.1 Demonstration Items from 
Three Performance Tests of the 
Multidimensional Aptitude Battery-II (MAB) 
Source: Reprinted with permission from Jackson, D. N. 
(1984a). Manual for the Multidimensional Aptitude 
Battery. Port Huron, MI: Sigma Assessment Systems, 
Inc. (800) 265–1285. 
Other researchers have noted the strong 
congruence between factor analyses of the 
WAIS-R (with Digit Span removed) and the 
MAB. Typically, separate Verbal and 
Performance factors emerge for both tests 
(Wallbrown, Carmin, & Barnett, 1988). In a 
large sample of inmates, Ahrens, Evans, and 
Barnett (1990) observed validity-confirming 
changes in MAB scores in relation to education 
level. In general, with the possible exception 
that Arithmetic does not contribute reliably to 
the Verbal factor, there is good justification for 
the use of separate Verbal and Performance 
scales on this test. 
In general, the validity of this test rests upon its 
very strong physical and empirical resemblance 
to its parent test, the WAIS-R. Correlational data 
between MAB and WAIS-R scores are crucial in 
this regard. For 145 persons administered the 
MAB and WAIS-R in counterbalanced fashion, 
correlations between subtests ranged from .44 
(Spatial/Block Design) to .89 (Arithmetic and 
Vocabulary), with a median of .78. WAIS-R and 
MAB IQ correlations were very healthy, 
namely, .92 for Verbal IQ, .79 for Performance 
IQ, and .91 for Full Scale IQ (Jackson, 1984a). 
With only a few exceptions, correlations 
between MAB and WAIS-R scores exceed those 
between the WAIS and the WAIS-R. Carless 
(2000) reported a similar, strong overlap 
between MAB scores and WAIS-R scores in a 
study of 85 adults for the Verbal, Performance, 
and Full Scale IQ scores. However, she found 
that 4 of the 10 MAB subtests did not correlate 
with the WAIS-R subscales they were designed 
to represent, suggesting caution in using this 
instrument to obtain detailed information about 
specific abilities. 
Chappelle et al. (2010) obtained MAB-II scores 
for military personnel in an elite training 
program for AC-130 gunship operators. The 
officers who passed training (N = 59) and those 
who failed training (N = 20) scored above 
average (mean Full Scale IQs of 112.5 and 
113.6, respectively), but there were no 
significant differences between the two groups 
on any of the test indices. This is a curious 
result insofar as IQ typically demonstrates at 
least mild predictive potential for real world 
vocational outcomes. Further research on the 
MAB-II as a predictor of real-world results 
would be desirable. 
The MAB-II shows great promise in research, 
career counseling, and personnel selection. In 
addition, this test could function as a screening 
instrument in clinical settings, as long as the 
examiner views low scores as a basis for follow-
up testing with an individual intelligence test. 
Examiners must keep in mind that the MAB-II 
is a group test and, therefore, carries with it the 
potential for misuse in individual cases. The 
MAB-II should not be used in isolation for 
diagnostic decisions or for placement into 
programs such as classes for intellectually gifted 
persons. 
A Multilevel Battery: The Cognitive 
Abilities Test (CogAT) 
One important function of psychological testing 
is to assess students’ abilities that are 
prerequisite to traditional classroom-based 
learning. In designing tests for this purpose, the 
psychometrician must contend with the obvious 
and nettlesome problem that school-aged 
children differ hugely in their intellectual 
abilities. For example, a test appropriate for a 
sixth grader will be much too easy for a tenth 
grader, yet impossibly difficult for a third 
grader. 
The answer to this dilemma is a multilevel 
battery, a series of overlapping tests. In a multi-
level battery, each group test is designed for a 
specific age or grade level, but adjacent tests 
possess some common content. Because of the 
overlapping content with adjacent age or grade 
levels, each test possesses a suitably low floor 
and high ceiling for proper assessment of 
students at both extremes of ability. Virtually 
every school system in the United States uses at 
least one nationally normed multilevel battery. 
The Cognitive Abilities Test (CogAT) is one of 
the best school-based test batteries in current 
use (Lohman & Hagen, 2001). A recent revision 
of the test is the CogAT Multilevel Edition, 
Form 6, released in 2001. Norms for 2005 also 
are available. We discuss this instrument in 
some detail. 
The CogAT evolved from the Lorge-Thorndike 
Intelligence Tests, one of the first group tests of 
intelligence intended for widespread use within 
school systems. The CogAT is primarily a 
measure of scholastic ability but also 
incorporates a nonverbal reasoning battery with 
items that bear no direct relation to formal 
school instruction. The two primary batteries, 
suitable for students in kindergarten through 
third grade, are briefly discussed at the end of 
this section. Here we review the multilevel 
edition intended for students in 3rd through 12th 
grade. 
The nine subtests of the multilevel CogAT are 
grouped into three areas: Verbal, Quantitative, 
and Nonverbal, each including three subtests. 
Representative items for the subtests of the 
CogAT are depicted in Figure 6.2. The tests on 
the Verbal Battery evaluate verbal skills and 
reasoning strategies (inductive and deductive) 
needed for effective reading and writing. The 
tests on the Quantitative Battery appraise 
quantitative skills important for mathematics 
and other disciplines. The Nonverbal Battery 
can be used to estimate cognitive level of 
students with limited reading skill, poor English 
proficiency, or inadequate educational exposure. 
For each CogAT subtest, items are ordered by 
difficulty level in a single test booklet. 
However, entry and exit points differ for each of 
eight overlapping levels (A through H). In this 
manner, grade-appropriate items are provided 
for all examinees. 
The subtests are strictly timed, with limits that 
vary from 8 to 12 minutes. Each of the three 
batteries can be administered in less than an 
hour. However, the manual recommends three 
successive testing days for younger children. 
For older children, two batteries should be 
administered the first day, with a single testing 
period the next. 
 
 
FIGURE 6.2 Subtests and Representative 
Items of the Cognitive Abilities Test, Form 6 
Note: These items resemble those on the CogAT 6. 
Correct answers: 1: B. yogurt (the only dairy product). 
2: D. swim (fish swim in the ocean). 3: E. bottom (the 
opposite of top). 4: A. I is greater than II (4 is greater 
than 2). 5: C. 26 (the algorithm is add 10, subtract 5, 
add 10 . . .). 6: A. −1 (the only answer that fits) 7: A 
(four-sided shape that is filled in). 8: D (same shape, 
bigger to smaller). 9: E (correct answer). 
Raw scores for each battery can be transformed 
into an age-based normalized standard score 
with mean of 100 and standard deviation of 15. 
In addition, percentile ranks and stanines for age 
groups and grade level are also available. 
Interpolation was used to determine fall, winter, 
and spring grade-level norms. 
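The logic of such a conversion can be sketched simply: locate the raw score's percentile rank in the appropriate age-group norm sample, convert the percentile to a normal-deviate z value, and rescale to a mean of 100 and a standard deviation of 15. The norm sample below is invented solely for illustration and does not reflect actual CogAT norms.

```python
# Normalized standard score (mean 100, SD 15) from a raw score,
# using an invented age-group norm sample for illustration.
from statistics import NormalDist

def normalized_standard_score(raw, norm_sample, mean=100.0, sd=15.0):
    # Percentile rank: proportion of the norm sample at or below the raw score.
    pct = sum(x <= raw for x in norm_sample) / len(norm_sample)
    pct = min(max(pct, 0.001), 0.999)     # keep z finite at the extremes
    z = NormalDist().inv_cdf(pct)         # normalizing transformation
    return round(mean + sd * z)

norms = [38, 41, 44, 46, 47, 49, 50, 52, 53, 55, 57, 60, 63, 66, 70]  # invented
print(normalized_standard_score(52, norms))
```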
The CogAT was co-normed (standardized 
concurrently) with two achievement tests, the 
Iowa Tests of Basic Skills and the Iowa Tests of 
Educational Development. Concurrent 
standardization with achievement measures is a 
common and desirable practice in the norming 
of multilevel intelligence tests. The particular 
virtue of joint norming is that the expected 
correspondence between intelligence and 
achievement scores is determined with great 
precision. As a consequence, examiners can 
more accurately identify underachieving 
students in need of remediation or further 
assessment for potential learning disability. 
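One common way to operationalize underachievement is a regression-based discrepancy: predict achievement from ability using the co-normed data, then flag examinees whose obtained achievement falls well below the prediction. The sketch below illustrates the general idea with invented scores and an arbitrary cutoff; it is not the actual CogAT/Iowa norming procedure.

```python
# Illustrative regression-based check for underachievement (invented data).
import numpy as np

rng = np.random.default_rng(4)
ability = rng.normal(100, 15, size=200)                      # invented ability scores
achievement = 0.7 * (ability - 100) + 100 + rng.normal(0, 9, size=200)

slope, intercept = np.polyfit(ability, achievement, 1)       # simple linear regression
predicted = slope * ability + intercept
residual = achievement - predicted

flagged = residual < -1.5 * residual.std()                   # well below expectation
print(int(flagged.sum()), "examinees flagged for follow-up")
```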
The reliability of the CogAT is exceptionally 
good. In previous editions, the Kuder-
Richardson-20 reliability estimates for the 
multilevel batteries averaged .94 (Verbal), .92 
(Quantitative), and .93 (Nonverbal) across all 
grade levels. The six-month test–retest 
reliabilities for alternate forms ranged from .85 
to .93 (Verbal), .78 to .88 (Quantitative), and .81 
to .89 (Nonverbal). 
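For reference, the Kuder-Richardson formula 20 cited here applies to dichotomously scored (right/wrong) items: KR-20 = [k/(k-1)] x [1 - (sum of p_i q_i) / variance of total scores], where p_i is the proportion passing item i. A minimal sketch with a small invented response matrix:

```python
# Kuder-Richardson formula 20 for a 0/1 item-response matrix
# (rows = examinees, columns = items); the data below are invented.
import numpy as np

def kr20(responses):
    k = responses.shape[1]                          # number of items
    p = responses.mean(axis=0)                      # proportion passing each item
    q = 1.0 - p
    total_var = responses.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1.0 - (p * q).sum() / total_var)

responses = np.array([
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
])
print(round(kr20(responses), 2))   # about .90 for this tidy invented pattern
```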
The manual provides a wealth of information on 
content, criterion-related, and construct validity 
of the CogAT; we summarize only the most 
pertinent points here. Correlations between the 
CogAT and achievement batteries are 
substantial. For example, the CogAT verbal 
battery correlates in the .70s to .80s with 
achievement subtests from the Iowa Tests of 
Basic Skills. 
The CogAT batteries predict school grades 
reasonably well. Correlations range from the 
.30s to the .60s, depending on grade level, sex, 
and ethnic group. There does not appear to be a 
clear trend as to which battery is best at 
predicting grade point average. Correlations 
between the CogAT and individual intelligence 
tests are also substantial, typically ranging from 
.65 to .75. These findings speak well for the 
construct validity of the CogAT insofar as the 
Stanford-Binet is widely recognized as an 
excellent measure of individual intelligence. 
Ansorge (1985) has questioned whether all three 
batteries are really necessary. He points out that 
correlations among the Verbal, Quantitative, and 
Nonverbal batteries are substantial. The median 
values across all grades are as follows: 

Verbal and Quantitative        0.78
Nonverbal and Quantitative     0.78
Verbal and Nonverbal           0.72

Since the Quantitative battery offers little uniqueness, from a purely psychometric point of view there is no justification for including it. Nonetheless, the test authors recommend use of all batteries in hopes that differences in performance will assist teachers in remedial planning. However, the test authors do not make 
a strong case for doing this. 
A study by Stone (1994) provides a notable 
justification for using the CogAT as a basis for 
student evaluation. He found that CogAT scores 
for 403 third graders provided an unbiased 
prediction of student achievement that was more 
accurate than teacher ratings. In particular, 
teacher ratings showed bias against Caucasian 
and Asian American students by underpredicting 
their achievement scores. 
Raven’s Progressive Matrices (RPM) 
First introduced in 1938, Raven’s Progressive 
Matrices (RPM) is a nonverbal test of inductive 
reasoning based on figural stimuli (Raven, 
Court, & Raven, 1986, 1992). This test has been 
very popular in basic research and is also used 
in some institutional settings for purposes of 
intellectual screening. 
RPM was originally designed as a measure of 
Spearman’s g factor (Raven, 1938). For this 
reason, Raven chose a special format for the test 
that presumably required the exercise of g. The 
reader is reminded that Spearman defined g as 
the “eduction of correlates.” The term eduction 
refers to the process of figuring out relationships 
based on the perceived fundamental similarities 
between stimuli. In particular, to correctly 
answer items on the RPM, examinees must 
identify a recurring pattern or relationship 
between figural stimuli organized in a 3 × 3 
matrix. The items are arranged in order of 
increasing difficulty, hence the reference to 
progressive matrices. 
Raven’s test is actually a series of three different 
instruments. Much of the confusion about 
validity, factorial structure, and the like stems 
from the unexamined assumption that all three 
forms should produce equivalent findings. The 
reader is encouraged to abandon this 
unwarranted hypothesis. Even though the three 
forms of the RPM resemble one another, there 
may be subtle differences in the problem-
solving strategies required by each. 
The Coloured Progressive Matrices is a 36-item 
test designed for children from 5 to 11 years of 
age. Raven incorporated colors into this version 
of the test to help hold the attention of the young 
children. The Standard Progressive Matrices is 
normed for examinees from 6 years and up, 
although most of the items are so difficult that 
the test is best suited for adults. This test 
consists of 60 items grouped into 5 sets of 12 
progressions. The Advanced Progressive 
Matrices is similar to the Standard version but 
has a higher ceiling. The Advanced version 
consists of 12 problems in Set I and 36 
problems in Set II. This form is especially 
suitable for persons of superior intellect. 
Large sample U.S. norms for the Coloured and 
Standard Progressive Matrices are reported in 
Raven and Summers (1986). Separate norms for 
Mexican American and African American 
children are included. Although there was no 
attempt to use a stratified random-sampling 
procedure, the selection of school districts was 
so widely varied that the American norms for 
children appear to be reasonably sound. Sattler 
(1988) summarizes the relevant norms for all 
versions of the RPM. Raven, Court, and Raven 
(1992) produced new norms for the Standard 
Progressive Matrices, but Gudjonsson (1995) 
has raised a concern that these data are 
compromised because the testing was not 
monitored. 
For the Coloured Progressive Matrices, split-
half reliabilities in the range of .65 to .94 are 
reported, with younger children producing lower 
values (Raven, Court, & Raven, 1986). For the 
Standard Progressive Matrices, a typical split-
half reliability is .86, although lower values are 
found with younger subjects (Raven, Court, & 
Raven, 1983). Test–retest reliabilities for all 
three forms vary considerably from one sample 
to the next (Raven, 1965; Raven et al., 1986). 
For normal adults in their late teens or older, 
reliability coefficients of .80 to .93 are typical. 
However, for preteen children, reliability 
coefficients as low as .71 are reported. Thus, for 
younger subjects, RPM may not possess 
sufficient reliability to warrant its use for 
individual decision making. 
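For readers unfamiliar with the mechanics, split-half coefficients of the kind reported for the RPM are typically obtained by correlating odd-item and even-item half scores and applying the Spearman-Brown correction, as in this illustrative Python sketch. The response matrix is invented, not RPM data.

    import numpy as np

    def split_half_reliability(scores):
        """Odd-even split-half reliability with Spearman-Brown correction.
        `scores` is a persons-by-items matrix of 0/1 item scores."""
        scores = np.asarray(scores, dtype=float)
        odd = scores[:, 0::2].sum(axis=1)       # total score on odd-numbered items
        even = scores[:, 1::2].sum(axis=1)      # total score on even-numbered items
        r_half = np.corrcoef(odd, even)[0, 1]   # correlation between the two half-tests
        return 2 * r_half / (1 + r_half)        # Spearman-Brown step-up to full length

    # Hypothetical responses: 10 examinees, 10 items of increasing difficulty
    items = [[1, 1, 1, 1, 1, 1, 1, 0, 1, 0],
             [1, 1, 1, 1, 1, 0, 1, 0, 0, 0],
             [1, 1, 1, 1, 0, 1, 0, 0, 0, 0],
             [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
             [1, 1, 0, 1, 0, 0, 0, 0, 0, 0],
             [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
             [1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
             [1, 1, 1, 1, 1, 1, 0, 1, 0, 0],
             [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
             [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]]
    print(round(split_half_reliability(items), 2))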
Factor-analytic studies of the RPM provide 
little, if any, support for the original intention of 
the test to measure a unitary construct 
(Spearman’s g factor). Studies of the Coloured 
Progressive Matrices reveal three orthogonal 
factors (e.g., Carlson & Jensen, 1980). Factor I 
consists largely of very difficult items and might 
be termed closure and abstract reasoning by 
analogy. Factor II is labeled pattern completion 
through identity and closure. Factor III consists 
of the easiest items and is defined as simple 
pattern completion (Carlson & Jensen, 1980). In 
sum, the very easy and the very hard items on 
the Coloured Progressive Matrices appear to tap 
different intellectual processes. 
The Advanced Progressive Matrices breaks 
down into two factors that may have separate 
predictive validities (Dillon, Pohlmann, & 
Lohman, 1981). The first factor is composed of 
items in which the solution is obtained by 
adding or subtracting patterns (Figure 6.3a). 
Individuals performing well on these items may 
excel in rapid decision making and in situations 
where part–whole relationships must be 
perceived. The second factor is composed of 
items in which the solution is based on the 
ability to perceive the progression of a pattern 
(Figure 6.3b). Persons who perform well on 
these items may possess good mechanical 
ability as well as good skills for estimating 
projected movement and performing mental 
rotations. However, the skills represented by 
each factor are conjectural at this point and in 
need of independent confirmation. 
A huge body of published research bears on the 
validity of the RPM. The early data are well 
summarized by Burke (1958), while later 
findings are compiled in the current RPM 
manuals (Raven & Summers, 1986; Raven, 
Court, & Raven, 1983, 1986, 1992). In general, 
validity coefficients with achievement tests 
range from the .30s to the .60s. As might be 
expected, these values are somewhat lower than 
found with more traditional (verbally loaded) 
intelligence tests. Validity coefficients with 
other intelligence tests range from the .50s to 
the .80s. 
 
FIGURE 6.3 Raven’s Progressive Matrices: 
Typical Items 
Also, as might be expected, the correlations tend 
to be higher with performance than with verbal 
tests. In a massive study involving thousands of 
schoolchildren, Saccuzzo and Johnson (1995) 
concluded that the Standard Progressive 
Matrices and the WISC-R showed 
approximately equal predictive validity and no 
evidence of differential validity across eight 
different ethnic groups. In a lengthy review, 
Raven (2000) discusses stability and variation in 
the norms for the Raven’s Progressive Matrices 
across cultural, ethnic, and socioeconomic 
groups over the last 60 years. Indicative of the 
continuing interest in this venerable instrument, 
Costenbader and Ngari (2001) describe the 
standardization of the Coloured Progressive 
Matrices in Kenya. Further indicating the huge 
international popularity of the test, Khaleefa and 
Lynn (2008) provide standardization data for 6- 
to 11-year-old children in Yemen. 
Even though the RPM has not lived up to its 
original intentions of measuring Spearman’s g 
factor, the test is nonetheless a useful index of 
nonverbal, figural reasoning. The recent 
updating of norms was a much-welcomed 
development for this well-known test, in that 
many American users were leery of the outdated 
and limited British norms. Nonetheless, adult 
norms for the Standard and Advanced 
Progressive Matrices are still quite limited. 
The RPM is particularly valuable for the 
supplemental testing of children and adults with 
hearing, language, or physical disabilities. Often 
these examinees are difficult to assess with 
traditional measures that require auditory 
attention, verbal expression, or physical 
manipulation. In contrast, the RPM can be 
explained through pantomime, if necessary. 
Moreover, the only output required of the 
examinee is a pencil mark or gesture denoting 
the chosen alternative. For these reasons, the 
RPM is ideally suited for testing persons with 
limited command of the English language. In 
fact, the RPM is about as culturally reduced as 
possible: The test protocol does not contain a 
single word in any language. Mills and Tissot 
(1995) found that the Advanced Progressive 
Matrices identified a higher proportion of 
minority children as gifted than did a more 
traditional measure of academic aptitude (the 
School and College Ability Test). 
Bilker, Hansen, Brensinger, and others (2012) 
developed a …
Journal of Social Issues, Vol. 67, No. 4, 2011, pp. 825--840
Beyond General Intelligence (IQ) and Emotional
Intelligence (EQ): The Role of Cultural Intelligence
(CQ) on Cross-Border Leadership Effectiveness
in a Globalized World
Thomas Rockstuhl∗
Nanyang Technological University
Stefan Seiler
Swiss Military Academy at ETH Zurich
Soon Ang
Nanyang Technological University
Linn Van Dyne
Michigan State University
Hubert Annen
Swiss Military Academy at ETH Zurich
Emphasizing the importance of cross-border effectiveness in the contemporary
globalized world, we propose that cultural intelligence—the leadership capabil-
ity to manage effectively in culturally diverse settings—is a critical leadership
competency for those with cross-border responsibilities. We tested this hypothesis
with multisource data, including multiple intelligences, in a sample of 126 Swiss
military officers with both domestic and cross-border leadership responsibilities.
Results supported our predictions: (1) general intelligence predicted both domes-
tic and cross-border leadership effectiveness; (2) emotional intelligence was a
stronger predictor of domestic leadership effectiveness, and (3) cultural intelli-
gence was a stronger predictor of cross-border leadership effectiveness. Overall,
results show the value of cultural intelligence as a critical leadership competency
in today’s globalized world.
Globalization is a reality in the 21st century workplace. As a consequence,
leaders must function effectively in cross-border situations as well as in domestic
contexts. Leaders working in cross-border contexts must cope effectively with
contrasting economic, political, and cultural practices. As a result, careful selec-
tion, grooming, and development of leaders who can operate effectively in our
globalized environment is a pressing need for contemporary organizations (Avolio,
Walumbwa, & Weber, 2009).
To date, research on leadership effectiveness has been dominantly domestic in
focus, and does not necessarily generalize to global leaders (Gregersen, Morrison,
& Black, 1998; House, Hanges, Javidan, Dorfman, & Gupta, 2004). Hence, there
is a critical need for research that extends our understanding of how differences in
context (domestic vs. cross-border) require different leadership capabilities (Johns,
2006). As we build our arguments, we emphasize the importance of matching
leadership capabilities to the specific context.
Global leaders, like all leaders, are responsible for performing their job re-
sponsibilities and accomplishing their individual goals. Accordingly, general ef-
fectiveness, defined as the effectiveness of observable actions that managers take
to accomplish their goals (Campbell, McCloy, Oppler, & Sager, 1993), is im-
portant for global leaders. We use the term “general” in describing this type of
effectiveness because it makes no reference to culture or cultural diversity. Thus,
it applies to all leader jobs.
Going beyond general effectiveness, it is crucial to recognize the unique
responsibilities that leaders have when their jobs are international in scope and
involve cross-border responsibilities (Spreitzer, McCall, & Mahoney, 1997). Lead-
ership in cross-border contexts requires leaders to (1) adopt a multicultural per-
spective rather than a country-specific perspective; (2) balance local and global
demands which can be contradictory; and (3) work with multiple cultures si-
multaneously rather than working with one dominant culture (Bartlett & Ghoshal,
1992). Thus, we define cross-border effectiveness as the effectiveness of ob-
servable actions that managers take to accomplish their goals in situations char-
acterized by cross-border cultural diversity. This aspect of global leaders’ ef-
fectiveness explicitly recognizes and emphasizes the unique challenges of het-
erogeneous national, institutional, and cultural contexts (Shin, Morgeson, &
Campion, 2007).
Effective leadership depends on the ability to solve complex technical and
social problems (Mumford, Zaccaro, Harding, Jacobs, & Fleishman, 2000). Given
important differences in domestic and cross-border contexts, it is unlikely that
leadership effectiveness is the same in domestic contexts as in cross-border con-
texts. In this article, we aim to shed light on these differences by focusing on
ways that leadership competencies are similar and different in their relevance to
different contexts (domestic vs. cross-border).
Cultural Intelligence and Cross-Border Leadership Effectiveness
When leaders work in cross-border contexts, the social problems of leadership
are especially complex because cultural background influences prototypes and
schemas about appropriate leadership behaviors. For example, expectations about
preferred leadership styles (House et al., 2004), managerial behaviors (Shin et al.,
2007), and the nature of relationships (Yeung & Ready, 1995) are all influenced
by culture. Thus, effective cross-border leadership requires the ability to function
in culturally diverse contexts.
Although general intelligence (Judge, Colbert, & Ilies, 2004) as well as emo-
tional intelligence (Caruso, Mayer, & Salovey, 2002) have been linked to lead-
ership effectiveness in domestic contexts, neither deals explicitly with the ability
to function in cross-border contexts. To address the unique aspects of culturally
diverse settings, Earley and Ang (2003) drew on Sternberg and Detterman’s (1986)
multidimensional perspective on intelligence to develop a conceptual model of cul-
tural intelligence (CQ). Ang and colleagues (Ang & Van Dyne, 2008; Ang et al.,
2007) defined CQ as an individual’s capability to function effectively in situations
characterized by cultural diversity. They conceptualized CQ as a multidimen-
sional concept comprising metacognitive, cognitive, motivational, and behavioral
dimensions.
Metacognitive CQ is an individual’s level of conscious cultural awareness dur-
ing intercultural interactions. It involves higher level cognitive strategies—such
as developing heuristics and guidelines for social interaction in novel cultural
settings—based on deep-level information processing. Those with high metacog-
nitive CQ are consciously aware of the cultural preferences and norms of different
societies prior to and during interactions. They question cultural assumptions and
adjust their mental models about intercultural experiences (Triandis, 2006).
Whereas metacognitive CQ focuses on higher order cognitive processes, cog-
nitive CQ is knowledge of norms, practices, and conventions in different cultures
acquired from education and personal experience. This includes knowledge of
cultural universals as well as knowledge of cultural differences. Those with high
cognitive CQ have sophisticated mental maps of culture, cultural environments,
and how the self is embedded in cultural contexts. These knowledge structures
provide them with a starting point for anticipating and understanding cultural
systems that shape and influence patterns of social interaction within a culture.
Motivational CQ is the capability to direct attention and energy toward learn-
ing about and operating in culturally diverse situations. Kanfer and Heggestad
(1997, p. 39) argued that motivational capacities “provide agentic control of af-
fect, cognition, and behavior that facilitate goal accomplishment.” Expectations
and the value associated with successfully accomplishing a task (Eccles &
Wigfield, 2002) influence the direction and magnitude of energy channeled to-
ward that task. Those with high motivational CQ direct attention and energy toward
cross-cultural situations based on their intrinsic interest in cultures (Deci & Ryan,
1985) and confidence in intercultural effectiveness (Bandura, 2002).
Finally, behavioral CQ is the capability to exhibit culturally appropriate verbal
and nonverbal actions when interacting with people from other cultures. Behav-
ioral CQ also includes judicious use of speech acts—using culturally appropriate
words and phrases in communication. Those with high behavioral CQ demonstrate
flexibility in their intercultural interactions and adapt their behaviors to put others
at ease and facilitate effective interactions.
Rooted in differential biological bases (Rockstuhl, Hong, Ng, Ang, & Chiu,
2011), metacognitive, cognitive, motivational, and behavioral CQ represent qual-
itatively different facets of overall CQ—the capability to function and manage
effectively in culturally diverse settings (Ang & Van Dyne, 2008; Ang et al.,
2007). Accordingly, the four facets are distinct capabilities that together form a
higher level overall CQ construct.
Offermann and Phan (2002) offered three theoretical reasons for why leaders
with high CQ capabilities are better able to manage the culturally diverse ex-
pectations of their followers in cross-border contexts (Avolio et al., 2009). First,
awareness during intercultural interactions allows leaders to understand the impact
of their own culture and background. It gives them insights into how their own
values may bias their assumptions about behaviors in the workplace. It enhances
awareness of the expectations they hold for themselves and others in leader –
follower relationships. Second, high CQ causes leaders to pause and verify the
accuracy of their cultural assumptions, consider their knowledge of other cul-
tures, and hypothesize about possible values, biases, and expectations that may
apply to intercultural interactions. Third, leaders with high CQ combine their
rich understanding of self and others with motivation and behavioral flexibility in
ways that allow them to adapt their leadership behaviors appropriately to specific
cross-cultural situations.
In addition to managing diverse expectations as a function of cultural dif-
ferences, leaders in cross-border contexts also need to effectively manage the
exclusionary reactions that can be evoked by cross-cultural contact (Torelli, Chiu,
Tam, Au, & Keh, 2011). Social categorization theory (Tajfel, 1981; Turner, 1987)
posits that exclusionary reactions to culturally diverse others are initially driven
by perceptions of dissimilarity and viewing others as members of the out-group.
Research demonstrates, however, that those with high CQ are more likely to de-
velop trusting relationships with culturally diverse others and less likely to engage
in exclusionary reactions (Rockstuhl & Ng, 2008). Consistent with our earlier
emphasis on matching capabilities to the context, their results also demonstrated
that CQ did not influence trust when partners were culturally homogeneous.
An increasing amount of research demonstrates the importance of CQ for
performance effectiveness in cross-border contexts (for reviews, see Ang, Van
Dyne, & Tan, 2011; Ng, Van Dyne, & Ang, in press). This includes expatriate per-
formance in international assignments (Chen, Kirkman, Kim, Farh, & Tangirala,
2010), successful intercultural negotiations (Imai & Gelfand, 2010), leadership
potential (Kim & Van Dyne, 2011), and leadership effectiveness in culturally
diverse work groups (Groves & Feyerherm, 2011).
To summarize, theory and research support the notion that leaders with high
CQ should be more effective at managing expectations of culturally diverse others
and minimizing exclusionary reactions that can occur in cross-border contexts.
Thus, we hypothesize that general intelligence will predict leadership effectiveness
in domestic contexts and in cross-border contexts; emotional intelligence will be
a stronger predictor of leadership effectiveness in domestic contexts; and cultural
intelligence will be a stronger predictor of leadership effectiveness in cross-border
contexts.
Method
We tested our hypotheses with field data from 126 military leaders and their
peers studying at the Swiss Military Academy at ETH Zurich. CQ has special
relevance to leadership in military settings because armed forces throughout the
world are increasingly involved in international assignments (Ang & Ng, 2007).
We obtained data from professional officers in a 3-year training program that
focused on developing domestic and cross-border leadership capabilities. Thus,
the sample allows comparison of leadership effectiveness across contexts. During
the program, officers completed domestic assignments (e.g., physical education,
group projects, and general military and leadership training) as well as
cross-border assignments (e.g., international support operations for the UN in
former Yugoslavia and international civil-military collaboration training with U.S.,
EU, and Croatian armed forces). Military contexts represent high-stakes settings
where leadership effectiveness has broad implications for countries, regions, and
in some cases, the world. Poor-quality leadership can exacerbate tensions and
heighten conflict between groups. In addition, it is essential that military leaders
overcome initial exclusionary reactions that can be triggered when interacting
with people from different cultures in high-stress situations. As a result, gaining
a better understanding of general and cross-border leadership effectiveness in this
setting should have important practical implications.
All 126 participants (95% response rate) were male Caucasians with average
previous leadership experience of 6.44 years (SD = 4.79). On average, they had
lived in 1.45 different countries (SD = .91). They had been studying and working
together on a daily basis for at least 7 months prior to the study.
Procedure
Two peers in the program, selected based on cultural diversity, provided rat-
ings of general and cross-border leadership effectiveness, such that those with
French, Italian, or Rhaeto-Romansh background were rated by peers who had a
German background and vice versa. We designed the data collection using peers for
the assessment of leadership effectiveness for four reasons. First, all participants
had extensive previous leadership experience in the military and were knowl-
edgeable observers in these contexts. Second, military mission goals were clearly
specified, and thus peers could readily observe both domestic and cross-border
effectiveness in terms of mission completion. Third, participants worked closely
together and had numerous opportunities to observe peers’ leadership effective-
ness across general and cross-border contexts. Finally, Viswesvaran, Schmidt,
and Ones (2002) showed in their meta-analysis of convergence between peer and
supervisory ratings that leadership is one job performance dimension for which
ratings from these two sources are interchangeable.
Participants provided data on cultural intelligence, emotional intelligence, and
demographic background. In addition, we obtained archival data on general mental
ability and personality. This multisource approach is a strength of the design.
Measures
Peers assessed general leadership effectiveness and cross-border leadership
effectiveness with six items each (1 = strongly disagree; 7 = strongly agree). Ex-
isting leadership effectiveness measures (e.g., Ng, Ang, & Chan, 2008; Offermann,
Bailey, Vasilopoulos, Seal, & Sass, 2004) do not distinguish explicitly between
general and cross-border effectiveness. Thus, we reviewed the literature on general
leadership effectiveness, developed six general leadership items, and then wrote
parallel items that focused specifically on leadership effectiveness in culturally
diverse contexts.
Independent ratings by three subject matter experts (1 = not at all repre-
sentative, 2 = somewhat, 3 = highly representative) provided face validity for the
items (intraclass correlation = .83). Exploratory factor analysis (pilot sample #1:
n = 95) showed two distinct factors (74.49% explained variance), and confirma-
tory factor analysis (CFA) (pilot sample #2: n = 189) demonstrated acceptable fit:
χ²(53 df) = 94.69, p < .05, RMSEA = .066. In the substantive sample, interrater
agreement (rWG(J) = .71–1.00) supported aggregation of peer ratings for general
(α = .91) and cross-border leadership effectiveness (α = .93).
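As an illustrative aside, a multi-item agreement index of this kind is commonly computed with the James, Demaree, and Wolf (1984) rWG(J) formula under a uniform null distribution. The Python sketch below shows that calculation for hypothetical ratings from two peers on six 7-point items; it is not the study's data or code.

    import numpy as np

    def rwg_j(ratings, n_anchors=7):
        """rWG(J) agreement index (James, Demaree, & Wolf, 1984) for one ratee.
        `ratings` is a raters-by-items matrix; uniform null distribution assumed."""
        ratings = np.asarray(ratings, dtype=float)
        n_items = ratings.shape[1]
        obs_var = ratings.var(axis=0, ddof=1).mean()   # mean within-item variance across raters
        null_var = (n_anchors ** 2 - 1) / 12.0         # variance expected under a uniform null
        agreement = 1.0 - obs_var / null_var
        return (n_items * agreement) / (n_items * agreement + obs_var / null_var)

    # Hypothetical: two peers rating one officer on six 7-point leadership items
    peer_ratings = [[6, 5, 6, 7, 6, 5],
                    [6, 6, 5, 6, 6, 5]]
    print(round(rwg_j(peer_ratings), 2))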
We assessed CQ with the previously validated 20-item CQS (Cultural Intel-
ligence Scale: Ang et al., 2007), which is highly reliable and generalizable across
samples and cultures (Van Dyne, Ang, & Koh, 2008). Sample items include:
I check the accuracy of my cultural knowledge as I interact with people from
different cultures; and I alter my facial expressions when a cross-cultural inter-
action requires it (� = .89). CFA analysis of a second-order model demonstrated
good fit to the data: � 2 (40df ) = 58.13, p < .05, RMSEA = .061), so we averaged
the four factors to create our measure of overall CQ. We assessed EQ with 19
items (Brackett, Rivers, Shiffman, Lerner, & Salovey, 2006) and obtained archival
data on general mental ability (the SHL Critical Reasoning Test Battery, 1996) and
Big-Five personality (Donnellan, Oswald, Baird, & Lucas, 2006). These controls
are important because prior research shows CQ is related to EQ (Moon, 2010),
general mental ability (Ang et al., 2007), and personality (Ang, Van Dyne, & Koh,
2006). We also controlled for previous leadership experience (number of years of
full-time job experience with the Swiss Military), international experience (num-
ber of countries participants had lived in), and age because prior research shows
relationships with leadership effectiveness.
Results
CFA analysis supported the discriminant validity of the 10 constructs
(χ²(186 df) = 255.12, p < .05, RMSEA = .046) and the proposed 10-factor model
provided a better fit than plausible alternative models. Table 1 presents descriptive
statistics and correlations. Table 2 summarizes hierarchical regression and relative
weight analyses (Johnson & LeBreton, 2004).
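For readers unfamiliar with the hierarchical procedure summarized in Table 2, the logic is to fit the control block first (Step 1), add the three intelligences (Step 2), and examine the change in R². The Python sketch below illustrates only this ΔR² logic with simulated data; the variable names and effect sizes are placeholders, and the relative weight decomposition of Johnson and LeBreton (2004) is not reproduced here.

    import numpy as np

    def r_squared(y, X):
        """R-squared from an OLS fit of y on X (intercept added internally)."""
        X = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return 1.0 - resid.var() / y.var()

    rng = np.random.default_rng(1)
    n = 126                                      # sample size matching the study
    controls = rng.normal(size=(n, 8))           # age, experience, Big Five (placeholder values)
    iq, eq, cq = rng.normal(size=(3, n))         # placeholder intelligence scores
    y = 0.2 * iq + 0.25 * eq + 0.1 * controls[:, 0] + rng.normal(size=n)  # simulated outcome

    r2_step1 = r_squared(y, controls)
    r2_step2 = r_squared(y, np.column_stack([controls, iq, eq, cq]))
    print(f"Step 1 R2 = {r2_step1:.2f}, Step 2 R2 = {r2_step2:.2f}, delta R2 = {r2_step2 - r2_step1:.2f}")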
As predicted, IQ was positively related to general leadership effectiveness
(β = .23, p < .05) and cross-border leadership effectiveness (β = .18, p < .05),
even after controlling for age, leadership experience, international experience,
Big-Five personality, EQ, and CQ. Thus, general mental ability had implications
for both aspects of leadership effectiveness.
In addition and consistent with our predictions, EQ was positively related to
general leadership effectiveness (β = .27, p < .05) but not to cross-border leader-
ship effectiveness (β = −.07, n.s.), after controlling for age, leadership experience,
international experience, Big-Five personality, IQ, and CQ. Relative weight analy-
sis demonstrated that EQ predicted 25.7% of the variance in general leadership ef-
fectiveness but only 3.5% of the variance in cross-border leadership effectiveness.
Thus, EQ has special relevance to leadership effectiveness in domestic contexts
but not to leadership effectiveness in cross-border contexts.
Finally, CQ was positively related to cross-border leadership effectiveness
(β = .24, p < .05) but not to general leadership effectiveness (β = −.11, n.s.), after
accounting for the controls. Relative weight analysis showed that CQ predicted
24.7% of the variance in cross-border leadership effectiveness and only 4.7% of
the variance in general leadership effectiveness. Thus, results demonstrate the
unique importance of CQ to cross-border leadership effectiveness.
Results also show that previous international experience predicted both
general (β = .30, p < .01) and cross-border leadership effectiveness (β = .35,
Table 1. Means, Standard Deviations, and Correlations

Variable                                       M      SD     1      2      3      4      5      6      7      8      9      10     11     12
1. General leadership effectiveness (a)        5.13   0.66   (.91)
2. Cross-border leadership effectiveness (a)   4.41   0.70   .56**  (.93)
3. General intelligence (b)                    22.06  5.69   .23**  .14    –
4. Emotional intelligence (c)                  4.82   0.62   .26**  .15    .23**  (.76)
5. Cultural intelligence (c)                   5.01   0.71   .17    .33**  .15    .62**  (.89)
6. Agreeableness                               4.38   0.64   .01    .04    .00    .11    .06    (.62)
7. Conscientiousness                           4.77   0.56   −.06   .02    .02    −.05   −.08   .02    (.77)
8. Emotional stability                         4.53   0.63   .01    .01    .13    .16    −.06   .29**  .18*   (.66)
9. Extraversion                                4.52   0.61   .07    .09    .10    .17    .15    .20*   .06    .18*   (.77)
10. Openness to experience                     4.08   0.65   .06    .14    −.06   .09    .20*   .02    .09    −.03   .37**  (.80)
11. Age (in years)                             29.07  3.96   −.08   .11    −.21*  .02    .09    .14    −.13   .03    −.19*  .10    –
12. Leadership experience (in years)           6.44   4.79   −.13   −.04   −.28** −.03   .01    .10    .15    .04    −.10   .12    .55**  –
13. Prior international experience             1.45   0.91   .23**  .38**  −.20*  .01    .25**  .09    −.02   −.21*  .00    .09    .11    .06

Note. N = 126. (a) Observer report. (b) Performance based. (c) Self-report. *p < .05, **p < .01.
Table 2. Hierarchical Regression Results (N = 126)

                                      General leadership            Cross-border leadership
                                      effectiveness                 effectiveness
                                      Step 1    Step 2     RW       Step 1    Step 2     RW
Age (in years)                        −.06      −.05       2.3%     .17       .16        5.6%
Leadership experience (in years)      −.11      −.04       4.0%     −.16      −.11       2.4%
Prior international experience        .25**     .30**      32.9%    .38***    .35***     48.1%
Agreeableness                         −.02      −.03       0.3%     −.04      −.04       0.2%
Conscientiousness                     −.07      −.06       1.8%     .02       .02        0.1%
Emotional stability                   .07       .01        0.7%     .07       .07        0.9%
Extraversion                          .03       .00        0.7%     .07       .03        1.3%
Openness to experience                .05       .06        1.4%     .08       .06        3.6%
General intelligence                            .23*       25.5%              .18*       9.5%
Emotional intelligence                          .27*       25.7%              −.07       3.5%
Cultural intelligence                           −.11       4.7%               .24*       24.7%
F                                     1.32      2.39**              3.24**    3.61***
                                      (8,117)   (11,114)            (8,117)   (11,114)
ΔF                                    1.32      4.89**              3.24**    3.94**
                                      (8,117)   (3,114)             (8,117)   (3,114)
R²                                    .08       .19                 .18       .26
ΔR²                                   .08       .11                 .18       .08
Adjusted R²                           .02       .11                 .13       .19

Note. RW = relative weights in percentage of R² explained. *p < .05, **p < .01, ***p < .001.
p < .001). Surprisingly, previous leadership experience did not predict general
leadership effectiveness (β = −.04, n.s.) or cross-border leadership effectiveness
(β = −.11, n.s.) in our study. While this result is inconsistent with earlier research
that has demonstrated experience can be an important predictor of leadership suc-
cess (Fiedler, 2002), it is also consistent with recent theoretical arguments that
experience may not necessarily translate into effectiveness (Ng, Van Dyne, &
Ang, 2009).
Discussion
This study responds to a recent call for research on the unique aspects of
global leadership and the competencies that predict global leadership effective-
ness (Avolio et al., 2009). As hypothesized, results of our rigorous multisource
research design show differences in predictors of general leadership effectiveness
compared to cross-border leadership effectiveness. Cross-border leaders must
work simultaneously with systems, processes, and people from multiple cultures.
Thus, cultural intelligence—the capability of functioning effectively in multicul-
tural contexts (Earley & Ang, 2003)—is a critical competency of effective global
leaders.
Theoretical Implications
Our findings have important theoretical implications. First, as Chiu, Gries,
Torelli, and Cheng (2011) point out, the outcomes of globalization are uncertain.
Some academics predict a multicultural global village and others expect clashes
between civilizations. As the articles in this issue attest, contextual and psycholog-
ical factors influence the extent to which intercultural contact activates exclusion-
ary or integrative reactions. For example, Morris, Mor, and Mok (2011) highlight
the adaptive value and creative benefits of developing a cosmopolitan identity.
Our findings complement this perspective by emphasizing the importance of cul-
tural intelligence for leadership effectiveness—especially in high-stakes global
encounters, such as cross-border military assignments. In addition, our study of-
fers another perspective because we emphasize the value of theory and research
on the competencies of global leaders that help them perform in global contexts,
rather than focusing on psychological reactions to globalization. Focusing on
competencies suggests exciting opportunities for future research on the dynamic
interaction between globalization and global leaders.
A second set of theoretical implications is based on the context-specific rela-
tionships demonstrated in this study. Specifically, results suggest that EQ and CQ
are complementary because EQ predicted general but not cross-border leadership
while CQ predicted cross-border but not general leadership effectiveness. This
contrasting pattern reinforces the assertion that domestic leader skillsets do not
necessarily generalize to global leader skillsets (Avolio et al., 2009; Caligiuri,
2006). Hence, EQ and CQ are related but distinct forms of social intelligence
(Moon, 2010), and each has context-specific relevance to different aspects of global
leadership effectiveness. Thus, researchers should match types of intelligences to
specifics of the situation to maximize predictive validity of effectiveness.
Practical Implications
Our findings also have practical implications for the selection and develop-
ment of global leaders. First, the significant relationship between general intelli-
gence and both forms of leader effectiveness reinforces the utility of intelligence
as a selection tool for identifying leadership potential. In addition, the incre-
mental validity of emotional and cultural intelligence as predictors of leadership
effectiveness, over and above previous experience, personality, and general intel-
ligence, confirms predictions that social intelligences also contribute to leadership
effectiveness (Riggio, 2002). Accordingly, managers should consider multiple
forms of intelligence when assessing leadership potential, especially when work
roles include responsibility for coordinating complex social interactions.
Given the differential predictive validity of EQ and CQ relative to the two
types of leadership effectiveness in our study, applying the notion of context sim-
ilarity and matching types of intelligence with the leadership context should help
organizations enhance their understanding of what predicts global leader effec-
tiveness. This finding should also help organizations understand why leaders who
are effective in domestic contexts may not be effective in cross-border contexts.
These insights should help organizations tailor leadership development opportuni-
ties to the competency requirements of the situation. When leaders work primarily
in domestic settings, organizations should place more emphasis on developing
within-culture capabilities, such as EQ. In contrast, when leaders work exten-
sively in international or cross-border settings, organizations should emphasize
development of cross-cultural capabilities, such as CQ (Ng, Tan, & Ang, 2011).
Limitations and Future Research
Despite the strength of our multisource design and support for our predictions,
this study has limitations that should help guide future research. First, our cross-
sectional design prevents inferences about the causal direction of relationships.
Thus, we recommend longitudinal field research that assesses capabilities and
leadership effectiveness at multiple points in time.
Second, our study was conducted in a military context and all participants
were male. Thus, we recommend caution in generalizing our findings to other
settings until research can assess whether relationships can be replicated in other
contexts. To address this need, we recommend future research on different types of
intelligences and different aspects of leadership effectiveness in other vocational
settings and different cultures (Gelfand, Erez, & Aycan, 2007).
Third, …
THE RELATIONSHIPS AMONG STERNBERG’S TRIARCHIC 
ABILITIES, GARDNER’S MULTIPLE INTELLIGENCES, AND 
ACADEMIC ACHIEVEMENT
BIRSEN EKINCI
Marmara University
In this study I investigated the relationships among Sternberg’s Triarchic Abilities (STA), 
Gardner’s multiple intelligences, and the academic achievement of children attending 
primary schools in Istanbul, Turkey. Participants were 172 children (93 boys and 81 girls) 
aged between 11 and 12 years. STA Test (STAT) total scores were significantly and positively 
related to linguistic, logical-mathematical, and intrapersonal test scores. Analytical ability 
scores were significantly positively related to only logical-mathematical test scores, practical 
ability scores were only related to intrapersonal test scores, and the STAT subsections were 
significantly related to each other. After removing the effect of multiple intelligences, the 
partial correlations between mathematics, social science, and foreign language course grades 
and creative, practical, analytical, and total STAT scores, were found to be significant for 
creative scores and total STAT scores, but nonsignificant for practical scores and analytical 
STAT scores.
Keywords: Sternberg’s Triarchic Abilities Test, multiple intelligences, academic achievement, 
children, intelligence.
Since 1980 there has been increasing interest in the role of intelligence in 
learning and its impact on student achievement. Similarly to education theorists, 
many researchers on intelligence have been conducting studies to apply theories 
about intelligence, to education in general and, in particular, to the instructional 
context of the classroom (Castejón, Gilar, & Perez, 2008). The main difference 
between contemporary and older approaches to the role of intelligence is that, 
in earlier conceptualizations, intelligence was described as involving one factor 
of general mental ability that encompasses the common variance among all the 
contributing factors. The existence of this general intelligence factor was originally 
hypothesized by Spearman in 1927 and labeled as “g” (see Jensen, 1998). It was 
hypothesized that this g factor exists over and above the various abilities that 
make up intelligence, including verbal, spatial visualization, numerical reasoning, 
mechanical reasoning, and memory (Carroll, 1993). However, according to 
contemporary theories, intelligence must be regarded as existing in various forms 
and the levels of intelligence can be improved through education. The most 
widely accepted comparative theories of intelligences in recent literature are 
Gardner’s (1993) multiple intelligences theory and Sternberg’s (1985) triarchic 
theory of intelligence. Researchers have reported significant differences between 
student outcomes for classroom instruction conducted following the principles 
of multiple intelligences, and student outcomes under traditionally designed 
courses of instruction in science (Özdermir, Güneysu, & Tekkaya, 2006), reading 
(Al-Balhan, 2006), and mathematics (Douglas, Burton, & Reese-Durham, 2008).
Gardner (1993) developed a theory of multiple intelligences that comprises 
seven distinct areas of skills that each person possesses to different degrees. 
Linguistic intelligence (LI) is the capacity to use words effectively, either orally 
or in writing. Logical-mathematical intelligence (LMI) is the capacity to use 
numbers effectively and to reason well. Spatial intelligence (SI) is the ability to 
perceive the visual-spatial world accurately and to interpret these perceptions. 
Bodily-kinesthetic intelligence (KI) involves expertise in using one’s body to 
express ideas and feelings. Musical intelligence (MI) is the capacity to perceive, 
discriminate, and express musical forms. Interpersonal intelligence (INPI) is the 
ability to perceive, and make distinctions in, the moods, intentions, motivations, 
and feelings of other people. Intrapersonal intelligence (INTI) is self-knowledge 
and the ability to act adaptively on the basis of that knowledge. Naturalist 
intelligence (NI) is expertise in the recognition and classification of the numerous 
species – the flora and fauna – of a person’s environment (Armstrong, 2009).
Researchers have addressed the relationship between multiple intelligences 
and metrics of different abilities, and of various psychological constructs. Reid, 
Romanoff, Algozzine, and Udall (2000) showed that SI, LI, and LMI were 
related to scores in a test to measure the nonverbal abilities of pattern completion, 
reasoning by analogy, serial reasoning, and spatial visualization, among a group 
of handicapped and nonhandicapped children aged between 5 and 17 years. 
Furthermore, the effects of multiple intelligences-based teaching strategies on 
students’ academic achievement have been studied extensively (Al-Balhan, 
2006; Douglas et al., 2008; Greenhawk, 1997; Mettetal, Jordan, & Harper, 
1997; Özdermir et al., 2006). In addition, some researchers have investigated 
the relationship between multiple intelligences and academic achievement 
(McMahon, Rose, & Parks, 2004; Snyder, 1999). McMahon and colleagues 
found that, compared with other students, fourth-grade students with higher 
scores on LMI were more likely to demonstrate reading comprehension scores 
at, or above, grade level. In a similar study, Snyder reported a positive correlation 
between high school students’ grade point averages and KI. In the same study 
results showed that there was a positive correlation between the total score for 
the Metropolitan Achievement Test-Reading developed by the Psychological 
Corporation of San Antonio, Texas, USA and the categories of LMI and LI.
Sternberg developed the second well-known intelligence theory. According 
to Sternberg (1999a, 1999b), individuals show their intelligence when they 
apply the information-processing components of intelligence to cope with 
relatively novel tasks and situations. Within this approach to intelligence, 
Sternberg (1985) proposed the triarchic theory of intelligence, according to 
which there are three different, but interrelated, aspects of intellect: (a) analytic 
intelligence, (b) creative intelligence, and (c) practical intelligence. Individuals 
highly skilled in analytical intelligence are adept at analytical thinking, which 
involves applying the components of thinking to abstract, and often academic, 
problems. Individuals who have a high degree of creative intelligence are 
skilled at discovering, creating, and inventing ideas and products. People who 
have a high level of practical intelligence are good at using, implementing, 
and applying ideas and products. Sternberg (1997) developed an instrument, 
the Sternberg Triarchic Abilities Test (STAT), to evaluate triarchically based 
intelligence. In this instrument each aspect of intelligence is tested through 
three modes of presentation of problems: verbal, quantitative, and figural. A 
number of previous researchers have established the construct validity of the 
STAT (Sternberg, Castejón, Prieto, Hautamäki, & Grigorenko, 2001; Sternberg, 
Ferrari, Clinkenbeard, & Grigorenko, 1996). Although Sternberg did not intend 
the STAT to be a measure of general intelligence, as assessed by conventional 
intelligence tests, in related literature (Brody, 2003) there are contradictory 
results and opinions on this issue. Sternberg (2000a, 2000b) has claimed that the 
STAT is independent of measures of general intelligence and a more accurate 
predictor of academic achievement. However, Gottfredson (2002) pointed out 
that the data obtained to support this claim are sparse and suggested that the 
data collected by Sternberg et al. (1996) support the conclusion that the STAT is 
related to other measures of intelligence and may, in fact, be a measure of general 
intelligence. The triarchic abilities are related to different intelligence tests scores 
(e.g., Concept Mastery Test, Watson Glaser Critical Thinking Appraisal, Cattell 
Culture-Fair Test of g; Sternberg et al., 1996). However, Brody (2003) suggested 
that although these correlations are substantial, it is likely that they underestimate 
general intelligence because they were obtained from a sample of high school 
students who were predominately categorized as gifted, as determined by IQ 
scores, and these students were, therefore, likely to record a restricted range of 
scores on the tests.
In the present study I hypothesized that both multiple intelligences total 
scores and STAT total scores would be predictors of academic achievement. 
Specifically, I hypothesized that the LI and LMI, and the analytical STAT, would 
be predictors of student success in the subject areas of mathematics, science, 
social science, and foreign-language learning.
Method
Participants
Participants were 174 randomly selected fifth- and sixth-grade students (81 
girls and 93 boys) attending primary school in Istanbul, Turkey. Students’ ages 
ranged from 11 to 12 years old. 
Instruments
The students completed the Turkish version of Gardner’s Multiple Intelligences 
Inventory (MII; Saban, 2002) to assess participants’ preferred intelligence within 
one of the eight categories: LI, LMI, SI, MI, KI, INPI, INTI, and NI. The possible 
score for the MII ranges from 0 to 80. The individual category in which a student 
has the highest score is considered to be the type of intelligence in which that 
student is most skilled. The overall Cronbach’s alpha reliability coefficient in this 
study was .96, denoting high reliability; .89 for LI; .83 for LMI; .89 for SI; .88 
for MI; .78 for KI; .85 for INPI; .85 for INTI; and .84 for NI. 
The second instrument that I used in this study was Sternberg’s Triarchic 
Abilities Test (STAT). The test comprises 81 items divided across three 
subsections designed to measure analytical, creative, and practical abilities. I 
translated this test into Turkish using the back-translation technique. In order 
to ensure that the back-translation retained the meaning of the original form, I 
conducted validity and reliability checks. The Turkish and the English versions 
of the test were given to 80 bilingual Turkish- and English-speaking students 
to complete within two weeks. Analyses of scores for the Turkish and English 
versions of test completed by these students yielded high correlation values (.85 
for analytical, .79 for practical, and .81 for creative subsections). The overall 
alpha reliability coefficient of this test was .89, and for the subsections it was .80 
for analytical, .77 for practical, and .78 for creative.
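As an illustrative aside, coefficient alpha values such as those reported for the MII and the STAT can be computed from an item-response matrix as in the short Python sketch below; the Likert-type responses shown are invented, not study data.

    import numpy as np

    def cronbach_alpha(items):
        """Cronbach's alpha for a persons-by-items matrix of item scores."""
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1).sum()    # sum of item variances
        total_var = items.sum(axis=1).var(ddof=1)      # variance of total scores
        return (k / (k - 1.0)) * (1.0 - item_vars / total_var)

    # Hypothetical responses: 10 students, 5 items of one subscale (1-5 Likert scale)
    responses = [[4, 5, 4, 4, 5],
                 [2, 2, 3, 2, 2],
                 [5, 5, 4, 5, 5],
                 [3, 3, 3, 2, 3],
                 [4, 4, 5, 4, 4],
                 [1, 2, 1, 2, 1],
                 [3, 4, 3, 3, 4],
                 [5, 4, 5, 5, 4],
                 [2, 3, 2, 2, 3],
                 [4, 4, 4, 5, 4]]
    print(round(cronbach_alpha(responses), 2))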
Procedure
The students completed the instruments during class time and in their 
classrooms. There was no time limit for completion. Each test session lasted 
approximately 60 minutes. The parents of the participating children gave 
permission for the researcher to access the students’ grade point average for 
mathematics, science, social science, and foreign language courses at the end of 
the year during which the study was conducted. Each participant received a pen 
and pencil as a thank-you gift for his/her participation in this study.
Data Analysis
The data were analyzed using SPSS version 15 to conduct correlation analysis 
and multiple regression analysis.
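Although the analyses reported below were run in SPSS, partial correlations of the kind presented (STAT scores with course grades, controlling for multiple intelligences scores) can be reproduced conceptually by residualizing both variables on the covariates and correlating the residuals, as in this illustrative Python sketch. The data and variable names are simulated placeholders, not the study's data.

    import numpy as np

    def partial_corr(x, y, covariates):
        """Partial correlation of x and y controlling for covariates, via OLS residuals."""
        Z = np.column_stack([np.ones(len(x)), covariates])
        resid = lambda v: v - Z @ np.linalg.lstsq(Z, v, rcond=None)[0]
        return np.corrcoef(resid(np.asarray(x, float)), resid(np.asarray(y, float)))[0, 1]

    # Hypothetical data: STAT total, mathematics grade, and eight MII subscale scores
    rng = np.random.default_rng(42)
    n = 172                                          # illustrative sample size
    mii = rng.normal(size=(n, 8))                    # eight multiple-intelligence scores
    stat_total = mii[:, 1] + rng.normal(size=n)      # partly overlaps with one MII scale
    math_grade = 0.5 * stat_total + 0.4 * mii[:, 1] + rng.normal(size=n)
    print(round(partial_corr(stat_total, math_grade, mii), 2))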
Results
As shown in Table 1, the children’s STAT total scores (M = 35.34, SD = 9.09) 
were significantly and positively related to LI (M = 28.98; SD = 7.59), LMI (M = 
30.12, SD = 6.87), and INTI (M = 29.10, SD = 7.15) scores (p < .01). Analytical 
subsection STAT scores (M = 13.76, SD = 3.96) were significantly related to LMI 
scores (p < .01). STAT practical subsection scores (M = 10.37, SD = 
3.06) were significantly correlated only with INTI scores (p < .01).
Table 1. Relationships Among STAT Total Scores, Analytical, Practical, and Creative Ability 
Scores, and Multiple Intelligences Scores
 LI LMI SI MI KI INPI INTI NI
Analytical .303 .413** -.057 .093 .036 .021 .281 -.102
Practical .274 .268 .003 .113 .041 .095 .434** -.109
Creative .291 .540** -.062 .103 .004 -.049 .361* -.098
Total .351* .506** -.051 .123 .031 .019 .425** -.124
Note. ** p < .01, * p < .05. LI = linguistic intelligence, LMI = logical-mathematical intelligence, 
SI = spatial intelligence, MI = musical intelligence, KI = bodily-kinesthetic intelligence, INPI = 
interpersonal intelligence, INTI = intrapersonal intelligence, NI = naturalist intelligence.
 Mathematics course grades (M = 3.78; SD = 1.20) were significantly related 
to the STAT total (p < .001) and to the STAT analytical (p < .001), practical 
(p < .01), and creative (p < .01) subsections. Similarly, social science (M = 3.78, 
SD = 1.10) and science course grades (M = 3.51, SD = 1.40) were significantly 
related to the STAT total (p < .01) and to the STAT analytical (p < .01) and creative 
(p < .01) subsections. However, foreign language course grades (M = 3.57, SD 
= 1.16) were significantly related to all of the subsection scores of the STAT 
(p < .001; see Table 2).
Table 2. Relationships Among STAT Total Scores, Analytical, Practical, and Creative Sub-
section Scores, and Academic Success
 Mathematics Science Social science Foreign language
Analytical .536* .395** .304** .454*
Practical .461** .264 .269 .451*
Creative .491* .378** .307** .442*
Total .588* .415** .347** .527*
Note. * p < .001, ** p < .01.
Mathematics grades of the participants were significantly related to LI (p < 
.01), LMI (p < .01), INPI (p < .05), and INTI (p < .01) scores. Similarly, students’ 
course grades for science were significantly related to LI (p < .05), LMI (p < 
.01), and INTI (p < .05) scores; students’ social science course grades were 
significantly related to LI (p < .05), LMI (p < .01), and INTI (p < .05) scores; 
and students’ course grades for foreign languages were significantly related to LI 
(p < .01), LMI (p < .01) and INTI (p < .01) scores (see Table 3).
Table 3. Relationships Between Multiple Intelligences Scores and Academic Success 
 LI LMI SI MI KI INPI INTI NI
Mathematics .458** .695** .080 .174 .285 .356* .522** .140
Science .340* .575** .007 .070 .239 .312 .379* .085
Social science .359* .598** .125 .118 .217 .319 .356* .139
Foreign language  .484** .718** .211 .201 .260 .316 .495** .227
Note. ** p < .01, * p < .05. LI = linguistic intelligence, LMI = logical-mathematical intelligence, 
SI = spatial intelligence, MI = musical intelligence, KI = bodily-kinesthetic intelligence, INPI = 
interpersonal intelligence, INTI = intrapersonal intelligence, NI = naturalist intelligence.
Multiple regression analyses were conducted in which the variance caused by 
the MII was removed, and partial correlations were computed between course 
grades and children’s STAT total and subsection scores. Separate analyses were 
conducted for each subject area using first the STAT subsections and then using 
just the STAT total scores. Analyses regarding mathematics course grades yielded 
significant partial correlations for the creative subsection score (Pr = .44, p < 
.01) and for the total STAT score (Pr = .62, p < .01), but the partial correlations 
were not significant for the analytical (Pr = .14) and practical (Pr = .05) STAT 
scores. Similarly, the regression analyses predicting students’ science course 
grades yielded significant partial correlations for STAT total scores (Pr = .53, 
p < .01) and for the creative subsection score (Pr = .42, p < .01), but not for 
the analytical (Pr = .14) or practical (Pr = .06) STAT scores. Additionally, when 
I performed the same analyses of social science course grades these yielded 
significant partial correlations with STAT total scores (Pr = .54, p < .01) and 
creative subsection scores (Pr = .34, p < .05) but not with analytical (Pr = .19) or 
practical (Pr = .04) STAT scores. Finally, analyses yielded the same pattern for 
foreign language course grades and STAT total and subsection scores. Regression 
analyses yielded significant partial correlations for practical subsection scores 
(Pr = .41, p < .02) and for total STAT scores (Pr = .61, p < .01). Thus, the total 
STAT scores and creative subsection scores significantly predicted academic 
achievement in mathematics, science, social science, and foreign language 
courses, independent of multiple intelligences scores; however, the analytical 
and practical subsection scores did not. Correspondingly, the partial correlations 
between course grade (for mathematics, social science, science, and foreign 
language) and the MII subsection scores, with the variation caused by the STAT 
removed, were significant only for LMI (Pr = .70, p < .01) scores. This finding 
indicates that, independent of the STAT, only LMI scores predicted achievement 
in any subject area.
Discussion
The results in this study showed that STAT total scores were significantly 
related to LI, LMI, and INTI scores. Analytical subsection STAT scores were 
significantly related to LMI scores. Practical STAT subsection scores were 
significantly correlated only with INTI scores. These results are based on the 
partial correlations between multiple intelligences and STAT scores. However, I 
limited the scope of this study to the students’ own preferences in regard to their 
multiple intelligences. In future studies students’ intelligence types should be 
assessed together with the performances of students on related intelligences for 
different age groups and different subject areas. In the present study mathematics 
course grades were significantly related to STAT total scores and to scores for the 
STAT analytical, practical, and creative abilities subsections. Similarly, science, 
social science, and foreign language course grades were significantly related to 
the LI, LMI, and INTI scores of the participants.
Results of multiple regression analyses indicated that total STAT scores 
and creative ability scores significantly predicted academic achievement in 
mathematics, social science, science, and foreign language learning, independent 
of multiple intelligences scores; however, the analytical and practical ability 
scores did not. These results are consistent with those reported by Sternberg et 
al. (2001), who found that total STAT and creative ability scores significantly 
predicted academic achievement. However, contrary to the findings reported 
by Sternberg et al., in my study the analytical and practical ability scores did 
not relate significantly to academic achievement. On the other hand, Koke and 
Vernon (2003) reported that total STAT scores and only practical ability scores 
predicted psychology course midterm grades of university students. All these 
results might indicate that there may be cultural differences within the dominant 
cognitive abilities represented in the national education systems of various 
countries.
My results in this study also revealed that the partial correlation between 
course grades for all of the subject areas and each of the MII subsection scores, 
with the variation caused by the STAT removed, was significant for only the LMI 
score. This indicates that, independent of the STAT, only LMI scores predicted 
achievement in any subject area. It should also be noted that in this study the 
students’ multiple intelligences scores were based on their own preferences for 
the items representing various kinds of intelligences. In other words, the multiple 
intelligences scores did not indicate the actual performance of the children in 
each type of intelligence. I believe that it would be of value for future researchers 
to test how well the STAT would predict academic achievement for scores on a 
test in which students’ multiple intelligences scores were each taken into account 
separately. The relationship between other tests and STAT scores could also be 
examined with more heterogeneous sample groups.
References
Al-Balhan, E. M. (2006). Multiple intelligence styles in relation to improved academic performance 
in Kuwaiti middle school reading. Digest of Middle East Studies, 15, 18-34. http://doi.org/
cd8zdh
Armstrong, T. (2009). Multiple intelligences in the classroom. Alexandria, VA: ASCD. 
Brody, N. (2003). Construct validation of the Sternberg Triarchic Abilities Test: Comment and 
reanalysis. Intelligence, 31, 319-329. http://doi.org/ffgmzb
Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. New York: 
Cambridge University Press. 
Castejón, J. L., Gilar, R., & Perez, N. (2008). From “g factor” to multiple intelligences: Theoretical 
foundations and implications for classroom practice. In E. P. Velliotis (Ed.), Classroom culture 
and dynamics (pp. 101-127). New York: Nova Science. 
Douglas, O., Burton, K. S., & Reese-Durham, N. R. (2008). The effects of the multiple intelligence 
teaching strategy on the academic achievement of eighth grade math students. Journal of 
Instructional Psychology, 35, 182-187.
Gardner, H. (1993). Frames of mind: The theory of multiple intelligences. New York: Basic.
Gottfredson, L. S. (2002). g: Highly general and highly practical. In R. J. Sternberg & E. L. 
Grigorenko (Eds.), The general intelligence factor: How general is it? (pp. 331-380). Mahwah, 
NJ: Erlbaum.
Greenhawk, J. (1997). Multiple intelligences meet standards. Educational Leadership, 55, 62-64.
Jensen, A. R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger/Greenwood. 
Koke, L. C., & Vernon, P. A. (2003). The Sternberg Triarchic Abilities Test (STAT) as a measure 
of academic achievement and general intelligence. Personality and Individual Differences, 35, 
1803-1807. http://doi.org/fmfpqb
McMahon, S. D., Rose, D., & Parks, M. (2004). Multiple intelligences and reading achievement: 
An examination of the Teele Inventory of Multiple Intelligences. The Journal of Experimental 
Education, 73, 41-52. http://doi.org/bwptfs
Mettetal, G., Jordan, C., & Harper, S. (1997). Attitude toward a multiple intelligences curriculum. 
Journal of Educational Research, 91, 115-122. http://doi.org/dmsgds 
Özdemir, P., Güneysu, S., & Tekkaya, C. (2006). Enhancing learning through multiple intelligences. 
Journal of Biological Education, 40, 74-78. http://doi.org/fn2x6h
Reid, C., Romanoff, B., Algozzine, B., & Udall, A. (2000). An evaluation of alternative screening 
procedures. Journal for the Education of the Gifted, 23, 378-396.
Saban, A. (2002). Öğrenme ve öğretme [Learning and teaching: New theories and approaches]. 
Ankara: Nobel. 
Sternberg, R. J. (1985). Implicit theories of intelligence, creativity, and wisdom. Journal of 
Personality and Social Psychology, 49, 607-627. http://doi.org/cstvmp
Sternberg, R. J. (1993). The Sternberg Triarchic Abilities Test. Unpublished manuscript.
Sternberg, R. J. (1997). The concept of intelligence and its role in lifelong learning and success. 
American Psychologist, 52, 1030-1037. http://doi.org/dzxj2p
Sternberg, R. J. (1999a). Intelligence as developing expertise. Contemporary Educational Psychology, 
24, 359-375. http://doi.org/dzvjsj
Sternberg, R. J. (1999b). The theory of successful intelligence. Review of General Psychology, 3, 
292-316. http://doi.org/cqrkxh
Sternberg, R. J. (2000). The concept of intelligence. In R. J. Sternberg (Ed.), Handbook of intelligence 
(pp. 3-13). New York: Cambridge University Press.
Sternberg, R. J. (2000). Practical intelligence in everyday life. New York: Cambridge University 
Press. 
Sternberg, R. J., Castejón, J. L., Prieto, M. D., Hautamäki, J., & Grigorenko, E. L. (2001). 
Confirmatory factor analysis of the Sternberg Triarchic Abilities Test in three international 
samples: An empirical test of the triarchic theory of intelligence. European Journal of 
Psychological Assessment, 17, 1-16. http://doi.org/cn7tjp
Sternberg, R. J., Ferrari, M., Clinkenbeard, P. R., & Grigorenko, E. L. (1996). Identification, 
instruction, and assessment of gifted children: A construct validation of a triarchic model. Gifted 
Child Quarterly, 40, 129-137. http://doi.org/d3rf9w 
Snyder, R. F. (1999). The relationship between learning styles/multiple intelligences and academic 
achievement of high school students. High School Journal, 83, 11-20. 
Journal of Clinical Child and Adolescent Psychology
2005, Vol. 34, No. 3, 506-522
Copyright © 2005 by
Lawrence Erlbaum Associates, Inc.
Evidence-Based Assessment of Learning Disabilities
in Children and Adolescents
Jack M. Fletcher
Department of Pediatrics and the Center for Academic and Reading Skills,
University of Texas Health Science Center at Houston
David J. Francis
Department of Psychology and the Texas Institute for Measurement, Evaluation and Statistics,
University of Houston
Robin D. Morris
Department of Psychology, Georgia State University
G. Reid Lyon
Child Development and Behavior Branch, National Institute of Child Health and Human Development
The reliability and validity of 4 approaches to the assessment of children and adoles-
cents with learning disabilities (LD) are reviewed, including models based on (a) ap-
titude-achievement discrepancies, (b) low achievement, (c) intra-individual differ-
ences, and (d) response to intervention (RTI). We identify serious psychometric
problems that affect the reliability of models based on aptitude-achievement discrep-
ancies and low achievement. There are also significant validity problems for models
based on aptitude-achievement discrepancies and intra-individual differences. Mod-
els that incorporate RTI have considerable potential for addressing both the reliabil-
ity and validity issues but cannot represent the sole criterion for LD identification. We
suggest that models incorporating both low achievement and RTI concepts have the
strongest evidence base and the most direct relation to treatment. The assessment of
children for LD must reflect a stronger underlying classification that takes into ac-
count relations with other childhood disorders as well as the reliability and validity of
the underlying classification and resultant assessment and identification system. The
implications of this type of model for clinical assessments of children for whom LD is
a concern are discussed.
Assessment methods for identifying children and
adolescents with learning disabilities (LD) are mul-
tiple, varied, and the subject of heated debates among
practitioners. Those debates involve issues that extend
beyond the value of specific tests, often reflecting dif-
ferent views of how LD is best identified. These views
reflect variations in the definition of LD and, therefore,
variations in what measures are selected to opera-
tionalize the definition (Fletcher, Foorman, et al.,
2002). Any focus on the "best tests" leads to a hopeless
Grants from the National Institute of Child Health and Human De-
velopment, P50 21888, Center for Learning and Attention Disorders,
and National Science Foundation 9979968, Early Reading Develop-
ment; A Cognitive Neuroscience Approach supported this article.
We gratefully acknowledge contributions of Rita Taylor to prep-
aration of this article.
Requests for reprints should be sent to Jack M. Fletcher, Depart-
ment of Pediatrics, University of Texas Health Science Center at
Houston, 7000 Fannin Street, UCT 2478, Houston, TX 77030.
E-mail: [email protected]
morass of confusion in an area such as LD that has not
successfully addressed the classification and definition
issues that lead to identification of who does and who
does not possess characteristics of LD. Definitions
always reflect an implicit classification indicating how
different constructs are measured and used to identify
members of the class in terms of similarities and differ-
ences relative to other entities that are not considered
members of the class (Morris & Fletcher, 1988). For
LD, children who are members of this class are his-
torically differentiated from children who have other
achievement-related difficulties, such as mental retar-
dation, sensory disorders, emotional or behavioral dis-
turbances, and environmental causes of underachieve-
ment, including economic disadvantage, minority
language status, and inadequate instruction (Fletcher,
Francis, Rourke, Shaywitz, & Shaywitz, 1993; Lyon,
Fletcher, & Barnes, 2003). If the classification is valid,
children with LD may share characteristics that are
similar with other groups of underachievers, but they
should also differ in ways that can be measured and
that can serve to define and operationalize the class of
children and adolescents with LD.
In this article, we consider evidence-based ap-
proaches to the assessment of LD in the context of differ-
ent approaches to the classification and identification of
LD. We argue that the measurement systems that are
used to identify children and adolescents with LD are in-
separable from the classifications from which the identi-
fication criteria evolve. Moreover, all measurement sys-
tems are imperfect attempts to measure a construct (LD)
that operates as a latent variable that is unknowable in-
dependently of how it is measured and therefore of how
LD is classified. The construct of LD is imperfectly
measured simply because the measurement tools them-
selves are not error free (Francis et al., 2005). Different
approaches to classification and definition capitalize on
this error of measurement in ways that reduce or in-
crease the reliability of the classification itself. Simi-
larly, evaluating similarities and differences among
groups of students who are identified as LD and not LD
is a test of the validity of the underlying classification, so
long as the variables used to assess this form of validity
are not the same as those used for identification (Morris
& Fletcher, 1988). As with any form of validity, ade-
quate reliability is essential. Classifications can be reli-
able and still lack validity. The converse is not true; they
cannot be valid and lack reliability. A valid classifica-
tion of LD predicts important characteristics of the
group. Consistent with the spirit of this special section,
the most important characteristic is whether the classifi-
cation is meaningfully related to intervention. For LD, a
classification should also predict a variety of differences
on cognitive skills, behavioral attributes, and achieve-
ment variables not used to form the classification,
developmental course, response to intervention (RTI),
neurobiological variables, or prognosis (Fletcher, Lyon,
et al., 2002).
To address these issues, we consider the reliability
and validity of four approaches to the classification and
assessment of LD: (a) IQ discrepancy and other forms
of aptitude-achievement discrepancy, (b) low achieve-
ment, (c) intra-individual differences, and (d) models
incorporating RTI and some form of curriculum-based
measurement. We consider how each classification re-
flects the historically prominent concept of "unex-
pected underachievement" as the key construct in LD
assessment (Lyon et al., 2001), that is, what many early
observers characterized as a group of children unable
to master academic skills despite the absence of known
causes of poor achievement (sensory disorder, mental
retardation, emotional disturbances, economic disad-
vantages, inadequate instruction). From this perspec-
tive, a valid classification and measurement system for
LD must identify a unique group of underachievers
that is clearly differentiated from groups with other
forms of underachievement.
Defining LD
Historically, definition and classification issues
have haunted the field of LD. As reviewed in Lyon et
al. (2001), most early conceptualizations viewed LD
simply as a form of "unexpected" underachievement.
The primary approach to assessment involved the iden-
tification of intra-individual variability as a marker for
the unexpectedness of LD, along with the exclusion of
other causes of underachievement that would be ex-
pected to produce underachievement. This type of defi-
nition was explicitly coded into U.S. federal statutes
when LD was identified as an eligibility category for
special education in Public Law 94-142 in 1975; es-
sentially the same definition is part of current U.S. fed-
eral statutes in the Individuals with Disabilities Educa-
tion Act (1997).
The U.S. statutory definition of LD is essentially a
set of concepts that in itself is difficult to operation-
alize. In 1977, recommendations for operationalizing
the federal definition of LD were provided to states af-
ter passage of Public Law 94-142 to help identify chil-
dren in this category of special education (U. S. Office
of Education, 1977). In these regulations, LD was
defined as a heterogeneous group of seven disorders
(oral language, listening comprehension, basic read-
ing, reading comprehension, math calculations, math
reasoning, written language) with a common marker of
intra-individual variability represented by a discrep-
ancy between IQ and achievement (i.e., unexpected
underachievement). Unexpectedness was also indi-
cated by maintaining the exclusionary criteria present
in the statutory definition that presumably lead to ex-
pected underachievement. Other parts of the regula-
tions emphasize the need to ensure that the child's edu-
cational program provided adequate opportunity to
learn. No recommendations were made concerning the
assessment of psychological processes, most likely be-
cause it was not clear that reliable methods existed
for assessing processing skills and because the field
was not clear on what processes should be assessed
(Reschly, Hosp, & Smied, 2003).
This approach to definition is now widely imple-
mented with substantial variability across schools, dis-
tricts, and states in which students are served in special
education as LD (MacMillan & Siperstein, 2002; Mer-
cer, Jordan, Allsop, & Mercer, 1996; Reschly et al.,
2003). It is also the basis for assessments of LD outside
of schools. Consider, for example, the definition of read-
ing disorders in the Diagnostic and Statistical Manual
of Mental Disorders (4th ed.; American Psychiatric As-
sociation, 1994), which indicates that the student must
perform below levels expected for age and IQ, and spec-
ifies only sensory disorders as exclusionary:
A. Reading achievement, as measured by individ-
ually administered standardized tests of read-
ing accuracy or comprehension, is substan-
tially below that expected given the person's
chronological age, measured intelligence, and
age-appropriate education.
B. The disturbance in Criterion A significantly in-
terferes with academic achievement or activi-
ties of daily living that require reading skills.
C. If a sensory deficit is present, the reading diffi-
culties are in excess of those usually associated
with it.
The International Classification of Diseases-10 has
a similar definition. It differs largely in being more spe-
cific in requiring use of a regression-adjusted discrep-
ancy, specifying cut points (achievement two standard
errors below IQ) for identifying a child with LD, and
expanding the range of exclusions.
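To make the regression-adjusted discrepancy idea described above concrete, here is a small sketch. The assumed IQ-achievement correlation, the score metric, and the cut-off are illustrative choices, not the ICD-10's or any test publisher's actual values.

```python
# Illustrative regression-adjusted discrepancy check (not an official scoring
# algorithm): predict achievement from IQ via their assumed population
# correlation, then flag cases whose observed achievement falls well below
# that prediction.
import numpy as np

def regression_adjusted_discrepancy(iq, achievement, r_iq_ach=0.6,
                                    mean=100.0, sd=15.0, criterion=2.0):
    """Both scores on an IQ-style metric (mean 100, SD 15).

    Predicted achievement = mean + r * (iq - mean); the standard error of that
    prediction is sd * sqrt(1 - r**2).  The 0.6 correlation and the two-
    standard-error criterion are assumptions for illustration only.
    """
    iq = np.asarray(iq, dtype=float)
    achievement = np.asarray(achievement, dtype=float)
    predicted = mean + r_iq_ach * (iq - mean)
    resid_sd = sd * np.sqrt(1.0 - r_iq_ach ** 2)
    return (predicted - achievement) > criterion * resid_sd

# IQ 115 with achievement 80 is flagged; IQ 95 with achievement 85 is not.
print(regression_adjusted_discrepancy(iq=[115, 95], achievement=[80, 85]))
```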
Although these definitions are used in what are of-
ten disparate realms of practice, they lead to similar ap-
proaches to the identification of children and adoles-
cents as LD. Across these realms, children commonly
receive IQ and achievement tests. The IQ test is com-
monly interpreted as an aptitude measure or index
against which achievement is compared. Different
achievement tests are used because LD may affect
achievement in reading, math, or written language. The
heterogeneity is recognized explicitly in the U.S. statu-
tory and regulatory definitions of LD (Individuals With
Disabilities Education Act, 1997) and in the psychi-
atric classifications by the provision of separate defini-
tions for each academic domain. However, it is still
essentially the same definition applied in different do-
mains. In many settings, this basic assessment is sup-
plemented with tests of processing skills derived from
multiple perspectives (neuropsychology, information
processing, and theories of LD). The approach boils
down to administration of a battery of tests to identify
LD, presumably with treatment implications.
Underlying Classification Hypotheses
Implicit in all these definitions are slight variations
on a classification model of individuals with LD as
those who show a measurable discrepancy in some but
not all domains of skill development and who are not
identified into another subgroup of poor achievers. In
some instances, the discrepancy is quantified with two
tests in an aptitude-achievement model epitomized by
the IQ-discrepancy approach in the U.S. federal regu-
latory definition and the psychiatric classifications of
the Diagnostic and Statistical Manual of Mental Dis-
orders (4th ed.; American Psychiatric Association,
1994) and the International Classification of Dis-
eases-10. Here the classification model implicitly stip-
ulates that those who meet an IQ-discrepancy
inclusionary criterion are different in meaningful ways
from those who are underachievers and do not meet the
discrepancy criteria or criteria for one of the exclu-
sionary conditions. Some have argued that this model
lacks validity and propose that LD is synonymous with
underachievement, so that it should be identified solely
by achievement tests (Siegel, 1992), often with some
exclusionary criteria to help ensure that the achieve-
ment problem is unexpected. Thus, the contrast is re-
ally between a two-test aptitude-achievement discrep-
ancy and a one-test chronological age-achievement
discrepancy with achievement low relative to age-
based (or grade-based) expectations. If processing
measures are added, the model becomes a multitest
discrepancy model. Identification of a child as LD in
all three of these models is typically based on assess-
ment at a single point in time, so we refer to them as
"status" models. Finally, RTI models emphasize the
"adequate opportunity to learn" exclusionary criterion
by assessing the child's response to different instruc-
tional efforts over time with frequent brief assess-
ments, that is, a "change" model. The child who is LD
becomes one who demonstrates intractability in learn-
ing characteristics by not responding adequately to in-
struction that is effective with most other students.
Dimensional Nature of LD
Each of these four models can be evaluated for reli-
ability and validity. Unexpected underachievement, a
concept critically important to the validity of the under-
lying construct of LD, can also be examined. The reli-
ability issues are similar across the first three models
and stem from the dimensional nature of LD. Most pop-
ulation-based studies have shown that reading and math
skills are normally distributed (Jorm, Share, Matthews,
& Matthews, 1986; Lewis, Hitch, & Walker, 1994;
Rodgers, 1983; Shalev, Auerbach, Manor, & Gross-
Tsur, 2000; Shaywitz, Escobar, Shaywitz, Fletcher, &
Makuch, 1992; Silva, McGee, & Williams, 1985).
These findings are buttressed by behavioral genetic
studies, which are not consistent with the presence of
qualitatively different characteristics associated with
the heritability of reading and math disorders (Fisher &
DeFries, 2002; Gilger, 2002). As dimensional traits that
exist on a continuum, there would be no expectation of
natural cut points that differentiate individuals with LD
from those who are underachievers but not identified as
LD (Shaywitz et al., 1992).
The unobservable nature of LD makes two-test
and one-test discrepancy models unreliable in ways
that are psychometrically predictable but not in ways
that simply equate LD with poor achievement (Fran-
cis et al., 2005; Stuebing et al., 2002). The problem is
that the measurement approach is based on a static
assessment model that possesses insufficient informa-
tion about the underlying construct to allow for reli-
able classifications of individuals along what is es-
sentially an unobservable dimension. If LD was a
manifest concept that was directly observable in the
behavior of affected individuals, or if there were nat-
ural discontinuities that represented a qualitative
breakpoint in the distribution of achievement skills or
the cognitive skills on which achievement depends,
this problem would be less of an obstacle. However,
like achievement or intelligence, LD is a latent con-
struct that must be inferred from the pattern of perfor-
mance on directly observable operationalizations of
other latent constructs (namely, test scores that index
constructs like reading achievement, phonological
awareness, aptitude, and so on). The more informa-
tion available to support the inference of LD, the
more reliable (and valid) that inference becomes, thus
supporting the fine-grained distinctions necessitated
by two-test and one-test discrepancy models. To the
extent that the latent construct, LD, is categorical, by
which we mean that the construct indexes different
classes of learners (i.e., children who learn differ-
ently) as opposed to simply different levels of
achievement, then systems of identification that rely
on one measurable variable lack sufficient informa-
tion to identify the latent classes and assign individu-
als to those classes without placing additional,
untestable, and unsupportable constraints on the sys-
tem. It is simply not possible to use a single mean
and standard deviation and to estimate separate
means and standard deviations for two (or more)
unobservable latent classes of individuals and deter-
mine the percentage of individuals falling into each
class, let alone to classify specific individuals into
those classes. Without constraints, such as specifying
the magnitude of differences in the means of the la-
tent classes, the ratio of standard deviations, and the
odds of membership in the two (or more) classes, the
system is under-identified, which simply means that
there are many different solutions that cannot be dis-
tinguished from one another.
When the system is under-identified, the only solu-
tion is to expand the measurement system to increase the
number of observed relations, which in one sense is
what intra-individual difference models attempt by add-
ing assessments of processing skills. Other criteria are
necessary because it is impossible to uniquely identify a
distinct subgroup of underachieving individuals consis-
tent with the construct of LD when identification is
based on a single assessment at a single time point.
Adding external criteria, such as an aptitude measure or
multiple assessments of processing skills, increases the
dimensionality of the measurement system and makes
latent classification more feasible, even when the other
criteria are themselves imperfect. But the main issues
for one-test, two-test, and multitest identification mod-
els involve the reliability of the underlying classifica-
tions and whether they identify a unique subgroup of un-
derachievers. In the next section, we examine variations
in reliability and validity for each of these models, fo-
cusing on the importance of reliability, as the validity of
the classifications can be no stronger than their reliability.
Models Based on Two-Test
Discrepancies
Although the IQ-discrepancy model is the most
widely utilized approach to identifying LD, there are
many different ways to operationalize the model. For
example, some implementations are based on a com-
posite IQ score, whereas others utilize either a verbal
or nonverbal IQ score. Other approaches drop IQ as the
aptitude measure and use a measure such as listening
comprehension. In the validity section, we discuss
each of these approaches. The reliability issues are
similar for each example of an aptitude-achievement
discrepancy.
Reliability
Specific reliability problems for two-test discrep-
ancy models pertain to any comparison of two corre-
lated assessments that involve the determination of a
child's performance relative to a cut point on a continu-
ous distribution. Discrepancy involves the calculation
of a difference score (D) to estimate the true difference
(A) between two latent constructs. Thus, discussions
about discrepancy must distinguish between problems
with the manifest (i.e., observed) difference (D) as an
index of the true difference (A) but also must consider
whether the true difference (A) reflects the construct of
interest. Problems with the reliability of D based on
differences between two tests are well known, albeit
not in the LD context (Bereiter, 1967). However, there
is nothing that fundamentally limits the applicability of
this research to LD if we are willing to accept a notion
of Δ as a marker for LD. There are major problems
with this assumption that are reviewed in Francis et al.
(2005). The most significant is regression to the mean.
On average, regression to the mean indicates that
scores that are above the mean will be lower when the
test is repeated or when a second correlated test is used
to compute D. In this example, individuals who have
IQ scores above the mean will obtain achievement test
scores that, on average, will be lower than the IQ test
score because the achievement score will move toward
the mean. The opposite is true for individuals with IQ
scores below the mean. This leads to the paradox of
children with achievement scores that exceed IQ, or the
identification of low-achieving, higher IQ children
with achievement above the average range as LD.
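The regression-to-the-mean effect described above is easy to see in a small simulation; the correlation of 0.6 and the sample size below are arbitrary illustrative choices, not values from the cited studies.

```python
# Small simulation of regression to the mean with two correlated, imperfectly
# related measures (an "IQ" test and an "achievement" test), both in z-scores.
import numpy as np

rng = np.random.default_rng(0)
n, r = 100_000, 0.6
cov = [[1.0, r], [r, 1.0]]
iq_z, ach_z = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

high_iq = iq_z > 1.0           # examinees scoring above +1 SD on the IQ test
low_iq = iq_z < -1.0           # examinees scoring below -1 SD on the IQ test

# Their achievement scores move toward the mean (about r times as extreme).
print(iq_z[high_iq].mean(), ach_z[high_iq].mean())

# Consequence for simple difference scores: high-IQ examinees tend to show an
# apparent IQ > achievement "discrepancy" even when nothing is wrong, and
# low-IQ examinees tend to show achievement above IQ.
print((iq_z - ach_z)[high_iq].mean())  # positive on average
print((iq_z - ach_z)[low_iq].mean())   # negative on average
```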
Although adjusting for the correlation of IQ and
achievement helps correct for regression effects (Rey-
nolds, 1984-1985), unreliability also stems from the
attempt to assess a person's standing relative to a cut
point on a continuous distribution. As discussed in the
following section on low achievement models, this
problem makes identification with a single test—even
one with small amounts of measurement error—poten-
tially unreliable, a problem for any status model.
None of this discussion addresses the validity ques-
tion concerning Δ. Specifically, does Δ embody LD as
we would want to conceptualize it (e.g., as unexpected
underachievement), or is Δ merely a convenient con-
ceptualization of LD because it is a conceptualization
that leads directly to easily implemented, operational
definitions, however flawed they might be?
Validity
The validity of the IQ-discrepancy model has been
extensively studied. Two independent meta-analyses
have shown that effect sizes on measures of achieve-
ment and cognitive functions are in the negligible to
small range (at best) for the comparison of groups
formed on the basis of discrepancies between IQ and
reading achievement versus poor readers without an IQ
discrepancy (Hoskyn & Swanson, 2000; Stuebing et
al., 2002), findings similar to studies not included in
these meta-analyses (Stanovich & Siegel, 1994). Other
validity studies have not found that discrepant and
nondiscrepant poor readers differ in long-term prog-
nosis (Francis, Shaywitz, Stuebing, Shaywitz, & Flet-
cher, 1996; Silva et al., 1985), response to instruction
(Fletcher, Lyon, et al., 2002; Jimenez et al., 2003;
Stage, Abbott, Jenkins, & Berninger, 2003; Vellutino,
Scanlon, & Jaccard, 2003), or neuroimaging correlates
(Lyon et al., 2003; but also see Shaywitz et al., 2003,
which shows differences in groups varying in IQ but
not IQ discrepancy). Studies of genetic variability
show negligible to small differences related to IQ-dis-
crepancy models that may reflect regression to the
mean (Pennington, Gilger, Olson, & DeFries, 1992;
Wadsworth, Olson, Pennington, & DeFries, 2000).
Similar empirical evidence has been reported for LD in
math and language (Fletcher, Lyon, et al., 2002;
Mazzocco & Myers, 2003). This is not surprising given
that the problems are inherent in the underlying
psychometric model and have little to do with the spe-
cific measures involved in the model except to the ex-
tent that specific test reliabilities and intertest correla-
tions enter into the equations.
Despite the evidence of weak validity for the practice
of differentiating discrepant and nondiscrepant stu-
dents, alternatives based on discrepancy models con-
tinue to be proposed, and psychologists outside of
schools commonly implement this flawed model. How-
ever, given the reliability problems inherent in IQ dis-
crepancy models, it is not surprising that these other at-
tempts to operationalize aptitude-achievement
discrepancy have not met with success. In the Stuebing
et al. (2002) meta-analysis, 32 of the 46 major studies
had a clearly defined aptitude measure. Of these studies,
19 used Full Scale IQ, 8 used Verbal IQ, 4 used Perfor-
mance IQ, and 1 study used a discrepancy of listening
comprehension and reading comprehension. Not sur-
prisingly, these different discrepancy models did not
yield results that were different from those when a com-
posite IQ measure was utilized. Neither Fletcher et al.
(1994) nor Aaron, Kuchta, and Grapenthin (1988) were
able to demonstrate major differences between discrep-
ant and low achievement groups formed on the basis of
listening comprehension and reading comprehension.
The differences in these models involve slight
changes in who is identified as discrepant or low
achieving depending on the cut point and the correla-
tion of the aptitude and achievement measures. The
changes simply reflect fluctuations around the cut
point where children are most similar. It is not surpris-
ing that effect sizes comparing poor achievers with and
without IQ discrepancies are uniformly low across
these different models. Current practices based on this
approach to identification of LD epitomized by the
federal regulatory definition and psychiatric classifica-
tions are fundamentally flawed.
One-Test (Low Achievement) Models
Reliability
The measurement problems that emerge when a
specific cut point is used for identification purposes af-
fect any psychometric approach to LD identification.
These problems are more significant when the test
score is not criterion referenced, or when the score dis-
tributions have been smoothed to create a normal uni-
variate distribution. To reiterate, the presence of a natu-
ral breakpoint in the score distribution, typically
observed in multimodal distributions, would make it
simple to validate cut points. But natural breaks are not
usually apparent in achievement distributions because
reading and math achievement distributions are nor-
mal. Thus, LD is essentially a dimensional trait, or a
variation on normal development.
Regardless of normality, measurement error attends
any psychometric procedure and affects cut points in a
normal distribution (Shepard, 1980). Because of mea-
surement error, any cut point set on the observed distri-
bution will lead to instability in the identification of
class members because observed test scores will fluc-
tuate around the cut point with repeated testing or use
of an alternative measure of the same construct (e.g.,
two reading tests). This fluctuation is not just a prob-
lem of correlated tests or simply a matter of setting
better cut scores or developing better tests. Rather, no
single observed test score can capture perfectly a stu-
dent's ability on an imperfectly measured latent vari-
able. The fluctuation in identifications will vary across
different tests, depending in part on the measurement
error. In both real and simulated data sets, fluctuations
in up to 35% of cases are found when a single test is
used to identify a cut point. Similar problems are ap-
parent if a two-test discrepancy model is used (Francis
et al., 2005; Shaywitz et al., 1992).
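A brief simulation of the classification instability described above follows; the assumed reliability of 0.85 and the 25th-percentile cut are illustrative choices, not the values used by Francis et al. (2005).

```python
# Sketch of cut-point instability: with an imperfectly reliable test, a fixed
# cut applied to two parallel administrations reclassifies a noticeable share
# of examinees near the cut.
import numpy as np

rng = np.random.default_rng(1)
n, reliability = 100_000, 0.85

true_score = rng.normal(size=n)
error_sd = np.sqrt(1.0 / reliability - 1.0)  # so var(true)/var(observed) = reliability
test1 = true_score + rng.normal(scale=error_sd, size=n)
test2 = true_score + rng.normal(scale=error_sd, size=n)

cut = np.percentile(test1, 25)               # "low achievement" cut set on test 1
below1, below2 = test1 < cut, test2 < cut
changed = below1 != below2

print(f"classified low on test 1: {below1.mean():.2%}")
print(f"classification changed on retest: {changed.mean():.2%}")
print(f"of those low on test 1, reclassified on test 2: {changed[below1].mean():.2%}")
```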
This problem is less of an issue for research, which
rarely hinges on the identification of individual chil-
dren. Thus, it does not have great impact on the validity
of a low achievement classification because, on aver-
age, children around the cut point who may be fluctuat-
ing in and out of the class of interest with repeated test-
ing are not very different. However, the problems for
an individual child who is being considered for special
education placement or a psychiatric diagnosis are ob-
vious. A positive identification in either example often
carries a poor prognosis.
Validity
Models based on the use of achievement markers
can be shown to have a great deal of validity (see
Fletcher, Lyon, et al., 2002; Fletcher, Morris, & Lyon,
2003; Siegel, 1992). In this respect, if groups are
formed such that the participants do not meet criteria
for mental retardation and have achievement scores
that are below the 25th percentile, a variety of compari-
sons show that subgroups of underachievers emerge
that can be validly differentiated on external variables
and help demonstrate the viability of the construct of
LD. For example, if children with reading and math
disabilities identified in this manner are compared to
typical achievers, it is possible to show that these three
groups display different cognitive correlates. In addi-
tion, neurobiological studies show that these groups
differ both in the neural correlates of reading and math
performance as well as the heritability of reading and
math disorders (Lyon et al., 2003). These achievement
subgroups, which by definition include children who
meet either low achievement or IQ-discrepancy crite-
ria, even differ in RTI, providing strong evidence for
"aptitude by treatment" interactions; math interven-
tions provided for children with reading problems are
demonstrably ineffective, and vice versa.
Despite this evidence for validity, concerns emerge
about definitions based solely on achievement cut
points. Simply utilizing a low achievement definition,
even when different exclusionary criteria are applied,
does not operationalize the true meaning of unexpected
underachievement. Although such an approach to
identification is deceptively simple, …
Neuron
Article
Fractionating Human Intelligence
Adam Hampshire,1,* Roger R. Highfield,2 Beth L. Parkin,1 and Adrian M. Owen1
1The Brain and Mind Institute, The Natural Sciences Centre, Department of Psychology, The University of Western Ontario,
London ON, N6A 5B7, Canada
2Science Museum, Exhibition Road, London SW72DD, UK
*Correspondence: [email protected]
http://dx.doi.org/10.1016/j.neuron.2012.06.022
SUMMARY
What makes one person more intellectually able
than another? Can the entire distribution of human
intelligence be accounted for by just one general
factor? Is intelligence supported by a single neural
system? Here, we provide a perspective on human
intelligence that takes into account how general
abilities or ‘‘factors’’ reflect the functional organiza-
tion of the brain. By comparing factor models of
individual differences in performance with factor
models of brain functional organization, we demon-
strate that different components of intelligence
have their analogs in distinct brain networks. Using
simulations based on neuroimaging data, we show
that the higher-order factor ‘‘g’’ is accounted for
by cognitive tasks corecruiting multiple networks.
Finally, we confirm the independence of these com-
ponents of intelligence by dissociating them using
questionnaire variables. We propose that intelli-
gence is an emergent property of anatomically
distinct cognitive systems, each of which has its
own capacity.
INTRODUCTION
Few topics in psychology are as old or as controversial as
the study of human intelligence. In 1904, Charles Spearman
famously observed that performance was correlated across
a spectrum of seemingly unrelated tasks (Spearman, 1904).
He proposed that a dominant general factor ‘‘g’’ accounts for
correlations in performance between all cognitive tasks, with
residual differences across tasks reflecting task-specific fac-
tors. More controversially, on the basis of subsequent attempts
to measure ‘‘g’’ using tests that generate an intelligence quotient
(IQ), it has been suggested that population variables including
gender (Irwing and Lynn, 2005; Lynn, 1999), class (Burt, 1959,
1961; McManus, 2004), and race (Rushton and Jensen, 2005)
correlate with ‘‘g’’ and, by extension, with one’s genetically pre-
determined potential. It remains unclear, however, whether
population differences in intelligence test scores are driven by
heritable factors or by other correlated demographic variables
such as socioeconomic status, education level, and motivation
(Gould, 1981; Horn and Cattell, 1966). More relevantly, it is
questionable whether they relate to a unitary intelligence factor,
as opposed to a bias in testing paradigms toward particular
components of a more complex intelligence construct (Gould,
1981; Horn and Cattell, 1966; Mackintosh, 1998). Indeed, over
the past 100 years, there has been much debate over whether
general intelligence is unitary or composed of multiple factors
(Carroll, 1993; Cattell, 1949; Cattell and Horn, 1978; Johnson
and Bouchard, 2005). This debate is driven by the observation
that test measures tend to form distinctive clusters. When
combined with the intractability of developing tests that mea-
sure individual cognitive processes, it is likely that a more
complex set of factors contribute to correlations in performance
(Carroll, 1993).
Defining the biological basis of these factors remains a
challenge, however, due in part to the limitations of behavioral
factor analyses. More specifically, behavioral factor analyses
do not provide an unambiguous model of the underlying cogni-
tive architecture, as the factors themselves are inaccessible,
being measured indirectly by estimating linear components
from correlations between the performance measures of dif-
ferent tests. Thus, for a given set of behavioral correlations, there
are many factor solutions of varying degrees of complexity, all
of which are equally able to account for the data. This ambiguity
is typically resolved by selecting a simple and interpretable
factor solution. However, interpretability does not necessarily
equate to biological reality. Furthermore, the accuracy of any
factor model depends on the collection of a large number of pop-
ulation measures. Consequently, the classical approach to intel-
ligence testing is hampered by the logistical requirements of pen
and paper testing. It would appear, therefore, that the classical
approach to behavioral factor analysis is near the limit of its
resolution.
Neuroimaging has the potential to provide additional con-
straint to behavioral factor models by leveraging the spatial
segregation of functional brain networks. For example, if one
homogeneous system supports all intelligence processes, then
a common network of brain regions should be recruited when-
ever difficulty increases across all cognitive tasks, regardless
of the exact stimulus, response, or cognitive process that is
manipulated. Conversely, if intelligence is supported by multiple
specialized systems, anatomically distinct brain networks
should be recruited when tasks that load on distinct intelligence
factors are undertaken. On the surface, neuroimaging results
accord well with the former account. Thus, a common set of
frontal and parietal brain regions is rendered when peak activa-
tion coordinates from a broad range of tasks that parametrically
modulate difficulty are smoothed and averaged (Duncan and
Owen, 2000). The same set of multiple demand (MD) regions is
activated during tasks that load on ‘‘g’’ (Duncan, 2005; Jung
and Haier, 2007), while the level of activation within frontoparietal
cortex correlates with individual differences in IQ score (Gray
et al., 2003). Critically, after brain damage, the size of the lesion
within, but not outside of, MD cortex is correlated with the esti-
mated drop in IQ (Woolgar et al., 2010). However, these results
should not necessarily be equated with a proof that intelligence
is unitary. More specifically, if intelligence is formed from multiple
cognitive systems and one looks for brain responses during
tasks that weigh most heavily on the ‘‘g’’ factor, one will most
likely corecruit all of those functionally distinct systems. Similarly,
by rendering brain activation based on many task demands,
one will have the statistical power to render the networks
that are most commonly recruited, even if they are not always
corecruited. Indeed, there is mounting evidence demonstrating
that different MD regions respond when distinct cognitive
demands are manipulated (Corbetta and Shulman, 2002;
D’Esposito et al., 1999; Hampshire and Owen, 2006; Hampshire
et al., 2008, 2011; Koechlin et al., 2003; Owen et al., 1996; Pet-
rides, 2005). However, such a vast array of highly specific func-
tional dissociations have been proposed in the neuroimaging
literature as a whole that they often lack credibility, as they fail
to account for the broader involvement of the same brain regions
in other aspects of cognition (Duncan and Owen, 2000; Hamp-
shire et al., 2010). The question remains, therefore, whether intel-
ligence is supported by one or multiple systems, and if the latter
is the case, which cognitive processes those systems can most
broadly be described as supporting. Furthermore, even if
multiple functionally distinct brain networks contribute to intelli-
gence, it is unknown whether the capacities of those networks
are independent or are related to the same set of diffuse biolog-
ical factors that modulate general neural efficiency. It is unclear,
therefore, whether the pattern of individual differences in intelli-
gence reflects the functional organization of the brain.
Here, we address the question of whether human intelligence
is best conceived of as an emergent property of functionally
distinct brain networks using factor analyses of brain imag-
ing, behavioral, and simulated data. First, we break MD cortex
down into its constituent functional networks by factor
analyzing regional activation levels during the performance of
12 challenging cognitive tasks. Then, we build a model, based
on the extent to which the different functional networks are
recruited during the performance of those 12 tasks, and deter-
mine how well that model accounts for cross-task correlations
in performance in a large (n = 44,600) population sample.
Factor solutions, generated from brain imaging and behavioral
data, are compared directly, to answer the question of whether
the same set of cognitive entities is evident in the functional
organization of the brain and in individual differences in perfor-
mance. Simulations, based on the imaging data, are used to
determine the extent to which correlations between first-order
behavioral components are predicted by cognitive tasks re-
cruiting multiple functional brain networks, and the extent to
which those correlations may be accounted for by a spatially
diffuse general factor. Finally, we examine whether the behav-
ioral components of intelligence show a degree of indepen-
dence, as evidenced by dissociable correlations with the types
of questionnaire variable that ‘‘g’’ has historically been associ-
ated with.
RESULTS AND DISCUSSION
Identifying Functional Networks within MD Cortex
Sixteen healthy young participants undertook the cognitive
battery in the MRI scanner. The cognitive battery consisted of
12 tasks, which, based on well-established paradigms from
the neuropsychology literature, measured a range of the types
of planning, reasoning, attentional, and working memory skills
that are considered akin to general intelligence (see Supple-
mental Experimental Procedures available online). The activation
level of each voxel within MD cortex was calculated separately
for each task relative to a resting baseline using general linear
modeling (see Supplemental Experimental Procedures) and the
resultant values were averaged across participants to remove
between-subject variability in activation—for example, due to
individual differences in regional signal intensity.
The question of how many functionally distinct networks were
apparent within MD cortex was addressed using exploratory
factor analysis. Voxels within MD cortex (Figure 1A) were
transformed into 12 vectors, one for each task, and these were
examined using principal components analysis (PCA), a factor
analysis technique that extracts orthogonal linear components
from the 12-by-12 matrix of task-task bivariate correlations.
The results revealed two ‘‘significant’’ principal components,
each of which explained more variability in brain activation than
was contributed by any one task. These components accounted
for approximately 90% of the total variance in task-related activation across
MD cortex (Table S1). After orthogonal rotation with the Varimax
algorithm, the strengths of the task-component loadings were
highly variable and easily comprehensible (Table 1 and Figure 1B).
Specifically, all of the tasks in which information had to be actively
maintained in short-term memory, for example, spatial working
memory, digit span, and visuospatial working memory, loaded
heavily on one component (MDwm). Conversely, all of the tasks
in which information had to be transformed in mind according
to logical rules, for example, deductive reasoning, grammatical
reasoning, spatial rotations, and color-word remapping, loaded
heavily on the other component (MDr). When factor scores
were generated at each voxel using regression and projected
back onto the brain, two clearly defined functional networks
were rendered (Figure 1D). Thus, the insula/frontal operculum
(IFO), the superior frontal sulcus (SFS), and the ventral portion
of the anterior cingulate cortex/ presupplementary motor area
(ACC/preSMA) had greater MDwm component scores, whereas
the inferior frontal sulcus (IFS), inferior parietal cortex (IPC), and
the dorsal portion of the ACC/preSMA had greater MDr compo-
nent scores. When the PCA was rerun with spherical regions of
interest (ROIs) centered on each MD subregion, with radii that
varied from 10 to 25 mm in 5 mm steps and excluding voxels
that were on average deactivated, the task loadings correlated
with those from the MD mask at r > 0.95 for both components
and at all radii. Thus, the PCA solution was robust against varia-
tions in the extent of the ROIs. When data from the whole brain
were analyzed using the same method, three significant compo-
nents were generated, the first two of which correlated with those
from the MD cortex analysis (MDr r = 0.76, MDwm r = 0.83),
demonstrating that these were the most prominent active-state
networks in the brain.

Figure 1. Factor Analyzing Functional Brain Imaging Data from within Multiple Demand Cortex
(A) The MD cortex ROIs.
(B) PCA of the average activation patterns within MD cortex for each task (x axis reports task-component loading).
(C) PCA with each individual’s data included as separate columns (error bars report SEM).
(D) Component scores from the analysis of MD task-related activations averaged across individuals. Voxels that loaded more heavily on the MDwm component are displayed in red. Voxels that loaded more heavily on the MDr network are displayed in blue.
(E) T contrasts of component scores against zero from the PCA with individual data concatenated into 12 columns (FDR corrected at p < 0.05 for all MD voxels).

The factor solution was also reliable at
the individual subject level. Rerunning the same PCA on each
individual’s data generated solutions with two significant compo-
nents in 13/16 cases. There was one three-component solution
and two four-component solutions. Rerunning the two-compo-
nent PCA with each individual’s data set included as 12 separate
columns (an approach that did not constrain the same task to
load on the same component across participants) demonstrated
that the pattern of task-component loadings was also highly reli-
able at the individual subject level (Figure 1C). In order to test the
reliability of the functional networks across participants, the data
were concatenated instead of averaged into 12 columns (an
approach that does not constrain the same voxels to load on
the same components across individuals), and component
scores were estimated at each voxel and
projected back into two sets of 16 brain
maps. When t contrasts were calculated
against zero at the group level, the same
MDwm and MDr functional networks
were rendered (Figure 1E).
While the PCA works well to identify the
number of significant components, a
potential weakness for this method is
that the unrotated task-component load-
ings are liable to be formed from mixtures
of the underlying factors and are heavily
biased toward the component that is ex-
tracted first. This weakness necessitates
the application of rotation to the task-
component matrix; however, rotation is
not perfect, as it identifies the task-
component loadings that fit an arbitrary
set of criteria designed to generate the
simplest and most interpretable solution.
To deal with this potential issue, the task-
functional network loadings were recalcu-
lated using independent component anal-
ysis (ICA), an analysis technique that
exploits the more powerful properties of
statistical independence to extract the sources from mixed
signals. Here, we used ICA to extract two spatially distinct func-
tional brain networks using gradient ascent toward maximum
entropy (code adapted from Stone and Porrill, 1999). The resultant
components were broadly similar, although not identical, to those
from the PCA (Table 1). More specifically, all tasks loaded posi-
tively on both independent brain networks but to highly varied
extents, with the short-term memory tasks loading heavily on
one component and the tasks that involved transforming informa-
tion according to logical rules loading heavily on the other. Based
on these results, it is reasonable to conclude that MD cortex is
formed from at least two functional networks, with all 12 cognitive
tasks recruiting both networks but to highly variable extents.
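The voxels-by-tasks decomposition described above can be sketched as follows. The data here are random placeholders with the article's matrix dimensions (2,275 voxels by 12 tasks); the varimax routine is a generic textbook implementation, not the authors' code.

```python
# Minimal sketch of a voxels-by-tasks PCA with varimax rotation, on synthetic
# data standing in for MD-cortex activation estimates.
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonal varimax rotation of a (variables x components) loadings matrix."""
    p, k = loadings.shape
    rotation, prev = np.eye(k), 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3
                          - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p))
        rotation = u @ vt
        if prev and s.sum() / prev < 1 + tol:
            break
        prev = s.sum()
    return loadings @ rotation

rng = np.random.default_rng(2)
activation = rng.normal(size=(2_275, 12))      # voxels x tasks (synthetic)

corr = np.corrcoef(activation, rowvar=False)   # 12 x 12 task-task correlations
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep components that explain more variance than any single standardized task.
n_keep = int((eigvals > 1.0).sum())
loadings = eigvecs[:, :n_keep] * np.sqrt(eigvals[:n_keep])
print(varimax(loadings).round(2))              # rotated task-component loadings
```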
Table 1. PCA and ICA of Activation Levels in 2,275 MD Voxels during the Performance of 12 Cognitive Tasks

                                 PCA              ICA
                              MDr   MDwm       MDr   MDwm
Self-ordered search           0.38  0.69       1.45  3.26
Visuospatial working memory   0.27  0.84       1.24  2.68
Spatial span                  0.17  0.86       0.51  2.23
Digit span                    0.28  0.76       0.76  2.20
Paired associates             0.56  0.62       1.90  1.97
Spatial planning              0.58  0.50       2.43  2.74
Feature match                 0.68  0.49       2.00  0.88
Interlocking polygons         0.74  0.31       2.11  0.61
Verbal reasoning              0.78  0.15       2.62  0.60
Spatial rotation              0.75  0.44       2.86  1.88
Color-word remapping          0.69  0.42       3.07  0.95
Deductive reasoning           0.90  0.18       3.98  0.19

PCA/ICA correlation: MDr r = 0.92; MDwm r = 0.81
Table 2. Task-Component Loadings from the PCA of Internet
Data with Orthogonal Rotation
1 (STM) 2 (Reasoning) 3 (Verbal)
Spatial span 0.69 0.22
Visuospatial working memory 0.69 0.21
Self-ordered search 0.62 0.16 0.16
Paired associates 0.58 0.25
Spatial planning 0.41 0.45
Spatial rotation 0.14 0.66
Feature match 0.15 0.57 0.22
Interlocking polygons 0.54 0.3
Deductive reasoning 0.19 0.52 -0.14
Digit span 0.26 -0.2 0.71
Verbal reasoning 0.33 0.66
Color-word remapping 0.22 0.35 0.51
The Relationship between the Functional Organization
of MD Cortex and Individual Differences in Intelligence:
Permutation Modeling
A critical question is whether the loadings of the tasks on the
MDwm and MDr functional brain networks form a good predictor
of the pattern of cross-task correlations in performance
observed in the general population. That is, does the same set
of cognitive entities underlie the large-scale functional organiza-
tion of the brain and individual differences in performance? It is
important to note that factor analyses typically require many
measures. In the case of the spatial factor analyses reported
above, measures were taken from 2,275 spatially distinct ‘‘vox-
els’’ within MD cortex. In the case of the behavioral analyses,
we used scores from approximately 110,000 participants who logged in to
undertake Internet-optimized variants of the same 12 tasks. Of
these, approximately 60,000 completed all 12 tasks and a post-task question-
naire. After case-wise removal of extreme outliers, null values,
nonsense questionnaire responses, and exclusion of partici-
pants above the age of 70 and below the age of 12, exactly
44,600 data sets, each composed of 12 standardized task
scores, were included in the analysis (see Experimental
Procedures).
The loadings of the tasks on the MDwm and MDr networks
from the ICA were formed into two vectors. These were re-
gressed onto each individual’s set of 12 standardized task
scores with no constant term. When each individual’s MDwm
and MDr beta weights (representing component scores) were
derived in this manner, they centered close to zero, showed no
positive correlation (MDwm mean beta = 0.05 ± 1.78; MDr
mean beta = 0.11 ± 2.92; MDwm-MDr correlation r = -0.20),
and, importantly, accounted for 34.3% of the total variance in
performance scores. For comparison, the first two principal
components of the behavioral data accounted for 36.6% of the
variance. Thus, the model based on the brain imaging data
captured close to the maximum amount of variance that could
be accounted for by the two best-fitting orthogonal linear
components. The average test-retest reliability of the 12 tasks,
collected in an earlier Internet cohort (Table S2), was 68%.
Consequently, the imaging ICA model predicted >50% of the
reliable variance in performance. The statistical significance of
this fit was tested against 1,000 permutations, in which the
MDwm and MDr vectors were randomly rearranged both within
and across vector prior to regression. The original vectors
formed a better fit than the permuted vectors in 100% of cases,
demonstrating that the brain imaging model was a significant
predictor of the performance data relative to models with the
same fine-grained values and the same level of complexity.
Two further sets of permutation tests were carried out in which
one vector was held constant and the other randomly permuted
1,000 times. When the MDwm vector was permuted, the original
vectors formed a better fit in 100% of cases. When the MDr
vector was permuted, the original vectors formed a better fit in
99.3% of cases. Thus, both the MDwm and the MDr vectors
were significant predictors of individual differences in behavioral
performance.
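The permutation logic described above can be sketched with random stand-ins for the network-loading vectors and the behavioral scores; with the real data the observed fit exceeded nearly all permuted fits, whereas the random data below will not show that.

```python
# Sketch of the permutation test: regress two loading vectors onto each
# person's 12 task scores with no intercept, record the overall fit, then
# compare it with fits from randomly permuted loading vectors.
import numpy as np

rng = np.random.default_rng(3)
n_people, n_tasks = 1_000, 12
scores = rng.normal(size=(n_people, n_tasks))   # stand-in standardized task scores
loadings = rng.normal(size=(n_tasks, 2))        # stand-in MDwm / MDr loading vectors

def fit_r2(X, Y):
    """Proportion of variance in Y explained by a no-intercept regression on X."""
    beta, *_ = np.linalg.lstsq(X, Y.T, rcond=None)  # X: tasks x 2, Y.T: tasks x people
    resid = Y.T - X @ beta
    return 1.0 - (resid ** 2).sum() / (Y.T ** 2).sum()

observed = fit_r2(loadings, scores)

# Permute the loading values both within and across the two vectors.
null = np.array([
    fit_r2(rng.permuted(loadings.ravel()).reshape(n_tasks, 2), scores)
    for _ in range(1_000)
])
print(f"observed R2 = {observed:.3f}, p = {(null >= observed).mean():.3f}")
```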
The Relationship between the Functional Organization
of MD Cortex and Individual Differences in Intelligence:
Similarity of Factor Solutions
Exploratory factor analysis was carried out on the behavioral
data using PCA. There were three significant behavioral compo-
nents that each accounted for more variance than was contrib-
uted by any one test (Table S3) and that together accounted
for 45% of the total variance. After orthogonal rotation with the
Varimax algorithm, the first two components showed a marked
similarity to the loadings of the tasks on the MDwm and MDr
networks (Table 2). Thus, the first component (STM) included
all of the tasks in which information was held actively on line in
short-term memory, whereas the second component (reasoning)
included all of the tasks in which information was transformed in
mind according to logical rules. Correlation analyses between
the task to functional brain network loadings and the task to
behavioral component loadings confirmed that the two
approaches generated broadly similar solutions (STM-MDwm
r = 0.79, p < 0.001; reasoning-MDr r = 0.64, p < 0.05).

Figure 2. Localizing the Functional-Anatomical Correlates of the Verbal Component
When task-component loadings for the verbal factor from the behavioral analysis were standardized and used as a predictor of activation within the whole brain, a left lateralized network was rendered, including the left inferior frontal gyrus, and temporal lobe regions bilaterally (p < 0.05 FDR corrected for the whole brain mass).

The third
behavioral component was readily interpretable and easily
comprehensible, accounting for a substantial proportion of the
variance in the three tasks that used verbal stimuli (Table 2),
these being digit span, verbal reasoning, and color-word remap-
ping. A relevant question regards why there was no third network
in the analysis of the MD cortex activation data. One possibility
was that a spatial equivalent of the verbal component did exist
in MD cortex but that it accounted for less variance than was
contributed by any one task in the imaging analysis. Extracting
three-component PCA and ICA solutions from the imaging
data did not generate an equivalent verbal component, a result
that is unsurprising, as a defining characteristic of MD cortex is
its insensitivity to stimulus category (Duncan and Owen, 2000).
A more plausible explanation was that the third behavioral
component had a neural basis in category-sensitive brain
regions outside of MD cortex. In line with this view, the task-
factor loadings from the third behavioral component correlated
closely with those from the additional third component extracted
from the PCA of all active voxels within the brain (r = 0.82,
p < 0.001). In order to identify brain regions that formed a likely
analog of the verbal component, the task-component loadings
were standardized so that they had unit deviation and zero
mean and were used to predict activation unconstrained within
the whole brain mass (see Experimental Procedures). Regions
including the left inferior frontal gyrus and the bilateral temporal
lobes were significantly more active during the performance of
tasks that weighed on the verbal component (Figure 2). This
set of brain regions had little overlap with MD cortex, an obser-
vation that was formalized using t tests on the mean beta weights
from within each of the anatomically distinct MD cortex ROIs.
This liberal approach demonstrated that none of the MD ROIs
were significantly more active for tasks that loaded on the verbal
component (p > 0.05, uncorrected and one tailed).
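The behavioral analysis just described (a PCA of the task scores, orthogonal Varimax rotation, and correlation of the task-to-component loadings with the task-to-network loadings) can be sketched as follows. The arrays are hypothetical stand-ins and the small Varimax routine is a generic implementation, not the authors' software; only the sequence of steps mirrors the text.

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins: a subjects x 12 matrix of task scores and the
# 12-element task loading vectors on the MDwm and MDr imaging networks.
scores = rng.normal(size=(1000, 12))
mdwm_loadings = rng.normal(size=12)
mdr_loadings = rng.normal(size=12)

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    # Generic orthogonal Varimax rotation of a p x k loading matrix.
    p, k = loadings.shape
    R = np.eye(k)
    d_old = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        u, s, vt = np.linalg.svd(
            loadings.T @ (L ** 3 - (gamma / p) * L @ np.diag((L ** 2).sum(axis=0))))
        R = u @ vt
        if s.sum() < d_old * (1 + tol):
            break
        d_old = s.sum()
    return loadings @ R

# PCA of the task correlation matrix; the paper retained three components.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
eigval, eigvec = np.linalg.eigh(np.corrcoef(z, rowvar=False))
order = np.argsort(eigval)[::-1][:3]
loadings = eigvec[:, order] * np.sqrt(eigval[order])
rotated = varimax(loadings)

# Compare behavioral component loadings with the imaging network loadings.
print(np.corrcoef(rotated[:, 0], mdwm_loadings)[0, 1])
print(np.corrcoef(rotated[:, 1], mdr_loadings)[0, 1])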
Determining the Likely Neural Basis of Higher-Order
Components
Based on this evidence, it is reasonable to infer that the
behavioral factors that underlie correlations in an individual’s
performance on tasks of the type typically used to assess intelligence have
a basis in the functioning of multiple brain
networks. This observation allows novel
insights to be derived regarding the likely
basis of higher-order components. More
specifically, in classical intelligence
testing, first-order components gener-
ated by factor analyzing the correlations between task scores
are invariably correlated positively if allowed to rotate into their
optimal oblique orientations. A common approach is to under-
take a second-order factor analysis of the correlations between
the obliquely orientated first-order components. The resultant
second-order component is often denoted as ‘‘g.’’ This
approach is particularly useful when tasks load heavily on
multiple components, as it can simplify the task-to-first-order-component
weightings, making the factor solution more readily
interpretable. A complication for this approach, however, is
that the underlying source of this second-order component is
ambiguous. More specifically, while correlations between
first-order components from the PCA may arise because the
underlying factors are themselves correlated (for example, if
the capacities of the MDwm and MDr networks were influenced
by some diffuse factor like conductance speed or plasticity),
they will also be correlated if there is ‘‘task mixing,’’ that is,
if tasks tend to weigh on multiple independent factors. In
behavioral factor analysis, these accounts are effectively indis-
tinguishable as the components or latent variables cannot be
measured directly. Here, we have an objective measure of the
extent to which the tasks are mixed, as we know, based on the
functional neuroimaging data, the extent to which the tasks
recruit spatially separated functional networks relative to rest.
Consequently, it is possible to subdivide ‘‘g’’ into the proportion
that is predicted by the mixing of tasks on multiple functional
brain networks and the proportion that may be explained by
other diffuse factors (Figure 3).
Two simulated data sets were generated; one based on the
loadings of the tasks on the MDwm and MDr functional networks
(2F) and the other including task activation levels for the verbal
network (3F). Each of the 44,600 simulated ‘‘individuals’’ was
assigned a set of either two (2F) or three (3F) factor scores using
a random Gaussian generator. Thus, the underlying factor
scores represented normally distributed individual differences
and were assumed to be completely independent in the simula-
tions. The 12 task scores were assigned for each individual
by multiplying the task-functional network loadings from the
ICA of the neuroimaging data by the corresponding, randomly generated factor
score and summating the resultant values. The scores were then standardized
for each task, and noise was added as the product of randomly generated
Gaussian noise, the test-retest reliabilities (Table S2), and a noise level
constant. A series of iterative …

Figure 3. Determining Whether Cross-Component Correlations in the Behavioral
Factor Analysis Are Accounted for by the Tasks Recruiting Multiple Independent
Functional Brain Networks
A cognitive task can measure a combination of noise, task-specific components,
and components that are general, contributing to the performance of multiple
tasks. In the current study, there were three first-order components:
reasoning, short-term memory (STM), and verbal processing. In classical
intelligence testing, the first-order components are invariably correlated
positively when allowed to rotate into oblique orientations. A factor analysis
of these correlations may be undertaken to estimate a second-order component,
and this is generally denoted as ''g.'' ''g'' may be generated from distinct
sources: task mixing, the tendency for tasks to corecruit multiple systems,
and diffuse factors that contribute to the capacities of all of those systems.
When simulations were built based on the brain imaging data, the correlations
between the first-order components from the behavioral study were entirely
accounted for by tasks corecruiting multiple functional networks.
Whose IQ Is It?—Assessor Bias Variance in High-Stakes
Psychological Assessment
Paul A. McDermott
University of Pennsylvania
Marley W. Watkins
Baylor University
Anna M. Rhoad
University of Pennsylvania
Assessor bias variance exists for a psychological measure when some appreciable portion of the score
variation that is assumed to reflect examinees’ individual differences (i.e., the relevant phenomena in
most psychological assessments) instead reflects differences among the examiners who perform the
assessment. Ordinary test reliability estimates and standard errors of measurement do not inherently
encompass assessor bias variance. This article reports on the application of multilevel linear modeling to
examine the presence and extent of assessor bias in the administration of the Wechsler Intelligence Scale
for Children—Fourth Edition (WISC–IV) for a sample of 2,783 children evaluated by 448 regional
school psychologists for high-stakes special education classification purposes. It was found that nearly
all WISC–IV scores conveyed significant and nontrivial amounts of variation that had nothing to do with
children’s actual individual differences and that the Full Scale IQ and Verbal Comprehension Index
scores evidenced quite substantial assessor bias. Implications are explored.
Keywords: measurement bias, assessment, assessor variance, WISC–IV
The Wechsler scales are among the most popular and re-
spected intelligence tests worldwide (Groth-Marnat, 2009). The
many scores extracted from a given Wechsler test administra-
tion have purported utility for a multitude of applications. For
example, as pertains to the contemporary version for school-age
children (the Wechsler Intelligence Scale for Children—Fourth
Edition [WISC–IV]; Wechsler, 2003), the publisher recom-
mends that resultant scores be used to (a) assess general intel-
lectual functioning; (b) assess performance in each major do-
main of cognitive ability; (c) discover strengths and weaknesses
in each domain of cognitive ability; (d) interpret clinically
meaningful score patterns associated with diagnostic groups;
(e) interpret the scatter of subtests both diagnostically and
prescriptively; (f) suggest classroom modifications and teacher
accommodations; (g) analyze score profiles from both an inter-
individual and intraindividual perspective; and (h) statistically
contrast and then interpret differences between pairs of com-
ponent scores and between individual scores and subsets of
multiple scores (Prifitera, Saklofske, & Weiss, 2008; Wechsler,
2003; Weiss, Saklofske, Prifitera, & Holdnack, 2006).
The publisher and other writers offer interpretations for the
unique underlying construct meaning (as distinguished from the
actual nominal labels) for every WISC–IV composite score, sub-
score, and many combinations thereof (Flanagan & Kaufman,
2009; Groth-Marnat, 2009; Mascolo, 2009). Moreover, the
Wechsler Full Scale IQ (FSIQ) is routinely used to differentially
classify mental disability (Bergeron, Floyd, & Shands, 2008;
Spruill, Oakland, & Harrison, 2005) and giftedness (McClain &
Pfeiffer, 2012), to discover appreciable discrepancies between
expected and observed school achievement as related to learning
disabilities (Ahearn, 2009; Kozey & Siegel, 2008), and to exclude
ability problems as an etiological alternative in the identification of
noncognitive disorders (emotional disturbance, communication
disabilities, etc.; Kamphaus, Worrell, & Harrison, 2005).
As Kane (2013) has reminded test publishers and users, “the
validity of a proposed interpretation or use depends on how well
the evidence supports the claims being made” and “more-
ambitious claims require more support than less-ambitious claims”
(p. 1). At the most fundamental level, the legitimacy of every claim
is entirely dependent on the accuracy of test scores in reflecting
individual differences. Such accuracy is traditionally assessed
through measures of content sampling error (internal consistency
estimates) and temporal sampling error (test–retest stability esti-
mates; Allen & Yen, 2001; Wasserman & Bracken, 2013). These
estimates are commonplace in test manuals, as incorporated in a
standard error of measurement index. It is sometimes assumed that
such indexes fully represent the major threats to test score inter-
pretation and use, but they do not (Hanna, Bradley, & Holen, 1981;
Oakland, Lee, & Axelrad, 1975; Thorndike & Thorndike-Christ,
2010; Viswanathan, 2005). Tests administered individually by
psychologists or other specialists (in contrast to paper-and-pencil
test administrations) are highly vulnerable to error sources beyond
content and time sampling. For example, substantial portions of
error variance in scores are rooted in the systematic and erratic
errors of those who administer and score the tests (Terman, 1918).
This is referred to as assessor bias (Hoyt & Kerns, 1999; Rauden-
bush & Sadoff, 2008).
Assessor bias is manifest where, for example, a psychologist
will tend to drift from the standardized protocol for test adminis-
tration (altering or ignoring stopping rules or verbal prompts,
mishandling presentation of items and materials, etc.) and erroneously
score test responses (failing to query ambiguous answers,
giving too much or too little credit for performance, erring on time
limits, etc.). Sometimes these errors appear sporadically and are
limited to a given testing session, whereas other errors will tend to
reside more systematically with given psychologists and general-
ize over a more pervasive mode of unconventional, error-bound,
testing practice. Administration and scoring biases, most espe-
cially pervasive types, undermine the purpose of testing. Their
corrupting effects are exponentially more serious when testing
purposes are high stakes, and there is abundant evidence that such
biases will operate to distort major score interpretations, to change
results of clinical trials, and to alter clinical diagnoses and special
education classifications (Allard, Butler, Faust, & Shea, 1995;
Allard & Faust, 2000; Franklin, Stillman, Burpeau, & Sabers,
1982; Mrazik, Janzen, Dombrowski, Barford, & Krawchuk, 2012;
Schafer, De Santi, & Schneider, 2011).
Recently, Waterman, McDermott, Fantuzzo, and Gadsden
(2012) demonstrated research designs to estimate the amount of
systematic assessor bias variance carried by cognitive ability
scores in early childhood. Well-trained assessors applying individ-
ually administered tests were randomly assigned to child examin-
ees, whereafter each assessor tested numerous children. Conven-
tional test-score internal consistency, stability, and generalizability
were first supported (McDermott et al., 2009), and thereafter
hierarchical linear modeling (HLM) was used to partition score
variance into that part conveying children’s actual individual
differences (the relevant target phenomena in any high-stakes
psychological assessment) and that part conveying assessor bias
(also known as assessor variance; Waterman et al., 2012). The
technique was repeated for other high-stakes assessments in
elementary school and on multiple occasions, each application
revealing whether assessor variance was relatively trivial or
substantial.
This article reports on the application of the Waterman et al.
(2012) technique to WISC–IV assessments by regional school
psychologists over a period of years. The sample comprises child
examinees who were actually undergoing assessment for high-
stakes special education classification and related clinical pur-
poses. Whereas the study was designed to investigate the presence
and extent of assessor bias variance, it was not designed to pin-
point the exact causes of that bias. Rather, multilevel procedures
are used to narrow the scope of probable primary causes, and
ancillary empirical analyses and interpretations are used to shed
light on the most likely sources of WISC–IV score bias.
Method
Participants
Two large southwestern public school districts were recruited for
this study by university research personnel, as regulated by Institutional
Review Board (IRB) and respective school district confidentiality and
procedural policies. School District 1 had an enrollment of 32,500
students and included 31 elementary, eight middle, and six high
schools. Ethnic composition for the 2009 –2010 academic year was
67.2% Caucasian, 23.8% Hispanic, 4.0% African American, 3.9%
Asian, and 1.1% Native American. District 2 served 26,000 students
in 2009 –2010, with 16 elementary schools, three kindergarten
through eighth-grade schools, six middle schools, five high schools,
and one alternative school. Caucasian students comprised 83.1% of
enrollments, Hispanic 10.5%, Asian 2.9%, African American 1.7%,
and other ethnic minorities 1.8%.
Eight trained school psychology doctoral students examined ap-
proximately 7,500 student special education files and retrieved perti-
nent information from all special education files spanning the years
2003–2010, during which psychologists had administered the WISC–
IV. Although some special education files contained multiple periodic
WISC–IV assessments, only those data pertaining to the first (or only)
WISC–IV assessment for a given child were applied for this study;
this was used as a measure to enhance comparability of assessment
conditions and to avert sources of within-child temporal variance.
Information was collected for a total of 2,783 children assessed for the
first time via WISC–IV, that information having been provided by
448 psychologists over the study years, with 2,044 assessments col-
lected through District 1 files and 739 District 2 files. The assessments
ranged from one to 86 per psychologist (M = 6.5, SD = 13.2).
Characteristics of the examining psychologists were not available
through school district files, nor was such information necessary for
the statistical separation of WISC–IV score variance attributable to
psychologists versus children.
Sample constituency for the 2,783 first-time assessments included
66.0% male children, 78.3% Caucasian, 13.0% Hispanic, 5.4% Afri-
can American, and 3.3% other less represented ethnic minorities.
Ages ranged from 6 to 16 years (M = 10.3 years, SD = 2.5), where
English was the home language for 95.0% of children (Spanish the
largest exception at 3.8%) and English was the primary language for
96.7% of children (Spanish the largest exception at 2.3%).
Whereas all children were undergoing special education assess-
ment for the first time using the WISC–IV, 15.7% of those children
had undergone prior psychological assessments not involving the
WISC–IV (periodic assessments were obligatory under state policy).
All assessments were deemed as high stakes, with a primary diagnosis
of learning disability rendered for 57.6% of children, emotional dis-
turbance for 11.6%, attention-deficit/hyperactivity disorder for 8.0%,
intellectual disability for 2.6%, 12.1% with other diagnoses, and 8.0%
receiving no diagnosis. Secondary diagnoses included 10.3% of chil-
dren with speech impairments and 3.7% with learning disabilities.
Instrumentation
The WISC–IV features 10 core and five supplemental subtests,
each with an age-blocked population mean of 10 and standard
deviation of 3. The core subtests are used to form four factor
indexes, where the Verbal Comprehension Index (VCI) is based on
the Similarities, Vocabulary, and Comprehension subtests; the
Perceptual Reasoning Index is based on Block Design, Matrix
Reasoning, and Picture Concepts subtests; the Working Memory
Index (WMI) on the Digit Span and Letter–Number Sequencing
subtests; and the Processing Speed Index (PSI) on the Coding and
Symbol Search subtests. The FSIQ is also formed from the 10 core
subtests. The factor indexes and FSIQ each retain an age-blocked
population mean of 100 and standard deviation of 15. The supple-
mental subtests were not included in this study because their
infrequent application precluded requisite statistical power for
multilevel analyses.
Analyses
The eight school psychology doctoral students examined each
special education case file and collected WISC–IV scores, assess-
ment date, child demographics, consequent psychological diagno-
ses, and identity of the examining psychologist. Following IRB
and school district requirements, the identity of participating chil-
dren and psychologists was concealed before data were released to
the researchers. Because test protocols were not accessible, nor
had standardized observations of test sessions been conducted, it
was not possible to determine whether specific scoring errors were
present, nor to associate psychologists with specific error types.
Rather, test score variability was analyzed via multilevel linear
modeling as conducted through SAS PROC MIXED (SAS Insti-
tute, 2011).
As a preliminary step to identify the source(s) of appreciable
score nesting, a three-level unconditional one-way random effects
HLM model was tested for the FSIQ score and each respective
factor index and subtest score, where Level 1 modeled score
variance between children within psychologists, Level 2 modeled
score variance between psychologists within school districts, and
Level 3 modeled variance between school districts. This series of
analyses sought to determine whether sufficient score variation
existed between psychologists and whether this was related to
school district affiliation. A second series of multilevel models
examined the prospect that because all data had been filtered
through a process involving eight different doctoral students, per-
haps score variation was affected by the data collection mechanism
as distinguished from the psychologists who produced the data.
Here, an unconditional cross-classified model was constructed for
FSIQ and each factor index and subtest score, with score variance
dually nested within doctoral student data collectors and examin-
ing psychologists.
Setting aside alternative hypotheses regarding influence of data
collectors and school districts, each IQ measure was examined
through a two-level unconditional HLM model in which Level 1
represented variation between children within examining psychol-
ogists and Level 2 variation between psychologists. The intraclass
correlation was derived from the random coefficient for intercepts
associated with each model and thereafter converted to a percent-
age of score variation between psychologists and between children
within psychologists.
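The models in this article were fitted with SAS PROC MIXED; purely as an illustration of the two-level unconditional model and of how the intraclass correlation translates into the variance percentages reported below, a rough Python analog using statsmodels might look like the following. The file name and column names are hypothetical.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data file: one row per child, with the WISC-IV Full Scale IQ and
# an identifier for the examining psychologist.
df = pd.read_csv("wisc_scores.csv")   # columns assumed: fsiq, psychologist_id

# Two-level unconditional (random-intercept) model: children (Level 1) nested
# within psychologists (Level 2), with no predictors.
result = smf.mixedlm("fsiq ~ 1", data=df, groups=df["psychologist_id"]).fit(reml=True)

# Intraclass correlation: between-psychologist variance over total variance.
# Multiplied by 100 it gives the percentage of score variance attributable to
# psychologists rather than to children.
between = result.cov_re.iloc[0, 0]    # random-intercept (between-psychologist) variance
within = result.scale                 # residual (between-children) variance
icc = between / (between + within)
print(f"{icc * 100:.1f}% of FSIQ score variance lies between psychologists")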
Because psychologists were not assigned randomly to assess
given children (assignment will normally vary as a function of
random events, but also as related to which psychologists may
more often be affiliated with certain child age cohorts, schools,
educational levels, etc.), it seemed reasonable to hypothesize that
such nonrandom assignment would potentially result in some
systematic characterization of those students assessed by given
psychologists. Thus, any systematic patterns of assignments by
child demographics could somehow homogenize IQ score varia-
tion within psychologists. To address this possibility, each two-
level unconditional model was augmented by addition of covari-
ates including child age, sex, ethnicity (minority vs. Caucasian),
child primary language (English as a secondary language vs.
English as a primary language), and their interactions. The binary
covariates were transformed to reflect the percentage of children
manifesting a given demographic characteristic as associated with
each psychologist, and all the covariates were grand-mean recen-
tered to capture (and control) differences between psychologists
(Hofmann & Gavim, 1998). Covariates were added systematically
to the model for each IQ score so as to minimize Akaike’s
information criterion (AIC; as recommended by Burnham & An-
derson, 2004), and only statistically significant effects were per-
mitted to remain in final models (although nonsignificant main
effects were permitted to remain in the presence of their significant
interactions). Whereas final models were tested under restricted
maximum-likelihood estimation, and are so reported, the overall
statistical consequence of the covariate augmentation for each
model was tested through likelihood ratio deviance tests contrast-
ing each respective unconditional and final conditional model
under full maximum-likelihood estimation (per Littell, Milliken,
Stroup, Wolfinger, & Schabenberger, 2006). In essence, the con-
ditional models operated to correct estimates of between-
psychologists variance (obtained through the initial unconditional
models) for the prospect that some of that variance was influenced
by the nonrandom assignment of psychologists to children.
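A companion sketch of the covariate-augmented (conditional) model and the likelihood-ratio deviance test follows, again as a hypothetical statsmodels analog of the SAS analysis rather than the authors' code. The caseload aggregation and grand-mean centering follow the description above, with invented column names (binary 0/1 indicators assumed) and without the interaction terms or AIC-guided covariate selection.

import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

df = pd.read_csv("wisc_scores.csv")   # hypothetical file, as in the sketch above

# Aggregate the child covariates to each psychologist's caseload mean (for the
# binary variables this is the percentage of the caseload) and grand-mean
# center them, so the fixed effects capture differences between psychologists.
caseload = df.groupby("psychologist_id")[["age", "male", "minority", "esl"]].mean()
caseload = caseload - caseload.mean()
df = df.join(caseload, on="psychologist_id", rsuffix="_gmc")

# Unconditional and covariate-augmented (conditional) models, both fitted under
# full maximum likelihood so the deviance comparison is valid.
unconditional = smf.mixedlm("fsiq ~ 1", data=df,
                            groups=df["psychologist_id"]).fit(reml=False)
conditional = smf.mixedlm("fsiq ~ age_gmc + male_gmc + minority_gmc + esl_gmc",
                          data=df, groups=df["psychologist_id"]).fit(reml=False)

# Likelihood-ratio deviance test of the covariate augmentation (4 added fixed
# effects in this simplified sketch).
deviance = 2 * (conditional.llf - unconditional.llf)
print("deviance =", round(deviance, 2), "p =", chi2.sf(deviance, df=4))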
Results
A preliminary unconditional HLM model was applied for FSIQ
and each respective factor index and subtest score, where children
were nested within psychologists and psychologists within school
districts. The coefficient for random intercepts of children nested
within psychologists was statistically significant for almost all
models, but the coefficient for psychologists nested within districts
was nonsignificant for every model. Similarly, a preliminary mul-
tilevel model for each IQ score measured cross-classified children
nested within data collectors as well as psychologists. No model
produced a statistically significant effect for collectors, whereas
most models evinced a significant effect for psychologists. There-
fore, school district and data collection effects were deemed in-
consequential, and subsequent HLM models tested a random in-
tercept for nesting within psychologists only.
For each IQ score, two-level, unconditional and conditional
HLM models were constructed, initially testing the presence of
psychologist assessor variance and thereafter controlling for dif-
ferences in child age, sex, ethnicity, language status, and their
interactions. Table 1 reports the statistical significance of the
assessor variance effect for each IQ score and the estimated
percentage of variance associated exclusively with psychologists
versus children’s individual differences. The last column indicates
the statistical significance of the improvement of the conditional
model (controlling for child demographics) over the unconditional
model for each IQ measure. Where these values are nonsignificant,
understanding is enhanced by interpreting percentages associated
with the unconditional model, and where values are significant,
interpretation is enhanced by percentages associated with the con-
ditional model. Following this logic, percentages preferred for
interpretation are boldfaced.
The conditional models (which control for child demographics)
make a difference for FSIQ, VCI (especially its Similarities sub-
test), WMI, and PSI (especially its Coding subtest) scores. This
suggests at least that the nonrandom assignment of school psy-
chologists to children may result in imbalanced distributions of
children by their age, sex, ethnicity, and language status. This in
itself is not problematic and likely reflects the realities of requisite
quasi-systematic case assignment within school districts. Thus,
psychologists will be assigned partly on the basis of their famil-
iarity with given schools, levels of expertise with age cohorts,
travel convenience, and school district administrative divisions—
all factors that would tend to produce demographic differences
across caseloads. The conditional models accommodate that
prospect. At the same time, it should be recognized that the control
mechanisms in the conditional models are also probably overly
conservative because they will inadvertently control for assessor
bias arising as a function of children’s demographic characteristics
(race, sex, etc.) unrelated to case assignment methods.
Considering the major focus of the study (identification of that
portion of IQ score variation that without mitigation has nothing to
do with children’s actual individual differences), the FSIQ and all
four factor index scores convey significant and nontrivial
(viz. ≥5%) assessor bias. More troubling, bias for FSIQ (12.5%)
and VCI (10.0%) is substantial (≥10%). Within VCI, the Vocab-
ulary subtest (14.3% bias variance) and Comprehension subtest
(10.7% bias variance) are the primary culprits, each conveying
substantial bias. Further problematic, under PSI, the Symbol
Search subtest is laden with substantial bias variance (12.7%).
On the positive side, the Matrix Reasoning subtest involves no
statistically significant bias (2.8%). Additionally, the Coding sub-
test, although retaining a statistically significant amount of asses-
sor variance, essentially yields a trivial (<5%) amount of such
variance (4.4%). (Note that the <5% criterion for deeming hierarchical
cluster variance as practically inconsequential comports
with the convention recommended by Snijders & Bosker, 1999, and
Waterman et al., 2012.)
Table 1
Percentages of Score Variance Associated With Examiner Psychologists Versus
Children's Individual Differences on the Wechsler Intelligence Scale for
Children—Fourth Edition

                                        Unconditional models (a)        Conditional models (b)        Difference between
                                        % variance      % variance      % variance      % variance    unconditional and
IQ score                         N      between         between         between         between       conditional
                                        psychologists   children        psychologists   children      models (p) (c)
Full Scale IQ                  2,722    16.2***         83.8            12.5***         87.5          .0049
Verbal Comprehension Index     2,783    14.0***         86.0            10.0***         90.0          <.0001
  Similarities                 2,551    10.6***         89.4             7.4***         92.6          .0069
  Vocabulary                   2,538    14.3***         85.7            10.4***         89.6          ns
  Comprehension                2,524    10.7***         89.3             9.9***         90.1          ns
Perceptual Reasoning Index     2,783     7.1**          92.9             5.7**          94.3          ns
  Block Design                 2,544     5.3**          94.7             3.8*           96.2          ns
  Matrix Reasoning             2,520     2.8            97.2             2.4            97.6          ns
  Picture Concepts             2,540     5.4*           94.6             4.9*           95.1          ns
Working Memory Index           2,782     9.8***         90.2             8.3***         91.7          .002
  Digit Span                   2,548     7.8***         92.2             7.5***         92.5          ns
  Letter–Number Sequencing     2,486     5.2*           94.8             4.2*           95.8          ns
Processing Speed Index         2,778    12.6***         87.4             7.6***         92.4          <.0001
  Coding                       2,528     9.2***         90.8             4.4*           95.6          <.0001
  Symbol Search                2,521    12.7***         87.3             9.9***         90.1          ns

(a) Entries for percentage of variance between psychologists equal ICC × 100 as
derived in hierarchical linear modeling. Percentages of variance between
children equal (1 − ICC) × 100. Boldface entries are regarded optimal for
interpretation purposes (in contrast to entries under the alternative
conditional model, which do not represent significant improvement). Model
specification is Yij = γ00 + u0j + rij, where i indexes children within
psychologists and j indexes psychologists. Significance tests indicate
statistical significance of the random coefficient for psychologists, where p
values > .01 are considered nonsignificant. ICC = intraclass correlation
coefficient.
(b) Entries for percentage of variance between psychologists equal residual
ICC × 100 as derived in hierarchical linear modeling, incorporating
statistically significant fixed effects for child age, sex, ethnicity, language
status, and their interactions. Percentages of variance between children equal
(1 − residual ICC) × 100. Boldface entries are regarded optimal for
interpretation purposes (in contrast to entries under the alternative
unconditional model). Model specification is Yij = γ00 + γ01MeanAgej +
γ02MeanPercentMalej + γ03MeanPercentMinorityj + γ04MeanPercentESLj +
γ05(MeanAgej)(MeanPercentMalej) + . . . + rij, where i indexes children within
psychologists, j indexes psychologists, and nonsignificant terms are dropped
from models. Significance tests indicate statistical significance of the
residualized random coefficient for psychologists, where p values > .01 are
considered nonsignificant.
(c) Values are based on tests of the deviance between −2 log likelihood
estimates for respective unconditional and conditional models under full
maximum-likelihood estimation. ps > .01 are considered nonsignificant (ns).
* p < .01. ** p < .001. *** p < .0001.

Discussion

The degree of assessor bias variance conveyed by FSIQ and VCI scores
effectively vitiates the usefulness of those measures for differential
diagnosis and classification, particularly in the vicinity of the critical cut
points ordinarily applied for decision making. That is, to the extent that
decisions on mental deficiency and intellectual giftedness will depend on
discovery of FSIQs ≤ 70 or ≥ 130, respectively, or that ability-achievement
discrepancies (whether based on regression modeling or not) will depend on
accurate measurement of the FSIQ, those decisions cannot be
rendered with reasonable confidence because the IQ measures
reflect substantial proportions of score variation emblematic of
differences among examining psychologists rather than among
children. The folly of basing decisions in part or in whole on such
IQ measures is accentuated where the evidence (for intellectual
disability, etc.) is anything but incontrovertible because the FSIQ
score is markedly above or below the cut point or the ability-
achievement discrepancy is so immense as to leave virtually no
doubt that real and substantial disparity exists (see also Franklin et
al., 1982; Gresham, 2009; Lee, Reynolds, & Willson, 2003;
Mrazik et al., 2012; Reynolds & Milam, 2012, on the matter of
high-stakes decisions following IQ test administration and scoring
errors).
This study is limited by virtue of its dependence on a regional
rather than a more representative national sample. Indeed, future
research should explore the broader generalization of assessor bias
effects. From one perspective, it would seem ideal if psychologists
could be randomly assigned to children because that process would
equitably disperse the myriad elements of variance that can neither
be known nor controlled. From another perspective, random as-
signment is probably infeasible because, to the extent that partic-
ipant children and their families and schools are expecting psy-
chological services from those practitioners who have the best
relationships with given schools or school personnel or expertise
with certain levels of child development, the reactivity associated
with random assignment for high-stakes assessments could do
harm or be perceived as doing harm.
Unfortunately, test protocols were inaccessible, and there were
no standardized test session observations. Thus, it was not possible
to …
				    	