Assessment Guide - Psychology
Please read the instructions attached for the guide.
Psychological assessment guides are created by psychology professionals to provide
the public with accurate and authoritative information appropriate for their current needs.
Information available to the public about psychological testing and assessment varies
widely depending on the professional creating it, the purpose of the assessment, and the
intended audience. When professionals effectively educate the public on the how, what,
and why behind assessments and the strengths and limitations of commonly used
instruments, potential clients are in a better position to be informed users of assessment
products and services. The Assessment Guides developed in this course will be
designed to provide the lay public with accurate and culturally relevant information to aid
them in making informed decisions about psychological testing. Students will develop
their Guides with the goal of educating readers to be informed participants in the
assessment process.
There is no required template for the Assessment Guide. Students are encouraged to be creative while maintaining a professional appearance.
While based on scholarly information, the Guide should not read like a research paper. It
is to be written like a brochure a professional might give a patient or client who is being
referred for testing. The Guide must be reader-friendly (sixth- to ninth-grade reading
level) and easy to navigate, and it must include a combination of text, images, and
graphics to engage readers in the information provided. Throughout the Guide, provide
useful examples and definitions as well as questions readers should ask their
practitioners. To ensure accuracy, use only scholarly and peer-reviewed sources for the
information in the development of the Guides.
Begin the Guide with a general overview of assessment, reasons for assessment referrals, and the importance of each individual's role in the process. Within each of the remaining sections, describe the types of assessments readers may encounter, the purposes of each type of assessment, the different skills and abilities the instruments measure, the most valid and reliable uses of the measures, and the limitations of the measures. Include a brief section that describes the assessment process, the types of professionals who conduct the assessments, and what to expect during the assessment meetings.
The Assessment Guide must include the following sections:
Table of Contents (Portrait orientation must be used for the page layout of this section.)
In this one-page section, list the following subsections and categories of assessments:
• Introduction and Overview
• Tests of Intelligence
• Tests of Achievement
• Tests of Ability
• Neuropsychological Testing
• Personality Testing
• Industrial, Occupational, and Career Assessment
• Forensic Assessment
• Special Topics (specify the student's choice from the Special Topics list)
• References
Section 1: Introduction and Overview (Portrait or landscape orientation may be used for
the page layout of this section.)
Begin the Guide with a general overview of assessment. In this two-page section,
students will briefly address the major aspects of the assessment process. Students are
encouraged to develop creative titles for these topics that effectively communicate the
meanings to the intended audience.
• Definition of a Test (e.g., What is a Test?)
• Briefly define psychological assessment.
• Types of Tests
• Identify the major categories of psychological assessment.
• Reliability and Validity
• Briefly define the concepts of reliability and validity as they apply to psychological
assessment.
• Role of testing and assessment in the diagnostic process
• Briefly explain the role of assessment in diagnosis.
• Professionals Who Administer Tests
• Briefly describe the types of professionals involved in various assessment
processes.
• Culture and Testing
• Briefly describe issues of cultural diversity as they apply to psychological assessment.
Categories of Assessment (Portrait or landscape orientation may be used for the page
layout of this section.)
For each of the following, students will create a two-page information sheet or pamphlet
to be included in the Assessment Guide. For each category of assessment, students will
include the required content listed in the PSY640 Content for Testing Pamphlets and
Information Sheets. Be sure to reference the content requirements prior to completing
each of the information sheets on the following categories of assessment.
• Tests of Intelligence
• Tests of Achievement
• Tests of Ability
• Neuropsychological Testing
• Personality Testing
• Industrial, Occupational, and Career Assessment
• Forensic Assessment
• Special Topics (Students will specify which topic they selected for this pamphlet or
information sheet. Additional instructions are noted below.)
Special Topics (Student’s Choice)
In addition to the required seven categories of assessment listed above, students will
develop an eighth information sheet or pamphlet that includes information targeted
either at a specific population or about a specific issue related to psychological
assessment not covered in one of the previous sections. Students may choose from one
of the following categories:
• Testing Adolescents
• Testing Geriatric Patients
• Testing First Generation Immigrants
• Testing in Rural Communities
• Testing English Language Learners
• Testing Individuals Who Are (Select one: Deaf, Blind, Quadriplegic)
• Testing Individuals Who Are Incarcerated
• Testing for Competency to Stand Trial
References (Portrait orientation must be used for the page layout of this section.)
https://content.bridgepointeducation.com/curriculum/file/7aa7d868-2fb0-4b7f-a892-e61ee1bbb9cd/1/PSY640%20Content%20for%20Testing%20Pamphlets%20and%20Information%20Sheets.pdf
Include a separate reference section that is formatted according to APA style 7. The
reference list must consist entirely of scholarly sources. For the purposes of this
assignment, assessment manuals, chapters from graduate-level textbooks, chapters
from professional books, and peer-reviewed journal articles may be used as resource
material. A minimum of 16 unique scholarly sources published within the last 10 years
must be used within the Assessment Guide. The bulleted list of credible professional
and/or educational online resources required for each assessment area will not count
toward these totals.
The Assessment Guide
• Must be 18 pages in length (not including title and reference pages) and formatted
according to APA style 7.
• Must include a separate title page with the following:
◦ Title of guide
◦ Student’s name
◦ Course name and number
◦ Instructor’s name
◦ Date submitted
• Must use at least 16 scholarly sources.
• Must document all sources in APA style 7.
• Must include a separate reference page that is formatted according to APA style 7.
• Must incorporate at least three different methods of presenting information (e.g., text, graphics, images, original cartoons).
Kumar, S., Kartikey, D., & Singh, T. (2021). Intelligence Tests for Different Age Groups and Intellectual Disability: A Brief Overview. Journal of Psychosocial Research, 16(1), 199–209. https://doi-org.proxy-library.ashford.edu/10.32381/JPR.2021.16.01.18 (For the Tests of Intelligence attachment)
Zucchella, C., Federico, A., Martini, A., Tinazzi, M., Bartolo, M., & Tamburin, S.
(2018). Neuropsychological testing. Practical Neurology (BMJ Publishing
Group), 18(3), 227–237. (For the Neuropsychological testing attachment)
Archer, R. P., Wheeler, E. M. A., & Vauter, R. A. (2016). Empirically supported
forensic assessment. Clinical Psychology: Science and Practice, 23(4), 348–
364. https://doi-org.proxy-library.ashford.edu/10.1111/cpsp.12171 (For the
Forensic assessment attachment)
Fisher, D. M., Milane, C. R., Sullivan, S., & Tett, R. P. (2021). A Critical Examination of Content Validity Evidence and Personality Testing for Employee Selection. Public Personnel Management, 50(2), 232–257. https://doi-org.proxy-library.ashford.edu/10.1177/0091026020935582 (For the Personality Testing attachment)
Öz, H., & Özturan, T. (2018). Computer-Based and Paper-Based Testing:
Does the Test Administration Mode Influence the Reliability and Validity of
Achievement Tests? Journal of Language and Linguistic Studies, 14(1), 67–
85. (For the tests of Achievement)
Han, K., Colarelli, S. M., & Weed, N. C. (2019). Methodological and statistical advances in the consideration of cultural diversity in assessment: A critical review of group classification and measurement invariance testing. Psychological Assessment, 31(12), 1481–1496. https://doi-org.proxy-library.ashford.edu/10.1037/pas0000731 (For issues of cultural diversity as they apply to psychological assessment)
PSY640 Content for Testing Pamphlets and Information Sheets
For each category of assessment listed in the assignment, students will create two pages of information.
The intent for the layout is that it be consistent with either a two-page information sheet (front and back),
or a two-sided tri-fold pamphlet that might be found in the office of a mental health professional. The
presentation of the information within each pamphlet or brochure must incorporate at least three different
visual representations of the information (e.g., text, graphics, images, original cartoons).
For each pamphlet or information sheet a minimum of three scholarly sources must be used, at least two
of which must be from peer-reviewed journal articles published within the last 10 years and obtained from
the Ashford University Library. Some sources may be relevant for more than one category of assessment;
therefore, it is acceptable to use relevant sources in more than one category. Remember that the language
for each information sheet should be at the sixth- to ninth-grade reading level to allow a broad audience at
various ages and levels of education to better understand each category of assessment.
For each category of assessment:
• Introduce and offer a brief, easy-to-understand definition for the broad assessment category being
measured. (e.g., What is intelligence?, What is achievement?, What is personality?, What does
“neuropsychological” mean? What does “forensic” mean?)
• Provide a brief overview of the types of tests commonly used within the category of assessment and explain what they measure. Compare the commonly used assessment instruments within the category.
• Describe appropriate and inappropriate uses of tests within the category of assessment. Explain
why some tests are more appropriate for specific populations and purposes and which tests may
be inappropriate. Analyze and describe the challenges related to assessing individuals from
diverse social and cultural backgrounds. Evaluate the ethical interpretation of testing and
assessment data as it relates to the test types within the category. Describe major debates in the
field regarding different assessment approaches within the category. (e.g., Intellectual disabilities,
formerly known as “mental retardation,” cannot be determined by a single test. Thus, an
inappropriate use of an intelligence test would be to use such a test as the sole instrument to
diagnose an intellectual disability.)
• Describe the format in which assessment results may be expected. Evaluate and explain the professional interpretation of testing and assessment data. Analyze the psychometric methodologies typically employed in the validation of types of psychological testing within the category. Include information about the types of scores used to communicate assessment results consistent with the tests being discussed (e.g., scaled scores, percentile rank, grade equivalent, age equivalent, standard age score, confidence interval). (A brief sketch showing how some of these score types relate to one another appears after this list.)
• Explain the common terminology used in assessment in a manner that demystifies the
professional jargon (e.g., In the course of discussing intelligence testing, students would define
concepts such as I.Q., categories of intelligence, and the classification labels used to describe
persons with intellectual disabilities.)
• Include a bulleted list of at least three credible professional and/or educational online resources
where the reader can obtain more information about the various types of testing in order to aid
him or her in the evaluation and interpretation of testing and assessment data. No commercial
websites may be used. Include the name of the organization that authored the web page, the title
of the web page and/or document, and the URL. (These websites will not count toward the 12
scholarly resources required for the assignment.)
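To make the score types mentioned above more concrete, here is a minimal illustrative sketch in Python (not part of the assignment materials). The norm-group mean, standard deviation, and raw score in the example are hypothetical, and real tests derive percentiles from published norm tables rather than from a formula.

from math import erf, sqrt

def standard_score(raw, norm_mean, norm_sd, scale_mean=100.0, scale_sd=15.0):
    # Convert a raw score to a z-score and to a standard score (mean 100, SD 15).
    z = (raw - norm_mean) / norm_sd
    return z, scale_mean + scale_sd * z

def percentile_rank(z):
    # Percentile rank of a z-score under a normal distribution (cumulative probability x 100).
    return 100 * 0.5 * (1 + erf(z / sqrt(2)))

# Hypothetical example: a raw score of 62 in a norm group with mean 50 and SD 10.
z, std = standard_score(62, norm_mean=50, norm_sd=10)
print(round(std), round(percentile_rank(z)))  # 118, roughly the 88th percentile

A confidence interval around such a score is typically built from the test's standard error of measurement, which is why practitioners often report a score range rather than a single number.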
Intelligence Tests for Different Age Groups and Intellectual
Disability: A Brief Overview
Subodh Kumar, Divye Kartikey and Tara Singh
ABSTRACT
From an evolutionary point of view, the one factor that helped humanity thrive and survive against all odds was the ability to use intelligence. Intelligence
is what makes us unique among all the species in the world. The aim of this review
paper was to discuss the role of intelligence tests in measuring intelligence across
different age groups and in diagnosing intellectual disability. The reviewed papers
reveal that intelligence is not a construct that can be measured only in adults; it can
also be measured in newborns. Although IQ tests are used prominently in judging
school performance, job performance, intellectual disability, and overall well-being,
their measurement is affected by emotions, genetics, cultural background, and
environmental factors. To improve the validity and accuracy of intelligence tests, it
is important to take these factors into account.
Keywords: Intelligence, IQ, IQ tests, Intellectual Disability.
INTRODUCTION
Intelligence is the general cognitive ability to use attention and memory for learning
and developing ideas to solve problems. Modern science defines intelligence as the ability to think abstractly and critically, plan strategies, and solve problems.
However, intelligence is a wide concept that also includes the ability to comprehend complex ideas, integrate information, adapt to situations, choose appropriate responses to stimuli, learn from experience, change the environment and one's own behaviour, and overcome obstacles in life. People differ in
intelligence from one another and also in their ability to conduct cognitive tasks, which
can be due to various reasons like genetic, personality or complexity of the task (Colom
et al., 2010).
Studying intelligence is important because it helps us understand the strengths,
weaknesses, and unique abilities of an individual. Since intelligence is treated as a
measurable quantity, currently there are many standardized tests in use which can
measure intelligence with considerable accuracy, consistency and also predict future
performance of individuals of all ages. The measured quantity of intelligence is called the intelligence quotient (IQ), and tests that measure it are called IQ tests. The classic ratio IQ can be calculated with the formula IQ = (mental age / chronological age) × 100.
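As a minimal illustrative sketch (not from the article itself), the ratio formula can be written in a few lines of Python; the ages in the example are hypothetical.

def ratio_iq(mental_age_years, chronological_age_years):
    # Classic ratio IQ: (mental age / chronological age) x 100.
    if chronological_age_years <= 0:
        raise ValueError("chronological age must be positive")
    return (mental_age_years / chronological_age_years) * 100

# Example: a 10-year-old performing at the level typical of a 12-year-old.
print(ratio_iq(12, 10))  # 120.0

Note that most modern tests, such as the Wechsler scales discussed below, no longer use this ratio; they report deviation IQs based on how far a person's score falls from the mean of their age-based norm group.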
Intelligence tests can be age scale or point scale. Age scale intelligence tests are
based on the concept of calculating mental age for diagnosing intellectual disability.
The Seguin Form Board test is an example of an age scale intelligence test. Point scale
intelligence tests are based on the concept of calculating total points scored in the test
to calculate intelligence. Most intelligence tests have both verbal and non-verbal questions. Separate verbal and non-verbal scores are generated, and the combined score takes both into account (APA Dictionary of Psychology, n.d.).
Classification of intelligence is expanding and new ways of analyzing it are coming
forward. According to the traditional “investment” theory, intelligence can be classified
into two main categories: fluid and crystallized. Fluid intelligence is the ability to
use reasoning to solve novel problems, in the absence of any prior specific knowledge.
As we grow older, fluid intelligence tends to decrease, especially from the late twenties onward. It is
also influenced by genetics. Crystallized intelligence is the ability to use previously
learned knowledge to solve problems. As we grow older crystallized intelligence tends
to increase. (Kaufman, 2013).
Intellectual disability: IQ tests are used not only to measure intelligence but also
to diagnose intellectual disability. Intellectual disability is characterized by significant
problems in cognitive functioning and social skills. In terms of IQ, a score below 70, together with an inability to carry out age-appropriate day-to-day tasks, generally indicates intellectual disability (American Psychiatric Association, n.d.; Schalock et al., 2010).
RESEARCH OBJECTIVES
The aim of this study is to discuss the role of intelligence tests in measuring intelligence
of different age groups and diagnosing intellectual disability.
METHODOLOGY
Online databases (NCBI, PUBMED, PSYCINFO, PsycNET, Frontiers in Psychology,
Google Scholar, Research gate) and websites were searched for papers published in
English related to intelligence, intelligence tests and intellectual disability. Twenty-seven articles (1994–2020) were identified and reviewed.
INTELLIGENCE TESTS
APGAR (Appearance, Pulse, Grimace, Activity, and Respiration)
The APGAR test is a rapid test to evaluate the physiological condition of neonates at the time of birth and to determine the level of medical attention needed, for example whether resuscitation is required. The test evaluates a newborn on five parameters: appearance, pulse, grimace, activity, and respiration.
A low score of 0 or 1 on a category is given if the newborn has a pale or blue appearance, a heart rate below 100 beats per minute or no heartbeat, no grimace, cough, or cry on stimulation, little or no muscle activity, and slow, irregular, or absent respiration.
A high score of 2 on a category is given if the newborn has a mostly pink appearance, a heart rate above 100 beats per minute, a grimace, cry, or cough on stimulation, active muscle movement, and healthy, normal breathing.
The final score out of 10 is recorded at 1 minute and 5 minutes after birth. Scores below 6 are considered low, whereas scores of 7 or more are considered good (Simon, 2021).
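As a minimal illustrative sketch (not part of the original article), the scoring rules above can be summarised in a few lines of Python; the function name and the example subscores are hypothetical.

def apgar_total(appearance, pulse, grimace, activity, respiration):
    # Sum the five APGAR subscores (each rated 0, 1, or 2) into a total out of 10.
    scores = (appearance, pulse, grimace, activity, respiration)
    if any(s not in (0, 1, 2) for s in scores):
        raise ValueError("each APGAR category is scored 0, 1, or 2")
    return sum(scores)

# Example: recorded at 1 minute and again at 5 minutes after birth.
total = apgar_total(2, 2, 1, 2, 2)
if total >= 7:
    label = "good"        # 7 or more is considered a good score
elif total < 6:
    label = "low"         # below 6 is considered low
else:
    label = "borderline"  # the article does not classify a score of exactly 6
print(total, label)  # 9 good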
The APGAR score is not only an indication of newborn health at birth but may also predict IQ at later stages of life. A cohort study in the UK found that infants with low APGAR scores at birth had a higher risk of low IQ by the age of 18 (Odd, 2008).
Stanford-Binet Intelligence Test (SB)
The Stanford-Binet intelligence scale is used to measure both intelligence and intellectual disability. Its 5th edition is currently in use and can be administered to people aged 2 to 89. Cognitive abilities such as fluid reasoning, knowledge, quantitative reasoning, visual/spatial reasoning, and working memory are measured in both verbal and non-verbal formats. In total there are 10 tests: for every cognitive ability there is a verbal and a nonverbal test (Marom, 2018).
Wechsler Preschool and Primary Scale of Intelligence (WPPSI)
The fourth edition of the test, the WPPSI-IV, is currently in use. It contains 14 subtests administered to children. For children in the age range of 2 years 6 months to 3 years 11 months, the subtests given include: block design
(constructing model with coloured blocks), information (to answer general questions),
object assembly (to fix pieces of puzzle into a standard arrangement), picture naming
(to name a picture shown), receptive vocabulary (child points out the correct picture
based on the vocabulary spoken aloud).
Children in the age range of 4 years to 7 years 7 months are administered the five subtests above along with subtests such as animal coding, comprehension, matrix reasoning, picture completion, picture concepts, word reasoning, vocabulary, symbol search, and similarities (Slattery, 2015).
Wechsler Intelligence Scale for Children (WISC)
The 5th edition of the WISC, the WISC-V, is currently in use. It measures five indexes: the visual spatial index (the ability of a child to process visual and spatial information such as geometrical figures), the fluid reasoning index, the working memory index, the processing speed index, and the verbal comprehension index. The individual scores from these indexes are combined to form a full scale intelligence quotient (Flanagan et al., 2010).
Kaufman Assessment Battery for Children (K-ABC)
The Kaufman intelligence test was developed in 1983 by Alan S. Kaufman and Nadeen L. Kaufman. The test is based on Luria's neuropsychological theory of sequential and simultaneous cognitive processing and contains four scales: (1) the sequential processing scale, (2) the simultaneous processing scale, (3) the achievement scale, and (4) the mental processing scale. The sequential processing scale measures short-term memory through problem solving related to sequencing or order placement. The simultaneous processing scale measures the ability to solve problems by processing several pieces of information simultaneously. The achievement scale measures the application of learned skills to practical problems, and the mental processing scale measures the ability to solve problems by using both sequential and simultaneous processing (Marom, 2021).
Kaufman Adolescent and Adult Intelligence Test (KAIT)
This test measures fluid intelligence and crystallized intelligence. It can be administered to people from age 11 to 85. The KAIT has a core battery and an extended battery. The core battery has six subtests (about 65 minutes in total), which measure crystallized intelligence (Gc), fluid intelligence (Gf), and composite intelligence. The extended battery has four subtests in addition to those of the core battery and takes 90 minutes to complete (Fahmy, 2021).
Wechsler Adult Intelligence Scale (WAIS)
The 4th edition of the WAIS, the WAIS-IV, is currently in use. In this test four major components of intelligence are measured.
1) Verbal Comprehension Index - subtests include Similarities, Vocabulary, Information, and Comprehension.
2) Perceptual Reasoning Index - subtests include Block Design, Matrix Reasoning, Visual Puzzles, Picture Completion, and Figure Weights.
3) Working Memory Index - subtests include Digit Span, Arithmetic, and Letter-Number Sequencing.
4) Processing Speed Index - subtests include Symbol Search, Coding, and Cancellation.
The scores from all the subtests across the indexes give the full scale IQ score, whereas the scores of the similarities, vocabulary, and information subtests from the verbal comprehension index, together with the block design, matrix reasoning, and visual puzzles subtests from the perceptual reasoning index, give the general ability index (Cherry, 2020).
Woodcock-Johnson Tests of Cognitive Abilities
The Woodcock-Johnson test of cognitive abilities was developed in 1977 by Richard Woodcock and Mary E. Bonner Johnson. Its 4th version is currently in use. It can be administered to anyone from a child of age two to an adult of age 90. The test is based on the Cattell-Horn-Carroll theory of intelligence, which focuses on nine main cognitive abilities: comprehensive knowledge, fluid reasoning, quantitative knowledge, reading and writing ability, short-term memory, long-term storage and retrieval, visual processing, auditory processing, and processing speed (Hamour et al., 2012).
Raven's Progressive Matrices Test
Raven's progressive matrices test was developed by John C. Raven in 1939. It is designed to measure the reasoning ability of individuals in the age range of 5.5 to 11.5 years, adults, and senior citizens. The questions are in the form of matrices in which a pattern has to be identified to determine the missing element of the matrix. The difficulty level increases as the test progresses. There are three types of Raven's progressive matrices tests: the Standard, the Coloured, and the Advanced Raven's progressive matrices. Questions are presented in a black-and-white pattern, and the test is also suitable for adults with high intelligence (Thiel, 2020).
Bhatia Battery of Performance test of intelligence
This test was developed by C.M. Bhatia in 1955. The battery is applicable to the illiterate
as well as the literate groups with separate norms provided for each group. While the
test was originally developed for the 11-16 years age group, the use of this test on
adults beyond 16 years of age is based on the assumption that intelligence does not
increase beyond 16 years of age which was set as the upper limit of the test (Barnabas
& Rao, 1994). The test contains five subtests: Koh’s block design test, Alexander pass
along test, pattern drawing test, immediate memory test and picture construction test
(Roopesh, 2020).
Binet Kamat Test of Intelligence
This test is the Indian adaptation of the Stanford-Binet intelligence test, developed by Kamat for the age group of 3 to 22 years. It is used widely in diagnosing intellectual
disability. This test is administered by increasing the difficulty level of questions until
a child fails to solve the questions. Questions below the child’s chronological age are
also asked until a child solves all of them successfully. The test contains questions
related to vocabulary, language, analogies, social reasoning, similarities and differences,
and visual motor ability (Roopesh, 2020).
Seguin form board test
The Seguin form board test was developed by Seguin in 1856 for the age group 3 years
to 15 years. The test measures parameters such as visual ability, eye and hand
coordination, non-verbal ability, psychomotor ability and spatial ability. There is a
form board which has 10 slots of different shapes and a participant has to place every
block in the right slot within the best possible time. The time needed for the
administration of the test is usually 10 minutes. A participant gets three trials and the
best time among them is noted to calculate the mental age (Koshy, 2017).
DISCUSSION
IQ tests and overall well-being: Studies have pointed out gaps in the interpretation of IQ results. IQ tests focus only on selected mental abilities, which are useful mainly for school admission, college admission, and job roles. Research has found a weak correlation between IQ tests and overall individual well-being. The solution is not to limit the importance of IQ tests but rather to define intelligence in an inclusive way and not to inflate the interpretation of IQ tests (Ganuthula, 2019).
IQ tests and emotions: Many studies have shown that motivation and emotions decide the performance of an individual on an IQ test. The motivation of a test taker is affected by positive and negative emotions. Positive emotions such as curiosity, resilience, and courage boost the performance of the test taker, whereas negative emotions such as fear and test anxiety negatively affect performance on the IQ test (Ganuthula, 2019).
Intellectual disability and IQ tests: A study examined cognitive decline among children suffering from medulloblastoma, a nervous system tumour. The disease is known to cause a decline in information processing speed and visual motor tasks. Many patients answered the test items correctly, but because they were slow in processing information, their IQ score on the Wechsler scale, which factors in processing speed, was low. Hence it is important that a proper neuropsychological assessment is done rather than relying on a single IQ score (Wegenschimmel, 2017).
Colmar et al. (2006) found that there are psychometric issues in IQ tests and that IQ tests have limitations in diagnosing intellectual disability. Most IQ tests are designed for the general population, and tasks in these tests may not be age appropriate for intellectually disabled individuals.
IQ tests and School performance: IQ tests are used in schools for intellectual
assessment of students. However, IQ scores do not explain why a student is lagging in a specific academic task. For that, specific cognitive abilities need to be assessed along with IQ scores to reach the right conclusion (Ortiz & Lella, 2010).
IQ tests and job performance: The ability of IQ tests to predict job performance is often cited as evidence of their usefulness. However, the problem is that job appraisals are often subject to biases, such as the halo effect, and may not correlate with IQ scores (Richardson, 2015).
IQ tests and genetics: Studies on twins have found that IQ is heritable and that genes influence intelligence. Like all other traits, intelligence is inherited, but heritability is never 100%. The extent to which intelligence is inherited depends on the environment: an unfavourable environment will not let genes fully express the qualities related to intelligence. Various cross-sectional and longitudinal studies have shown that the influence of genes on intelligence increases gradually from infancy to adulthood (Plomin, 2015).
CONCLUSION
As the world becomes more knowledge oriented and skill driven, the role of intelligence and its measurement is becoming an important field of study. IQ tests are the most common method of intellectual assessment for all ages. For diagnosing intellectual disability, they are the most objective tool available. Today, intelligence tests play a vital role in determining the future of an individual. Their role in diagnosing intellectual disability is also important for choosing the right therapeutic approaches. Since their first use, IQ tests have gone through many modifications, updates, and adaptations. In the future, more features will be added to IQ tests as knowledge of intelligence
and intellectual disability increases. The papers we reviewed show that intelligence can be measured at all ages, from infancy to old age. Tests like the APGAR can assess a newborn and also predict the future course of intellectual development in later life. Measuring intelligence and intellectual disability is a complicated task, and although IQ tests are comprehensive, they cannot throw light on every aspect of one's intelligence. The roles of genetics, environment, culture, and emotions in intelligence may not be reflected in IQ tests. It is important that the validity of IQ tests be improved so that interpretation and prediction of performance can be made more accurate. The more we know about IQ tests and their limitations, the better they can be made.
Implications
On the basis of our study, we find that the focus should be on filling the gaps in the interpretation of IQ scores so that more realistic predictions of future performance can be made. Gaps in the interpretation of IQ scores can result in errors of judgement. For example, a high score on an IQ test may not necessarily translate into good performance in areas of life that demand high emotional regulation to achieve desirable results. Since IQ scores depend on many factors that can affect cognition during the test, such as motivation, emotional state, and environmental factors like family stress, it is important that these factors be taken into account when interpreting performance on IQ tests. Our study has also found that for diagnosing intellectual disability, it is important to focus on adaptive skills rather than just IQ scores.
Contribution to the Existing Literature
Our review article comprehensively covers important intelligence tests for all age groups, which we did not find in any other paper. Our paper will be a ready reference for counsellors, teachers, and professionals who want a comprehensive overview of intelligence tests for different age groups, starting from the birth of a child.
REFERENCES
American Psychiatric Association (n.d.). What is Intellectual Disability? https://www.psychiatry.org/
patients-families/intellectual-disability/what-is-intellectual-disability
APA Dictionary of Psychology (n.d.). Point scale. https://dictionary.apa.org/point-scale
Barnabas, I. P., and Rao, S. (1994). Comparison of two short forms of the Bhatia’s Test of Intelligence.
NIMHANS Journal, 12(1), 75–77.
Boat, T.F., Wu, J.T. (2015). Mental Disorders and Disabilities Among Low-Income Children: Clinical
Characteristics of Intellectual Disabilities. Washington (DC): National Academies Press (US);
2015 Oct 28. 9. Retrieved April 08, 2021 from https://www.ncbi.nlm.nih.gov/books/NBK332877/
Cherry, K. (2020). The Wechsler Adult Intelligence Scale. Retrieved April 12, 2021 from Verywell
Mind, https://www.verywellmind.com/the-wechsler-adult-intelligence-scale-2795283
Colmar, S., Maxwell, A., and Miller, L. (2006). Assessing Intellectual Disability in Children: Are IQ
Measures Sufficient, or Even Necessary? Australian Journal of Guidance and Counselling, 16(2),
177-188. doi:10.1375/ajgc.16.2.177
Colom, R., Karama, S., Jung, R. E., and Haier, R. J. (2010). Human intelligence and brain networks.
Dialogues in clinical neuroscience, 12(4), 489–501. https://doi.org/10.31887/DCNS.2010.12.4/
rcolom
Fahmy, A. (2021). Kaufman Adolescent and Adult Intelligence Test. The Gale Encyclopedia of Mental
Health. Retrieved April 13, 2021 from Encyclopedia.com: https://www.encyclopedia.com/
medicine/encyclopedias-almanacs-transcripts-and-maps/kaufman-adolescent-and-adult-
intelligence-test
Flanagan, D.P., Alfonso, V.C., Mascolo, J. and Hale, J. (2010). The Wechsler Intelligence Scale for
Children - Fourth Edition in Neuropsychological Practice.
Ganuthula, V., and Sinha, S. (2019). The Looking Glass for Intelligence Quotient Tests: The Interplay
of Motivation, Cognitive Functioning, and Affect. Frontiers in psychology, 10, 2857. https://
doi.org/10.3389/fpsyg.2019.02857
Girimaji S.C., Pradeep AJ. Intellectual disability in international classification of Diseases-11: A
developmental perspective. Indian J Soc Psychiatry 2018;34, Suppl S1:68-74. Accessed from
https://www.indjsp.org/article.asp?issn=0971-9962;year=2018;volume=34;issue=5;spage
=68;epage=74;aulast=Girimaji
Hamour, B., Hmouz, H., Mattar, J., Muhaidat, M. (2012).The Use of Woodcock-Johnson Tests for
Identifying Students with Special Needs-a Comprehensive Literature Review. Procedia - Social
and Behavioral Sciences. https://doi.org/10.1016/j.sbspro.2012.06.714
Kaufman, S.B. (2013). The Heritability of Intelligence: Not What You Think. Beautiful Minds, Scientific American. Retrieved April 08, 2021 from https://blogs.scientificamerican.com/beautiful-minds/the-heritability-of-intelligence-not-what-you-think/
Koshy, B., Thomas, H., Samuel, P., Sarkar, R., Kendall, S., Kang, G. (2017). Seguin Form Board as an
intelligence tool for young children in an Indian urban slum. Family Medicine and Community
Health;5: https://doi.org/10.15212/FMCH.2017.0118
Marom, J.P. (2021). Kaufman Assessment Battery for Children. The Gale Encyclopedia of Mental
Health. Retrieved April 10, 2021 from Encyclopedia.com: https://www.encyclopedia.com/
medicine/encyclopedias-almanacs-transcripts-and-maps/kaufman-assessment-battery-children
Marom, J.P. (2018). Stanford-Binet Intelligence Scale. Gale Encyclopedia of Mental Disorders. Retrieved
April 10, 2021 from Encyclopedia.com: https://www.encyclopedia.com/psychology/
encyclopedias-almanacs-transcripts-and-maps/stanford-binet-intelligence-scale
Odd, D.E., Rasmussen, F., Gunnell, D., and Lewis, G. (2008). A cohort study of low Apgar scores and cognitive outcomes. Archives of Disease in Childhood - Fetal and Neonatal Edition, 93, F115–F120. http://dx.doi.org/10.1136/adc.2007.123745
Ortiz, S. and Lella, S.(2010). Intellectual ability and Assessment: A primer for Parents and Educators.
National Association of School Psychologists. Retrieved April 08, 2021 from https://
apps.nasponline.org/search-results.aspx?q=intellectual+assessment
Plomin, R., Deary, I. (2015). Genetics and intelligence differences: five special findings. Mol Psychiatry,
20, 98–108. https://doi.org/10.1038/mp.2014.105
Richardson, K., & Norgate, S.H. (2015). Does IQ Really Predict Job Performance?. Applied developmental
science, 19(3), 153–169. https://doi.org/10.1080/10888691.2014.983635
Roopesh, B. (2020). Bhatia’s Battery of Performance Tests of Intelligence: A Critical Appraisal. Indian
Journal of Mental Health, 7, 289-306.
Roopesh, B. (2020). Binet Kamat Test of Intelligence: Administration, Scoring and Interpretation -An
In-Depth Appraisal. Indian Journal of Mental Health, 7, 180-201.
Ruhl, C. (2020, July 16). Intelligence: definition, theories and testing. Simply Psychology. Retrieved
April 09, 2021 from https://www.simplypsychology.org/intelligence.html
Schalock, R. L., Borthwick-Duffy, S. A., Bradley, V. J., Buntinx, W. H. E., Coulter, D. I.Craig, E. M. 2010.
Intellectual disability. Definition, classification, and systems of support, Washington, Dc: AAIDD.
Silverman, W., Miezejeski, C., Ryan, R., Zigman, W., Krinsky-McHale, S., and Urv, T. (2010). Stanford-
Binet & WAIS IQ Differences and Their Implications for Adults with Intellectual Disability (aka
Mental Retardation). Intelligence, 38(2), 242–248. https://doi.org/10.1016/j.intell.2009.12.005
Simon, L.V., Hashmi, M.F., Bragg, B.N. (2021). APGAR score. StatPearls Publishing. Retrieved April
09, 2021 from https://www.ncbi.nlm.nih.gov/books/NBK470569/
Slattery, V. (2015). Private school testing: What’s the WPPSI-IV? How do schools use it?. Retrieved
April 12, 2021 from https://blog.lowellschool.org/blog/private-school-testing-what-is-it-how-
do-schools-use-it
Thiel, E. (2020). Raven’s progressive matrices test. Retrieved April 10, 2021 from 123test, https://
www.123test.com/raven-s-progressive-matrices-test/
Wegenschimmel, B., Leiss, U., Veigl, M., Rosenmayr, V., Formann, A., Slavc, I., and Pletschko, T.
(2017). Do we still need IQ-scores? Misleading interpretations of neurocognitive outcome in
pediatric patients with medulloblastoma: a retrospective study. Journal of neuro-oncology, 135(2),
361–369. https://doi.org/10.1007/s11060-017-2582-x
ABOUT THE AUTHORS
Subodh Kumar, Research Scholar – Department of Psychology, Banaras Hindu University, Varanasi, U.P., India.
Divye Kartikey, Student – Discipline of Psychology, IGNOU, New Delhi, India.
Tara Singh, Professor – Department of Psychology, Banaras Hindu University, Varanasi, U.P., India.
Neuropsychological testing
Chiara Zucchella,1 Angela Federico,1,2 Alice Martini,3 Michele Tinazzi,1,2 Michelangelo Bartolo,4 Stefano Tamburin1,2
1 Neurology Unit, Verona University Hospital, Verona, Italy
2 Department of Neurosciences, Biomedicine and Movement Sciences, University of Verona, Verona, Italy
3 School of Psychology, Keele University, Staffordshire, United Kingdom
4 Department of Rehabilitation, Neurorehabilitation Unit, Habilita, Zingonia (BG), Italy
Corresponding author: Stefano Tamburin, MD, PhD, Department of Neurosciences, Biomedicine and Movement Sciences, University of Verona, Piazzale Scuro 10, I-37134 Verona, Italy. Tel.: +39-045-812-4285; +39-347-523-5580; fax: +39-045-802-7276; email: [email protected]
Abstract
Neuropsychological testing is a key diagnostic tool for assessing people with dementia and mild cognitive impairment, but can also help in other neurological conditions such as Parkinson's disease, stroke, multiple sclerosis, traumatic brain injury, and epilepsy. While cognitive screening tests offer gross information, detailed neuropsychological evaluation can provide data on different cognitive domains (visuo-spatial function, memory, attention, executive function, language, praxis) as well as neuropsychiatric and behavioural features. We should regard neuropsychological testing as an extension of the neurological examination applied to higher-order cortical function, since each cognitive domain has an anatomical substrate. Ideally, neurologists should discuss the indications and results of neuropsychological assessment with a clinical neuropsychologist. This paper summarises the rationale, indications, main features, most common tests, and pitfalls in neuropsychological evaluation.
Neuropsychological testing explores cognitive functions to obtain information on the structural and functional integrity of the brain, and to score the severity of cognitive damage and its impact on daily life activities. It is a core diagnostic tool for assessing people with mild cognitive impairment, dementia and Alzheimer's disease,[1] but is also relevant in other neurological diseases such as Parkinson's disease,[2] stroke,[3,4] multiple sclerosis,[5] traumatic brain injury,[6] and epilepsy.[7] Given the relevance and extensive use of neuropsychological testing, it is important that neurologists know when to request a neuropsychological evaluation and how to understand the results. Neurologists and clinical neuropsychologists in tertiary centres often discuss complex cases, but in smaller hospitals and in private practice this may be more difficult. This paper presents information on neuropsychological testing in adult patients, and highlights common pitfalls in its interpretation. A recent paper published in the February 2018 issue of Practical Neurology focused on neuropsychological assessment in epilepsy.[7]
NEUROPSYCHOLOGICAL TESTING AND ITS CLINICAL ROLE
Why is neuropsychological testing important? From early in their training, neurologists are taught to collect information on a patient's symptoms, and to perform a neurological examination to identify clinical signs. They then collate symptoms and signs into a syndrome, to identify a lesion in a specific site of the nervous system, and this guides further investigations. Since cognitive symptoms and signs suggest damage to specific brain areas, comprehensive cognitive assessment should also be part of the neurological examination. Neuropsychological testing may be difficult to perform during office practice or at the bedside, but the data obtained can nevertheless clearly complement the neurological examination.
When is neuropsychological testing indicated and useful? Neuropsychological assessment is indicated when detailed information about cognitive function will aid clinical management:
• to assess the presence or absence of deficits and to delineate their pattern and severity
• to help to establish a diagnosis (e.g., Alzheimer's disease or fronto-temporal dementia) or to distinguish a neurodegenerative condition from a mood disorder (e.g., depression or anxiety)
• to clarify the cognitive effects of a known neurological condition (multiple sclerosis, stroke or brain injury).
Neuropsychological testing may address questions about cognition in helping to guide a (differential) diagnosis, obtain prognostic information, monitor cognitive decline, control the regression of cognitive–behavioural impairment in reversible diseases, guide prescription of a medication, measure the treatment response or adverse effects of a treatment, define a baseline value to plan cognitive rehabilitation, or provide objective data for medico-legal situations (Box 1). When requesting a neuropsychological assessment, neurologists should mention any previous testing, and attach relevant reports, so that the neuropsychologist has all the available relevant information.
Conversely, there are situations when cognitive evaluation should not be routinely recommended, e.g., when the patient is too severely affected, the diagnosis is already clear, testing may cause the patient distress and/or anxiety, the patient has only recently undergone neuropsychological assessment, there is only a low likelihood of an abnormality (though the test may still bring reassurance), or when there are neuropsychiatric symptoms (Table 1). Neuropsychological assessment is time-consuming (1–2 hours) and demanding for the patient, and so neurologists must carefully select subjects for referral.
How is neuropsychological testing done? Neuropsychological evaluation requires a neurologist or a psychologist with documented experience in cognitive evaluation (i.e., a neuropsychologist). The clinician starts with a structured interview, then administers tests and questionnaires (Table 2), and then scores and interprets the results.
• The interview aims to gather information about the medical and psychological history, the severity and the progression of cognitive symptoms, their impact on daily life, the patient's awareness of their problem, and their attitude, mood, spontaneous speech, and behaviour.
• Neuropsychological tests are typically presented as 'pencil and paper' tasks; they are intrinsically performance based, since patients have to prove their cognitive abilities in the presence of the examiner. The tests are standardised, and so the procedures, materials, and scoring are consistent. Therefore, different examiners can use the same methods at different times and places, and still reach the same outcomes.
• The scoring and analysis of the test results allow the clinician to identify any defective functions, and to draw a coherent cognitive picture. The clinician should note any associations and dissociations in the outcomes, and use these to compare with data derived from the interview including observation of the patient, the neuroanatomical evidence, and theoretical models, to identify a precise cognitive syndrome.
What information can neuropsychological testing offer? Neuropsychological assessment provides general and specific information about cognitive performance.
Brief cognitive screening tools, such as the Mini-Mental State Examination (MMSE), the Montreal Cognitive Assessment (MoCA), and the Addenbrooke's Cognitive Examination (ACE-R), provide a quick and easy global, although rough, measure of a person's cognitive function,[8,9] when more comprehensive testing is not practical or available. Table 3 gives the most common cognitive screening tests, along with scales for measuring neuropsychiatric and behavioural problems, and their impact on daily life. This type of screening test may suffice in some cases, e.g., when the score is low and the patient's history strongly suggests dementia, or for staging and following up cognitive impairment with repeated testing. However, neurologists should be aware of the limitations of such cognitive screening tools. Their lack of some subdomains may result in poor sensitivity, e.g., the MMSE may give false negative findings in 'Parkinson's disease-related mild cognitive impairment' because it does not sufficiently explore the executive functions that are the first cognitive subdomains to be involved in Parkinson's disease. The MMSE is particularly feeble in assessing patients with fronto-temporal dementia, many of whom score within the 'normal' range on the test, yet cannot function in social or work situations.[10] Also, young patients with a high level of education may have normal screening tests because these are too easy and poorly sensitive to mild cognitive alterations. Such patients therefore need a thorough assessment.
A comprehensive neuropsychological evaluation explores several cognitive domains (perception, memory, attention, executive function, language, motor and visuo-motor function). The areas and subdomains addressed in the neuropsychological examination and the tests chosen depend upon the referral clinical question, the patient's and caregiver's complaints and symptoms, and the information collected during the interview. Observations made during test administration may guide further exploration of some domains and subdomains. Failure in a single test does not imply the presence of cognitive impairment, since it may have several reasons (e.g., reduced attention in patients with depression). Also, single tests are designed to explore a specific domain or subdomain preferentially, but most of them examine multiple cognitive functions (e.g., clock drawing test, Table 4). For these reasons, neuropsychological assessment is performed as a battery, with more than one test for each cognitive domain.
The main cognitive domains with their anatomical bases are reviewed below; Table 4 summarises the most widely used cognitive tests for each domain. The neuropsychologist chooses the most reliable and valid test according to the clinical question, the neurological condition, the age, and other specific factors.
Parallel forms (alternative versions using similar material) may reduce the learning effect from repeated evaluations. They may help to track cognitive disorders over time, to stage disease severity, and to measure the effect of pharmacological or rehabilitative treatment.
MAIN COGNITIVE DOMAINS AND THEIR ANATOMICAL BASES
Most cognitive functions involve networks of brain areas.[11] Our summary below is not intended as an old-fashioned or phrenological view of cognition, but rather to provide rough clues on where the brain lesion or disease may be.
Perception. This process allows recognition and interpretation of sensory stimuli. Perception is based on the integration of processing from peripheral receptors to cortical areas ('bottom-up'), and a control ('top-down') to modulate and gate afferent information based on previous experiences and expectations. According to a traditional model, visual perception involves a ventral temporo-occipital pathway for object and face recognition, and a dorsal parieto-occipital pathway for perception and movement in space.[12] Acoustic perception involves temporal areas.
Motor control. The classical neurological examination involves evaluation of strength, coordination, and dexterity. Neuropsychological assessment explores other motor features ranging from speed to planning. Visuo-motor ability requires integration of visual perception and motor skills and is usually tested by asking the subject to copy figures or perform an action. Apraxia is a higher-order disorder of voluntary motor control, planning and execution characterised by difficulty in performing tasks or movements when asked, and not due to paralysis, dystonia, dyskinesia, or ataxia. The traditional model divides apraxia into ideomotor (i.e., the patient can explain how to perform an action, but cannot imagine it or perform it when required), and ideational (i.e., the patient cannot conceptualise an action, or complete the correct motor sequence).[13] However, in clinical practice, there is limited practical value in distinguishing ideomotor from ideational apraxia (see the recent review in this journal).[14,15] Apraxia can be explored during routine neurological examination, but neuropsychological assessment may offer a more detailed assessment.
Motor control of goal-orientated voluntary tasks depends on the interplay of limbic and associative cortices, basal ganglia, cerebellum, and motor cortices.
Memory. Memory and learning are closely related. Learning involves acquiring new information, while memory involves retrieving this information for later use. An item to be remembered must first be encoded, then stored, and finally retrieved. There are several types of memory. Sensory memory, the ability briefly to retain impressions of sensory information after the stimulus has ended, is the fastest memory process. It represents an essential step for storing information in short-term memory, which lasts for a few minutes without being placed into permanent memory stores. Working memory allows information to be temporarily stored and managed when performing complex cognitive tasks such as learning and reasoning. Therefore, short-term memory involves only storage of the information, whilst working memory allows actual manipulation of the stored information. Finally, long-term memory, the storage of information over an extended period of time, can be subdivided into implicit memory (unconscious/procedural; e.g., how to drive a car) and explicit memory (intentional recollection; e.g., a pet's name). Within explicit memory, episodic memory refers to past experiences that took place at a specific time and place, and can be accessed by recall or by recognition. Recall implies retrieving previously stored information, even if it is not currently present. Recognition refers to the judgment that a stimulus presented has previously occurred.
The neuroanatomical bases of memory are complex.[16] The initial sensory memory includes the areas of the brain that receive visual (occipital cortex), auditory (temporal cortex), and tactile or kinesthetic (parietal cortex) information. Working memory links to the dorsolateral prefrontal cortex (involved in monitoring information) and the ventrolateral prefrontal cortex (involved in maintaining the information). Long-term memory requires a consolidation of information through a chemical process that allows the formation of neural traces for later retrieval. The hippocampus is responsible for early storage of explicit memory; the information is then transmitted to a larger number of brain areas.
Attention. Attention includes the ability to respond discretely to specific stimuli (focused attention), to maintain concentration over time during continuous and repetitive tasks (sustained attention), to attend selectively to a specific stimulus while filtering out irrelevant information (selective attention), to shift the focus between two or more tasks with different cognitive requirements (alternating attention), and to perform multiple tasks simultaneously (divided attention). Spatial neglect refers to failure to control the spatial orientation of attention, and consequently the inability to respond to stimuli.[17]
The occipital lobe is responsible for visual attention, while visuo-spatial analysis involves both the occipital and parietal lobes. Attention to auditory stimuli requires functioning of the temporal lobes, especially the dominant (usually left) one for speech. Complex features of attention require the anterior cingulate and frontal cortices, the basal ganglia and the thalamus.
Executive functions. Executive functions include complex cognitive skills, such as the ability to inhibit or resist an impulse, to shift from one activity or mental set to another, to solve problems or to regulate emotional responses, to begin a task or activity, to hold information in mind for completing a task, to plan and organise current and future tasks, and to monitor one’s own performance.[18] Taken together, these skills are part of a supervisory or meta-cognitive system to control behaviour that allows us to engage in goal-directed behaviour, prioritise tasks, develop appropriate strategies and solutions, and be cognitively flexible. These executive functions require normal functioning of the frontal lobe, anterior cingulate cortex, basal ganglia, and many inward and outward connections to the cortical and subcortical areas.
Language. Language includes several cognitive abilities that are crucial for understanding and producing spoken and written language, as well as naming. Given its complexity, we usually explore language with batteries of tests that use different tasks to investigate its specific aspects (Table 4). According to the traditional neuroanatomical view, language relies primarily on the dominant hemisphere: comprehension depends on the superior temporal lobe, language production on the frontal regions and fronto-parietal/temporal circuits, and conceptual–semantic processing on a network that includes the middle temporal gyrus, the posterior middle temporal regions and the superior temporal and inferior frontal lobes.[19] However, recent data from stroke patients do not support this model, but instead indicate that language impairments result from disrupted connectivity within the left hemisphere, and within the bilaterally distributed supporting processes, which include auditory processing, visual attention, and motor planning.[11]
Intellectual ability. Regardless of the theoretical model, there is agreement that intellectual ability—or intelligence quotient (IQ)—is a multi-dimensional construct. This construct includes intellectual and adaptive functioning, communication, caring for one’s own person, family life, social and interpersonal skills, community resource use, self-determination, school, work, leisure, and health and safety skills. The Wechsler Adult Intelligence Scale–Revised (WAIS-R) is the best-known intelligence test used to measure adult IQ. The WAIS-R comprises 11 subtests grouped into verbal and performance scales (Table 4). Any mismatch between verbal and performance scores might suggest a different pattern of impairment, i.e., memory and language vs. visuo-spatial and executive.
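The verbal-performance comparison can be pictured with a small, purely illustrative sketch; the scores and the 15-point discrepancy threshold below are hypothetical placeholders rather than cut-offs taken from the WAIS-R manual.

```python
# Purely illustrative sketch of comparing verbal and performance IQ scores.
# The scores and the 15-point discrepancy threshold are hypothetical
# placeholders, not cut-offs taken from the WAIS-R manual.

def interpret_discrepancy(verbal_iq: float, performance_iq: float,
                          threshold: float = 15.0) -> str:
    diff = verbal_iq - performance_iq
    if abs(diff) < threshold:
        return "no notable verbal-performance discrepancy"
    if diff > 0:
        # Verbal markedly higher: relative weakness on the performance side
        # (visuo-spatial/executive tasks).
        return "performance score relatively weaker; consider visuo-spatial/executive domains"
    # Performance markedly higher: relative weakness on the verbal side
    # (memory/language tasks).
    return "verbal score relatively weaker; consider memory/language domains"

print(interpret_discrepancy(verbal_iq=104, performance_iq=82))
```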
COMPARING TO NORMATIVE VALUES
A person’s performance on a cognitive test is interpreted by comparing it to that of a group of healthy individuals with similar demographic characteristics. Thus, the raw score is generally corrected for age, education and sex, and the corrected score rated as normal or abnormal. However, not all neuropsychologists use the same normative values. Furthermore, there are no clear guidelines or criteria for judging the normality of cognitive testing. For example, the diagnostic guidelines for mild cognitive impairment in Parkinson’s disease stipulate a performance on neuropsychological tests that is 1–2 standard deviations (SDs) below appropriate norms, whereas for IQ, a performance that is significantly below average is defined as ≤ 70, i.e., 2 SD below the average score of 100.[2] Sometimes, the neuropsychological outcome is reported as an equivalent score, indicating a level of performance (Figure 1). Understanding how normality is defined—how many SDs below normal values, and the meaning of an equivalent score—is crucial for interpreting neuropsychological results correctly, and for comparing the outcomes of evaluations performed in different clinical settings. Furthermore, estimating the premorbid cognitive level, e.g., using the National Adult Reading Test (Table 3), helps to interpret the patient’s score. ‘Crystallised intelligence’ refers to consolidated abilities that are generally preserved until late age, compared with other abilities such as reasoning, which show earlier decline. In people with low crystallised intelligence—and consequently a low premorbid cognitive level—a low-average neuropsychological assessment score may not represent a significant cognitive decline. Conversely, for people with a high premorbid cognitive level, a low-average score might suggest a significant drop in cognitive functioning.
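To make the comparison concrete, the sketch below converts a raw score into a z-score against demographically matched norms and applies a cut-off. The normative mean, standard deviation, and the -1.5 SD threshold used here are hypothetical placeholders chosen for illustration; real norms and cut-offs vary by test and clinical setting.

```python
# Minimal sketch of how a raw neuropsychological test score is compared with
# demographically matched norms. The normative mean/SD values and the -1.5 SD
# cut-off below are hypothetical placeholders, not published norms.

def z_score(raw_score: float, norm_mean: float, norm_sd: float) -> float:
    """Express a raw score as standard deviations from the normative mean."""
    return (raw_score - norm_mean) / norm_sd

def classify(z: float, cutoff_sd: float = -1.5) -> str:
    """Label a performance relative to a chosen cut-off (clinics differ: -1 to -2 SD)."""
    return "below expected range" if z < cutoff_sd else "within expected range"

# Example: verbal fluency raw score of 28 words, hypothetical norms for a
# 70-year-old with 8 years of education (mean 35, SD 6).
z = z_score(28, norm_mean=35.0, norm_sd=6.0)
print(f"z = {z:.2f} -> {classify(z)}")   # z = -1.17 -> within expected range

# The same logic underlies the IQ convention: mean 100, SD 15, so an IQ of 70
# corresponds to z = -2, i.e., "significantly below average".
print(z_score(70, 100, 15))              # -2.0
```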
REACHING A DIAGNOSIS THROUGH NEUROPSYCHOLOGICAL TESTING
Although the score on a single test is important, it is only the performance on the whole neuropsychological test battery that allows clinicians to identify a person’s patterns of cognitive strengths and weaknesses; together with motor and behavioural abnormalities, these may fit into known diagnostic categories (Tables 5, 6).
The neuropsychologist reports the information collected through the neuropsychological evaluation in a written clinical report that usually includes the scores of each test administered. The conclusions of the neuropsychological report are important to guide further diagnostic workup, to predict functionality and/or recovery, to measure treatment response and to verify correlations with neuroimaging and laboratory findings.
As well as these quantified scores, it is critically important to have the patient’s self-report of functioning, plus qualitative data including observation of how the patient behaved during the test.
Psychiatric confounders require particular attention. Neuropsychologists apply scales for depression (e.g., the Beck Depression Inventory, the Geriatric Depression Scale) or anxiety (e.g., the State–Trait Anxiety Inventory) during testing; these may offer information on how coexisting conditions may influence cognition through changes in mood or motivational state. For example, it may be difficult to distinguish between dementia and depressive pseudo-dementia, because depression and dementia are intimately related.[20] Table 7 shows some of the features that may help. Note that antidepressants may ameliorate cognitive deficits, particularly attention and memory, and that opioids may worsen cognitive symptoms.
Knowing that there are other potential factors that may influence neuropsychological testing (usually worsening performance) should help clinicians to avoid misinterpreting the results (Table 8). For example, in Parkinson’s disease, it is important to pay particular attention to motor fluctuations, neuropsychiatric symptoms, pain, and drug side effects that can worsen cognitive performance.[21] Conversely, patients with long-lasting psychiatric disease, such as bipolar disorder or schizophrenia, are often referred for neurological and cognitive assessment when they begin to perform worse in daily activities. Frontal changes are common in bipolar disorders, so finding prefrontal dysfunction in such patients should not, by itself, lead clinicians to suspect an ongoing neurological disorder. Discussion with the clinical neuropsychologist and the psychiatrist may help to clarify potential drug side effects and, if necessary, to revise treatment.
Key points
• For many neurological diseases, neuropsychological testing offers relevant clinical information that complements the neurological examination.
• Neuropsychological tests can identify patterns of cognitive strengths and weaknesses that are specific to particular diagnostic categories.
• Neuropsychological testing involves tests that investigate different cognitive functions in a standardised way, so the procedures, materials, and scoring are consistent; it also involves an anamnestic interview, scoring and interpreting the results, and comparing these with other clinical data, to build a diagnostic hypothesis.
• Neuropsychological evaluation must be interpreted in the light of coexisting conditions, in particular sensory, motor, and psychiatric disturbances as well as drug side effects, to avoid misinterpreting the results.
Table 1. Conditions in which neuropsychological testing is usually not recommended

Condition: Patient too severely affected
Reason: The assessment would be not or only slightly informative; the cost in terms of burden for the patient (i.e., fatigue, anxiety, feeling of failure) may exceed the benefit of gaining information from the assessment

Condition: Clear diagnosis
Reason: If the diagnosis is clear and neuropsychological testing is required for diagnostic purposes only, it should not be routinely prescribed

Condition: Distress and/or anxiety might be produced
Reason: The diagnosis has already been defined and it is clear that the patient will fail in testing

Condition: Recent (<6 months) neuropsychological assessment
Reason: Significant cognitive decline is unlikely in a short time, unless a neurological event has occurred or the patient is affected by rapidly progressive dementia; short-interval repeated evaluation may be biased by a learning effect, except when parallel versions of tests are used

Condition: The a priori likelihood of an abnormality is low
Reason: Neuropsychological testing should not be routinely performed when clinical history and examination exclude a neurological or cognitive condition; consider prescribing neuropsychological testing if it is the only way to provide reassurance when a healthy individual is concerned about cognitive decline

Condition: Confusion or psychosis
Reason: Neuropsychological assessment is not reliable and could exacerbate confusion and/or abnormal behaviour
Table 2. Structure of the neuropsychological evaluation

Stage: Interview with the patient, relative, or caregiver
Contents: Reason for referral (i.e., what the physician and patient want to know); medical history, including family history; lifestyle and personal history (e.g., employment, education, hobbies); premorbid personality; symptom onset and evolution; previous examinations (e.g., CT scan, MR scan, electroencephalography, positron-emission tomography scan); sensory deficits (loss of vision or hearing)

Stage: Qualitative assessment of cognition, mood and behaviour
Contents: Mood and motivation (i.e., depression, mania, anxiety, apathy); self-control or disinhibition; subjective description and awareness of cognitive disorders, and their impact on the activities of daily life; expectations and beliefs about the disease; verbal (fluency, articulation, semantic content) and non-verbal (eye contact, tone of voice, posture) communication; clothing and personal care; interview with the relative/caregiver to confirm the patient’s information, provide explanations, and acquire information on how the patient behaves in daily life

Stage: Test administration
Contents: Standardised administration of validated tests

Stage: Final report
Contents: Personal …
Empirically Supported Forensic Assessment
Robert P. Archer, Eastern Virginia Medical School
Elizabeth M. A. Wheeler and Rebecca A. Vauter, Central State Hospital
The field of Forensic Psychology has greatly expanded
over the past several decades, including the use of psy-
chological assessment in addressing forensic issues. A
number of surveys have been conducted regarding the
tests used commonly by forensic psychologists. These
surveys show that while tests specifically designed to
address forensic issues have proliferated, traditional
clinical assessment tests continue to play a crucial role
in many forensic evaluations. The current article identi-
fies some of the most salient characteristics of empiri-
cally supported forensic tests and provides examples of
tests felt to meet each of these five criteria. These crite-
ria include adequate standardization, acceptable relia-
bility and validity, general acceptance within the
community of forensic evaluators, availability of test
data from cross-cultural and cross-ethnic samples, and
comparison data relevant to specific forensic popula-
tions. Although the guidelines offered in this article provide a helpful framework for evaluating the usefulness of forensic tests, the establishment of a national review panel or workgroup to address this issue would be highly useful, particularly in the potentially controversial task of identifying those tests that meet reasonable
guidelines to be identified as empirically supported
forensic assessment instruments.
Key words: forensic, forensic assessment, survey,
testing. [Clin Psychol Sci Prac 23: 348–364, 2016]
BRIEF REVIEW OF THE HISTORY AND DEFINITION OF
FORENSIC PSYCHOLOGY
The genesis of what would later be termed Forensic Psy-
chology can be traced back to the early 20th century
when psychologists first became involved in the attempt
to understand the limitations of eyewitness testimony
and, in particular, Hugo Münsterberg’s advocacy for an
increased role for psychologists within the legal system
(Vaccaro & Hogan, 2004). Indeed, the scope of the
psychologist’s role in addressing psycholegal issues both
inside and outside of the courtroom has expanded
greatly as a result of several court rulings. For example,
the landmark 1962 ruling in Jenkins v. United States
helped to facilitate psychologists’ ability to testify inside
the courtroom as expert witnesses, by asserting that psy-
chologists could be qualified to give expert testimony
on issues of mental health. Over the past several dec-
ades, Forensic Psychology has established recognized
training models, board certification requirements, and
an American Psychological Association (APA) division
focused on this area of practice (Division 41, American
Psychology-Law Society).
For many laypeople, the term Forensic Psychology may
invoke dramatic images from CSI (Crime Scene Inves-
tigation). For many psychologists not actively involved
in Forensic Psychology, the idea of testifying in court
may invoke a strong anxiety response. While inaccurate
views of the role of forensic psychologists can easily be
dismissed as a product of the popular media, and the
testimony anxiety of those not actively involved in
Forensic Psychology can be effectively addressed by
training and experience in the courtroom, there
remains a need for a clear and concise definition of the
term Forensic Psychology (Huss, 2009). In this regard, rel-
atively recent guidelines by the APA have helped to
clarify the term and define when psychologists are
working within the scope of Forensic Psychology.
The most recent edition of the APA’s Specialty
Guidelines for Forensic Psychology (APA, 2011) defines
Forensic Psychology as “professional practice by any
psychologist working with any sub-discipline of psy-
chology (e.g., clinical, developmental, social, cognitive)
when applying the scientific, technical, or specialized
knowledge of psychology to the law to assist in
addressing legal, contractual, and administrative mat-
ters” (p. 1). Thus, Forensic Psychology can encompass
not only direct legal issues (e.g., competency and sanity
evaluations) but also administrative/contractual issues
such as fitness for duty and disability evaluations.
The APA Guidelines also helped to define when a
psychologist is serving as a “forensic practitioner” in
the following statement:
Forensic practitioner refers to a psychologist when
engaged in the practice of Forensic Psychology. . . .
Such professional conduct is considered forensic
from the time the practitioner reasonably expects to,
agrees to, or is legally mandated to, provide exper-
tise on an explicitly psycholegal issue. (2011, p. 1)
The APA Guidelines do not rely on individuals’
typical area of practice in order to determine whether
they are providing forensic services, but instead focus
on what they are doing in the specific case. The
Guidelines state that simply being involved in a court-
related matter does not make one a forensic psycholo-
gist (e.g., someone testifying only on his or her treat-
ment of an individual would not be considered
forensic practice). In contrast, a neuropsychologist
retained to assess and subsequently testify regarding a
psycholegal issue (e.g., whether an individual suffered a
brain injury as the result of a car accident for which he
or she was seeking compensation) would be practicing
as a forensic psychologist regardless of her activities in
other areas of her practice (i.e., the rest of her caseload
was not court-involved individuals). Thus, anyone pro-
viding a psycholegal opinion in a legally mandated
matter is providing services as a forensic psychologist,
regardless of the nature of his or her typical practice
area.
DIFFERENCE BETWEEN OBJECTIVES OF CLINICAL
ASSESSMENT AND FORENSIC ASSESSMENT
There are some important differences between assess-
ments undertaken in the forensic context and clinical
assessment. Ackerman (1999), in his book Essentials of
Forensic Psychological Assessment, provided a summary
table that presents some of the salient differences
between clinical and forensic relationships. The key
components from that table will be reviewed here.
The first difference between clinical and forensic
assessments has to do with who is identified as the cli-
ent. In clinical assessment, the individual being assessed
is typically the identified client, whereas in a forensic
setting, an attorney or the court is usually the client of
record. This is a crucial distinction, and one from
which many of the other differences are derived
between the clinical and forensic evaluations.
A second major difference between the two types of
assessment concerns the rules that govern the disclosure
of information. In clinical assessments, patient–therapist
privilege and the Health Insurance Portability and
Accountability Act (HIPAA) provide the guidelines
that cover disclosures. In forensic assessment, the scope
of disclosures may either be mandated by statute (e.g.,
in competency and sanity evaluations, the state or fed-
eral statutes dictate who has access to the report) or
covered under attorney–client privilege (e.g., in immi-
gration cases or privately retained personal injury evalu-
ations in which the psychologist is typically retained by
one side). Overall, it is likely that the report produced
at the conclusion of a forensic evaluation will be more
widely distributed than a clinical assessment report, and
in some cases, the report can become part of a case file
that could be open to the public (e.g., in a sanity case
in a court of record such as a circuit court). Further, a
forensic psychologist may find parts of their report
quoted in case law, or even in the local or national
newspapers, an event much less likely to occur when
practicing clinical assessment.
A third significant difference is the stance the psy-
chologist takes toward the client/examinee in the eval-
uation process. In a clinical assessment, the psychologist
typically assumes an empathetic, supportive, and non-
judgmental stance. In contrast, in forensic assessment,
the evaluator is often advised to present in a more neu-
tral and objective manner. Indeed, presenting oneself as
empathetic or supportive during a forensic evaluation
may be viewed as being potentially misleading and
inappropriate. At the heart of the differences between
clinical and forensic assessment is the adversarial nature
of the legal system. While in clinical assessment, a rela-
tionship with a client would rarely be adversarial, it is
much more frequently the case in forensic assessment
that at least part of the process may be viewed as
adversarial by the examinee.
A fourth area of difference is that forensic psycholo-
gists are often skeptical of the accuracy of the exami-
nee’s self-report. Thus, the forensic psychologist is
more likely to rely on a variety of sources of informa-
tion to confirm the examinee’s self-report, while a
clinical psychologist conducting an assessment is more
likely to rely more heavily on the subject’s self-report.
This difference is related to the level of scrutiny and corroboration applied to the self-report of the examinee. Individuals participating in forensic evaluations
may have a wide variety of reasons to provide inaccu-
rate information during the evaluation. For example,
examinees may over-report their psychological stability
and parenting skills in order to obtain custody of their
children, while other examinees may exaggerate or
malinger mental health problems in order to avoid
criminal responsibility or to obtain monetary gain in
personal injury cases. Further, threats to validity in
forensic assessment may also come from a lack of client
self-awareness rather than a conscious intent to be dis-
honest or malinger. Thus, psychologists in forensic
assessments often rely on collateral sources of informa-
tion, including medical records; police reports; victim
statements; reports from employers, family members, or
friends; and even taped phone calls, e-mails, or text
messages to confirm or disconfirm the self-report infor-
mation provided by the examinee.
Finally, the goals of the clinical and forensic assess-
ments are typically quite different, and may even be in
conflict. Specifically, the goal of clinical assessment is
to assist the client (patient). This is typically done by
answering broader diagnostic questions to assist clients
in understanding more about themselves and to
facilitate treatment planning. In contrast, the goal of
the forensic assessment is to assist the court, or retain-
ing party such as an attorney, in providing opinions
regarding a psycholegal question. Thus, forensic assess-
ment may or may not be helpful to the individual
being assessed, and may not even provide a diagnosis
(e.g., it may not be necessary to provide a diagnosis in
a competency evaluation if it does not impact the
defendant’s abilities in court). Indeed, it is possible that
the outcome of a forensic assessment could be harmful
to an individual’s case (e.g., an individual wants to
plead not guilty by reason of insanity, but the evaluator
does not find that there is evidence to support such a
plea).
BROADNESS OF THE FIELD OF FORENSIC ASSESSMENT
As previously noted, Forensic Psychology encompasses
a large practice area. Huss (2009) reviewed the major
areas of Forensic Psychology and included various
topics, such as risk assessment at the time of sentencing,
insanity (criminal responsibility), competency to stand
trial, sex offender evaluations, juvenile transfers to adult
court, child custody, civil commitment, personal injury
cases, worker’s compensation, and competency to make
medical decisions. We would also add to this list the
additional practice areas of fitness for duty evaluations,
capital sentencing mitigation, immigration evaluations,
jury selection consultation, guardianship/conservator-
ship evaluations, and juvenile evaluations conducted at
the court’s request, although it is important to note
that this is not an all-inclusive list. It is quite clear that
two psychologists may report that they both practice
Forensic Psychology without engaging in similar evalu-
ations (e.g., one might conduct child custody and par-
enting evaluations while another might build a practice
on competency and sanity evaluations). Further, being
competent in one area of forensic practice does not
make a practitioner competent in other forensic prac-
tice areas.
Most practice areas in Forensic Psychology fall
under two broad categories: civil and criminal. Civil
areas of practice typically involve the relationship
between members of the community, and the goal is
not to punish a wrongdoer but to prevent or compen-
sate for a wrong. Examples of civil cases would be
custody evaluations and personal injury
evaluations. In contrast, criminal practice typically
involves cases in which it is alleged that criminal laws
have been broken, and the goal is to appropriately
punish the wrongdoer. Examples of Forensic Psychol-
ogy involvement in criminal law would include com-
petency to stand trial evaluations and sanity evaluations,
as well as capital sentencing mitigation.
APA GUIDELINES (DIVISION 41 GUIDELINES)
Because Forensic Psychology encompasses such a vast
array of practice areas, there are a variety of specific
pathways through which psychologists can become
competent and proficient in their specialty area. The
Specialty Guidelines for Forensic Psychology (APA, 2011)
provide some general guidance regarding the relevant
issue in conducting forensic assessments. The following
is a brief summary of these guidelines:
• Forensic psychologists seek to focus on the legally
relevant factors (i.e., they understand and are guided
by relevant legal statutes and case law).
• Forensic practitioners seek the appropriate use of
assessment procedures (i.e., they understand the
strengths and limitations of the tests they select as
applied to the relevant forensic issues/populations).
These guidelines state that forensic practitioners also
seek to consider the strengths and limitations of
employing traditional assessment procedures in
forensic examinations.
• Given the stakes involved in forensic contexts,
forensic practitioners recognize the need to take spe-
cial care to ensure the integrity and security of test
materials and results.
The guidelines also state that when test score valid-
ity has not been firmly established in a forensic con-
text, the practitioner should seek to describe test score
strengths and limitations and to explain these issues in
the forensic context. This explanation may include the
observation that the context in which the assessments
have been given may warrant adjustments in test score
interpretation.
The guidelines also suggest that forensic psycholo-
gists should
• take into account individual examinee differences,
including “situational, personal, linguistic, and
cultural differences that might affect their judge-
ments or reduce the accuracy of their interpreta-
tions” (APA, 2011, p.15);
• take reasonable steps to explain, in an understandable
manner, the results of the test to either the examinee
or his or her representative. If this feedback is not pos-
sible, this restriction should be explained in advance;
• document all the data sources they considered as part
of the evaluation. Further, this documentation
should be made available based on proper subpoenas
or legal consent; and
• maintain careful and detailed records of the evalua-
tion process.
IMPACT OF THE DAUBERT SUPREME COURT DECISION
Psychologists’ involvement in court-related matters has
long been governed, at least in part, by rules regarding
the admissibility of evidence. For over 50 years, this
involvement in federal courts and in many state courts
was governed primarily by the Supreme Court stan-
dard, which was developed out of the Frye v. United
States (1923) ruling. This ruling established that evi-
dence could be admissible if it were based on a tech-
nique or method generally accepted in the field. Thus,
for example, psychologists’ testimony may be found to
be admissible if based on a practice, test, or technique
generally accepted by other psychologists within that
field. For obvious reasons, however, determining what
method is “generally accepted” may be difficult when
applied to a specific test used within a specific context.
In 1993, the U.S. Supreme Court modified the
admissibility standard in its ruling in the case of Daubert
v. Merrell Dow Pharmaceuticals. The Daubert case
involved the admissibility of testimony based on tech-
nical or scientific data. The Supreme Court heard the
case because it felt that there were differences regarding
how lower courts had been determining the proper
standard for admitting expert testimony. The Court
noted that the Frye standard had been superseded by
the Federal Rules of Evidence. Indeed, the Federal
Rules of Evidence have one section for the “ordinary
witness” (701) and one for expert witnesses (702). The
Court noted that the additional requirements in 702
were not surprising given that “an expert is permitted
wide latitude in offering opinions, including those that
are not based on firsthand knowledge or observation.”
They observed that this relaxation of the requirement
for firsthand knowledge, a central part of common law,
is “premised on an assumption that the expert’s opin-
ion will have a reliable basis in the knowledge and
experience of his discipline.” Thus, the Court then
outlined the following guidelines for Rule 702 expert
witness testimony:
. . .If scientific, technical, or other specialized knowl-
edge will assist the trier of fact to understand the
evidence or to determine a fact in issue, a witness
qualified as an expert by knowledge, skill, experi-
ence, training, or education, may testify thereto in
the form of an opinion or otherwise.
The Court also noted that nothing in this rule
specifically required “general acceptance” for admissi-
bility. Therefore, the Court stated, “That austere stan-
dard [Frye], absent from and incompatible with the
Federal Rules of Evidence, should not be applied in
federal trials.” The Court went on to outline the limits
which were put into place by the Rules of Evidence,
including that “any and all scientific testimony or evi-
dence admitted is not only relevant, but reliable.”
The justices noted that many factors could benefi-
cially influence a judge’s decision-making process, and
therefore, they were not attempting to create a rigidly
defined checklist or test, but they did provide the fol-
lowing considerations in evaluating admissibility issues:
• Is the evidence/opinion based on a technique or
method that has been tested and has established stan-
dards controlling its use and operation?
• Has the theory or technique undergone peer review
and publication?
• Does the technique have a known and established
error rate?
• Finally, “the general acceptance of a technique” still
has a bearing on admissibility of testing, but no
longer serves as the sole or exclusive criterion.
Functionally, the Court developed a flexible stan-
dard to determine admissibility of expert testimony and
opinions. Two crucial Supreme Court cases followed
Daubert. In the 1997 ruling in General Electric Company
v. Joiner and the 1999 ruling in Kumho Tire Company v.
Carmichael, the Supreme Court clarified aspects of the
Daubert ruling regarding the admission of expert wit-
ness and testimony. Generally, criteria for admissibility
under this series of cases support the view that testi-
mony based on psychometrically reliable and valid
instruments, with test conclusions based on empirical
support, is more likely to meet admissibility guidelines
in federal courts and in the numerous states that follow
Daubert criteria.
Goodman-Delahunty (1997) argued that the effects
of Daubert have been to cause forensic psychologists
“to be more explicit about the scientific foundations of
their opinions” (p. 121), and Underwager and Wake-
field (1993) discussed the influence of the Daubert deci-
sion on psychological testimony in an article entitled
“A Paradigm Shift for Expert Witnesses.” A growing
number of articles have reviewed specific psychological
tests such as the MMPI-2-RF (Ben-Porath, 2012; Sell-
bom, 2012) and MCMI (Rogers, Salekin, & Sewell,
1999) in terms of their ability to meet Daubert criteria,
and there seems little doubt that any test used to form
the basis of an expert’s opinion in court is potentially
subject to review based on Daubert factors.
Many states have adopted the Daubert standard, but
several continue to use Frye or other admissibility crite-
ria. It is incumbent on forensic practitioners to know
which standard their state uses and how it might
impact admissibility of psychological evidence. For pur-
poses of the current article, however, it is most impor-
tant to note that various psychological tests, or
components of tests, have been submitted to a “Daubert
Challenge” or “Daubert Motion,” that is, a hearing
conducted before a judge where validity and admissi-
bility of expert testimony (in this case based on a speci-
fic test or tests) are challenged by opposing counsel as
failing to meet Daubert standards of methodology, relia-
bility, validity, and general acceptance.
WHAT DO WE KNOW ABOUT THE “GENERAL ACCEPTANCE”
OF TESTS USED IN FORENSIC EVALUATIONS?
A growing body of research literature has addressed the
issue of the types of assessment instruments typically
used in a variety of forensic settings. This research is an
important aspect of determining those psychological
tests that are generally accepted by practitioners to
address varied forensic issues, in turn contributing to
identifying tests that are generally accepted within the
field of psychology. We shall review some of the more
prominent of these studies that have focused on the
tests used across a variety of forensic settings, and then
briefly review some of the surveys of assessment instru-
ments used for specific forensic purposes, such as cus-
tody evaluations, violence risk assessment, and criminal
forensic evaluations.
SURVEYS OF TEST USAGE IN GENERAL FORENSIC SETTINGS
Boccaccini and Brodsky (1999) noted that psychologists
have become increasingly involved in forensic assess-
ments. The authors conducted a survey among 80
members of APA Divisions 12 (Clinical) and 40 (Neu-
ropsychology) to evaluate what instruments these prac-
titioners used in emotional injury assessments and to
see whether practitioners used these tests in a manner
consistent with the Daubert (1993) Supreme Court rul-
ing on the admissibility of expert witness testimony.
The 80 psychologists who completed surveys in this
study had conducted over 10,500 emotional injury
evaluations during their career, and listed a total of 67
different assessment instruments they had employed in
emotional injury evaluations during the past year.
However, only 11 of these tests were used by five or
more survey respondents, and no practitioners used
exactly the same combination of tests in their standard
battery for injury evaluations. The five most frequently
employed tests reported in this study included the
Minnesota Multiphasic Personality Inventory (MMPI)
or MMPI-2 (94%), the Wechsler Adult Intelligence Scale–Revised (WAIS-R) or WAIS-III (54%), the Millon Clinical Multiaxial Inventory-II (MCMI-II) or MCMI-III (Millon, 1994; 50%), the Rorschach Inkblot Technique (41%), and the Beck Depression Inventory (31%). The authors noted that respondents indicated
that Daubert-related criteria such as general acceptance
of the test within the field and the presence of inde-
pendent research validation played an important role in
their selection of instruments. However, Boccaccini
and Brodsky (1999) reported that respondents also cited
test selection factors unrelated to the Daubert standard,
such as their personal clinical experience with a partic-
ular test, as popular reasons for test selection. Thus, in
response to the question concerning whether psycholo-
gists selected tests in forensic evaluations based on cri-
teria outlined in the Daubert decision, the authors
concluded, “Our findings indicate that the answer is
yes and no” (p. 257).
Since the survey by Boccaccini and Brodsky (1999),
several other surveys have been conducted on test
usage by psychologists in forensic settings. Table 1 pro-
vides a summary of the results from these studies.
Lally (2003), for example, surveyed 64 diplomates in
Forensic Psychology concerning the frequency with
which they used various tests, and their opinions con-
cerning the acceptability of a wide variety of specific
psychological tests, in six areas of forensic practice,
including mental state at the time of an offense, risk for
future violence, risk for future sexual violence,
Table 1. Result summary of test usage in general forensic settings

Study: Boccaccini and Brodsky (1999)
Sample: 80 APA Division 12 or Division 40 members
Most frequently used/recommended tests: MMPI/MMPI-2 (94%); WAIS-R/WAIS-III (54%); MCMI-II/MCMI-III (50%); Rorschach (41%); Beck Depression Inventory (31%)

Study: Lally (2003)
Sample: 64 Forensic Psychology diplomates
Most frequently used/recommended tests: MMPI-2; WAIS-III; PCL-R; Luria-Nebraska; Halstead-Reitan

Study: Archer et al. (2006)
Sample: 152 APA Division 41 members or AAFP diplomates
Most frequently used/recommended tests: Cognitive assessment: Wechsler Intelligence Scales; single-scale tests: Beck Depression Inventory, Beck Anxiety Inventory; multiscale tests: MMPI-2; children/adolescents: MMPI-A; projective tests: Rorschach

Study: Viljoen et al. (2010)
Sample: 215 psychologists who perform forensic evaluations of adults or children
Most frequently used/recommended tests: Wechsler Intelligence Scales (75.3%); MMPI-2/MMPI-A (66%); Structured Assessment of Violence Risk in Youth (SAVRY; Borum, Bartel, & Forth, 2003) (35.1%); MCMI-III or MACI (31.2%); PCL-R or PCL:YV (24.7%)

Study: Austin and Wygant (2012)
Sample: 284 psychologists conducting forensic evaluations
Most frequently used/recommended tests: MMPI-2 (59%); PAI (38%); MMPI-2-RF (29%); MCMI-III (26%); Rorschach (22%)
competency to stand trial, competency to waive Mir-
anda rights, and evaluation of …
Available online at www.jlls.org
JOURNAL OF LANGUAGE AND LINGUISTIC STUDIES
ISSN: 1305-578X
Journal of Language and Linguistic Studies, 14(1), 67-85; 2018
Computer-based and paper-based testing: Does the test administration mode
influence the reliability and validity of achievement tests?
Hüseyin Öz a*, Tuba Özturan b
a Department of Foreign Language Education, Hacettepe University, Ankara 06800, Turkey
b School of Foreign Languages, Erzincan University, Erzincan 24000, Turkey
APA Citation:
Öz, H., & Özturan, T. (2018). Computer-based and paper-based testing: Does the test administration mode influence the reliability and
validity of achievement tests? Journal of Language and Linguistic Studies, 14(1), 67-85.
Submission Date: 23/11/2017
Acceptance Date: 06/03/2018
Abstract
This article reports the findings of a study that sought to investigate whether computer-based vs. paper-based
test-delivery mode has an impact on the reliability and validity of an achievement test for a pedagogical content
knowledge course in an English teacher education program. A total of 97 university students enrolled in the
English as a foreign language (EFL) teacher education program were randomly assigned to the experimental
group that took the computer-based achievement test online and the control group that took the same test in
paper-and-pencil based format. Results of Spearman Rank order and Mann-Whitney U tests indicated that test-
delivery mode did not have any impact on the reliability and validity of the tests administered in either way.
Findings also demonstrated that there was not any significant difference in test scores between participants who
took the computer-based test and those who took the paper-based test. Findings were discussed in terms of the
idea that computer technology could be integrated into the curriculum not only for instructional practices but
also for assessment purposes.
© 2018 JLLS and the Authors - Published by JLLS.
Keywords: Computer-based testing; paper-based testing; reliability; validity; English teacher education
1. Introduction
With the introduction of the digital revolution, educators have begun to benefit from modern
computer technology to carry out accurate and efficient assessment of learning outcomes both in
primary/secondary and higher education. In recent years, Turkish institutions of higher education have
also started integrating e-learning and assessment initiatives into their undergraduate programs. It is
assumed that Turkish educational institutions will gradually move components of their assessment
systems to online delivery or computerized mode. There are several reasons for implementing
computerized assessments in education. We can reduce the “lag time” in reporting scores, increase the
efficiency of assessment, achieve the flexibility in terms of time and place, give immediate feedback
and announce students’ scores immediately, analyze student performance that cannot be investigated
from paper-based tests by implementing individualized assessments customized to student needs and
minimize paper consumption and cost as well as the need to duplicate or mail test materials (Alderson, 2000;
Bennett, 2003; Noyes & Garland, 2008; Paek, 2005; Roever, 2001). This paper reports on findings of
a study that investigated whether computer-based and paper-based tests as test delivery modes would
influence the reliability and validity of the achievement test for a pedagogical content knowledge
course in an English as a foreign language (EFL) teacher education program.
1.1. Reliability and validity criteria of tests
Defining the aims of a test and choosing the most suitable test type should be done before administering it. However, these steps alone are not enough to produce an effective test. Educators first have to consider some specific principles, foremost among them validity and reliability. As the most essential criterion for the quality of any assessment, validity concerns the relation between the aim and the form of the assessment and refers to whether a test truly measures what we claim it to measure. In other words, a test is valid when it measures what it is supposed to measure (Fulcher & Davidson, 2007; Stobart, 2012). Since validity is such a crucial criterion, the following question arises: how can instructors create valid tests or increase the validity of their tests? Several recommendations are documented in the academic literature. Firstly, direct testing should be done whenever feasible, and explanations should be made clear. Secondly, scoring should relate directly to the targets of the test. Lastly, reliability has to be satisfied; otherwise, validity cannot be assured (Hughes, 2003).
Reliability, on the other hand, is the degree to which a test measures a skill and/or knowledge consistently (Scheerens, Glas, & Thomas, 2005, p. 93). Similar scores are therefore expected on a reliable test when the same exam is administered on two different days or in two different but parallel forms. Brown and Abeywickrama (2010) and Hughes (2003) both emphasize that the interval between the two administrations should be neither too long, as students might learn new things, nor too short, as students might still remember the exam questions. Once a test is reliable, test-takers will obtain more or less the same score regardless of when the test is administered, and teachers have to prepare and administer reliable tests so as to obtain similar results from the same students at different times (Hughes, 2003, p. 36). Reliability estimates also indicate the extent to which measurement-related factors may influence test scores. These factors can be grouped into the following categories: test factors, which refer to the clarity of instructions, the items, the paper layout, and the length of the test; situational factors, which refer to the conditions of the room; and individual factors, which cover the physical and psychological state of the test-taker. All these factors should be considered when interpreting the reliability of any test scores.
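As a concrete illustration of these two points, the sketch below uses invented scores: consistency across two administrations of the same test is summarized with a Spearman rank-order correlation, and a possible mode effect between two groups is checked with a Mann-Whitney U test (the same procedures reported later in this article). It assumes scipy is available and is only a sketch of the logic, not the study's actual analysis.

```python
# Illustrative sketch (invented scores): quantifying reliability as consistency
# between two administrations of the same test, and checking whether two
# delivery modes produce systematically different scores.
from scipy.stats import spearmanr, mannwhitneyu

# Same ten students tested twice with the same instrument (test-retest design).
first_administration  = [72, 65, 88, 90, 54, 77, 81, 60, 69, 95]
second_administration = [70, 68, 85, 92, 50, 75, 84, 63, 66, 93]

rho, p_rho = spearmanr(first_administration, second_administration)
print(f"test-retest consistency: Spearman rho = {rho:.2f} (p = {p_rho:.3f})")

# Two independent groups taking the same achievement test in different modes.
computer_based = [74, 68, 81, 90, 59, 77, 85, 62]
paper_based    = [71, 70, 79, 88, 61, 75, 83, 64]

u, p_u = mannwhitneyu(computer_based, paper_based, alternative="two-sided")
print(f"mode effect check: Mann-Whitney U = {u:.1f} (p = {p_u:.3f})")
# A high rho suggests consistent (reliable) measurement; a non-significant U
# suggests the delivery mode did not shift scores.
```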
1.2. Computer-based testing alternatives
Computers are undoubtedly part of our daily lives and play an active role in many different walks of life. This change in the role of computer applications goes back to the late 1970s. Since then, computers have had a vital place in the world, especially for educational purposes. In addition to the widespread use of the web and computers as teaching resources both inside and outside the class (especially for distance education), computers have come to offer testing alternatives for teachers as well. Today, it is estimated that nearly 1000 computer-assisted assessments are carried out each day in the UK (Lilley, Barker, & Britton, 2004). These assessment models do not refer only to traditional tests administered on computers in class under the supervision of proctors; there are several alternatives, known as computer-based testing (CBT), web-based testing (WBT) and computer-adaptive testing (CAT). These are briefly introduced below.
Computer-based testing roughly refers to making use of computers while preparing questions, administering exams and scoring them (Chapelle, 2001). Since computers began to be used as testing devices in the 1980s, a different point of view has been gained: a more authentic, cost-saving, scrutinized and controlled testing environment can be achieved compared with the traditional paper-and-pencil one (Jeong, 2014; Linden, 2002; Parshall, Spray, Kalohn, & Davey, 2002; Wang, 2010; Ward, 2002). Computer-based testing, which started in the late 1970s or early 1980s, was always thought of as an alternative to paper-based testing (Folk & Smith, 2002), because a “one size fits all” solution across testing programs was not desired at all (Ward, 2002, p. 37).
Computers have brought many advantages. First of all, they have the potential to offer realistic test
items like media, graphics, pictures, video and sound (Chapelle & Douglas, 2006, p. 9; Linden, 2002,
p. 9). Therefore, students can be involved in a realistic testing environment with many integrated activities: they can respond to the computer orally, draw on the screen while answering a question, see and interpret graphics or tables for an open-ended question, and so on, and test-takers with disabilities can take the exams on a computer with great ease. CBT also supplies immediate feedback and scoring (Chapelle & Douglas, 2006; Parshall et al., 2002), which has a significant impact on pedagogy (test-takers can grasp their mistakes when immediate feedback is offered upon completion of the test) and eases teachers’ workload of scoring all papers. Teachers may spend much time scoring exam papers and generally cannot give enough feedback about each student’s mistakes; even if they provide feedback, it may come so late that students no longer remember the questions or their answers. Another issue that should be mentioned here is that, especially for open-ended questions, subjective scoring may come into play; thanks to computer technology, objective scoring can be achieved, and problems caused by handwriting disappear, too. A last important feature of CBT, or computer-assisted assessment (henceforth CAA), is that examiners can collect data about the exam, such as how many questions have been answered correctly, how many have been omitted, and how many minutes have been spent on each question, which is called response latency (Parshall et al., 2002, p. 2).
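The item-level records mentioned above (correct, omitted, and response-latency data) can be pictured with a small, hypothetical sketch; the field names and values are illustrative and not taken from any particular CBT system.

```python
# Hypothetical sketch of the item-level data a computer-based test can log:
# per-item correctness, omissions, and time spent (response latency).
from dataclasses import dataclass

@dataclass
class ItemRecord:
    item_id: str
    answered: bool        # False if the item was omitted
    correct: bool
    latency_seconds: float

responses = [
    ItemRecord("Q1", True,  True,  42.0),
    ItemRecord("Q2", True,  False, 95.5),
    ItemRecord("Q3", False, False, 12.0),   # omitted
]

n_correct = sum(r.correct for r in responses)
n_omitted = sum(not r.answered for r in responses)
mean_latency = sum(r.latency_seconds for r in responses) / len(responses)
print(f"correct: {n_correct}, omitted: {n_omitted}, mean latency: {mean_latency:.1f}s")
```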
Since computers began to be used as testing tools, many different computer-based test delivery modes have come onto the scene: computer-adaptive testing (CAT), linear-on-the-fly testing (LOFT) or computerized fixed tests (CFT), computerized mastery testing (CMT) (Ward, 2002, p. 38), and automated test assembly (ATA) (Parshall et al., 2002, p. 11). CAT is entirely performance-based and individualized: the more questions a candidate answers correctly, the more challenging the questions that appear on the screen, and vice versa. By contrast, LOFT or CFT has a fixed time and test length for all test-takers; exam security is the main goal in LOFT, rather than the psychometric aims pursued in CAT (Parshall et al., 2002). CMT aims to divide test-takers into mastery and non-mastery groups (Folk & Smith, 2002, pp. 49-50). Lastly, ATA selects items from an item pool according to the test plan, making use of content and statistical information; this kind of test has a fixed time and is not adaptive (Parshall et al., 2002, p. 10).
Kearsley (1996) emphasized the importance of the web and its future potential as an educational tool many years ago. The web is not only a means of delivering information, materials and news from one part of the world to the rest; it is also one of the most commonly used and valuable resources for teachers for a variety of purposes, such as searching for different types of materials, teaching at a distance, presenting, and preparing and delivering tests. The reason behind this change is that, since the 1990s, international connectivity has no longer been limited to university teaching staff and their use of networks in computer labs, and this has without doubt brought many changes. As for testing applications, universal access to computer-assisted assessment has been introduced, and a wealth of opportunities for autonomous learning and self-assessment has spread all around the world, and so have computer-based applications. Today, thanks to web-based applications, students and teachers can be in touch universally (Chapelle, 2001, p. 23).
As one form of CAA, web-based testing is driven and delivered specifically by means of the web, which means that tests can be taken anywhere and at any time; this constitutes its great advantage over traditional paper-based and computer-based tests (Roever, 2001). Moreover, the web system also makes it possible to create unique exams, and it rests on substantial mathematical content (McGough, Mortensen, Johnson, & Fadali, 2001). As Roever (2001, pp. 90-91) notes, WBT comes in three forms, low-stakes, medium-stakes and high-stakes assessment, which address different needs. Low-stakes tests are used to give examinees feedback about their performance on a certain subject or skill, and examinees can take these tests wherever they want. Medium-stakes assessment covers midterm and final exams given in classes, placement tests, or any tests that have an impact on examinees’ lives; these tests are carried out by proctors in a lab. Lastly, high-stakes assessment is the type whose results may greatly affect an examinee’s life, such as admission to a university, certification programs or citizenship tests. Among these three types, WBT is most useful for low-stakes assessment.
A question passes through three phases on the web: preparation, delivery and assessment. An item is first created at authoring time: teachers can prepare questions and store them in an item bank by using web tools. Then, questions or items are selected in order to conduct the test; the selection of items is done either statically by teachers themselves or dynamically by the system at run time (Brusilovsky & Miller, 1999, p. 2). After delivering the items and conducting the exam, examinees’ answers are assessed as correct, incorrect or partially correct. In web technology, preparing, delivering and assessing questions are based on HTML code (Brusilovsky & Miller, 1999, pp. 2-3).
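A minimal sketch of these three phases, with a hypothetical item format and markup that are not drawn from any real testing system, might look like this:

```python
# Illustrative sketch of the three phases described above for a web-delivered
# item: authoring (item bank), delivery (rendering as HTML), and assessment
# (scoring). The item format and markup are hypothetical placeholders.

item_bank = {
    "q1": {
        "stem": "Which principle requires a test to measure what it claims to measure?",
        "options": ["reliability", "validity", "practicality", "washback"],
        "answer": "validity",
    },
}

def render_html(item_id: str) -> str:
    """Delivery phase: turn a stored item into a simple HTML form fragment."""
    item = item_bank[item_id]
    options = "".join(
        f'<label><input type="radio" name="{item_id}" value="{o}">{o}</label>'
        for o in item["options"]
    )
    return f"<p>{item['stem']}</p>{options}"

def score(item_id: str, response: str) -> float:
    """Assessment phase: 1.0 correct, 0.0 incorrect (other item types could allow partial credit)."""
    return 1.0 if response == item_bank[item_id]["answer"] else 0.0

print(render_html("q1"))
print(score("q1", "validity"))   # 1.0
```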
The last mode of CAA, computer-adaptive testing (CAT), which tailors the exam to each student's performance during the test, has been used for many years. The CAT cycle begins with a question that is neither very easy nor very difficult. According to each test-taker's answer to the item, the next question to be drawn from the item pool is decided: if a test-taker answers a question correctly, the next item will be harder or of equal difficulty; if a test-taker answers incorrectly, the next item will be easier. Hence, CAT is said to be performance-based (Chapelle, 2001; Flaugher, 2000; Guzman & Conejo, 2005; Lilley et al., 2004), and this individualized exam model (Wainer & Eignor, 2000, p. 1) offers a more confidential testing atmosphere for both teachers and students (Guzman & Conejo, 2005; Linden & Glas, 2002). Students see one item on the screen at a time and cannot skip questions. While the test-takers work on each question, the system calculates their scores and decides which question will come next in relation to the previous answers (Brown, 2004; Hughes, 2003). The measurement model underlying CATs is known as Item Response Theory (IRT) or Latent Trait Theory, the mathematical bases of which were outlined by Lord and Novick around the 1970s (Stevenson & Gross, 1991, p. 224; Tung, 1986, pp. 4-5).
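As a rough illustration of this adaptive cycle, the sketch below (hypothetical Python; the item pool, the fixed step size, and the stopping rule are simplifying assumptions, not an operational CAT or IRT algorithm) selects a harder item after a correct answer and an easier one after an incorrect answer:

# Minimal sketch of a computer-adaptive test loop (illustrative only).
def run_cat(item_pool, answer_fn, n_items=5):
    # item_pool: list of dicts with a 'difficulty' value; answer_fn(item) -> True/False
    pool = list(item_pool)
    target = sum(it["difficulty"] for it in pool) / len(pool)  # start near medium difficulty
    administered, score = [], 0
    for _ in range(min(n_items, len(pool))):
        # choose the unused item whose difficulty is closest to the current target
        item = min(pool, key=lambda it: abs(it["difficulty"] - target))
        pool.remove(item)
        correct = answer_fn(item)
        score += correct
        administered.append((item["difficulty"], correct))
        # adapt: aim for a harder item after a correct answer, an easier one after an error
        target = item["difficulty"] + (0.5 if correct else -0.5)
    return administered, score

pool = [{"difficulty": d} for d in (-2, -1, -0.5, 0, 0.5, 1, 2)]
history, raw_score = run_cat(pool, answer_fn=lambda it: it["difficulty"] <= 0.5)
print(history, raw_score)

A real CAT would instead re-estimate the test-taker's ability with an IRT model (e.g., Rasch) after each response and stop once the estimate is sufficiently precise.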
The idea behind IRT goes back to the psychological measurement model put forward by Alfred Binet and known today through the Stanford-Binet IQ test (Linden & Glas, 2002). Binet's idea of measuring each test-taker individually, according to their performance while taking the test, was accepted as the only adaptive testing approach for more than fifty years (Cisar, Radosav, Markoski, Pinter, & Cisar, 2010), but one drawback of this smart system was noted: despite being truly adaptive, it required experienced and skilled examiners to administer, which made large-scale administration difficult; therefore, it was practical only for small-scale tests (Madsen, 1991). Today, CAT is used not only for small-scale exams but also for large-scale, high-stakes exams. For example, the Graduate Management Admission Test, Microsoft Certified Professional exams, and the Test of
English as a Foreign Language have been administered in the CAT mode (Lilley et al., 2004, p. 110),
and SIETTE is a web-based CAT system used in Spain (Guzman & Conejo, 2005, p. 688).
Many schools and universities have started to benefit from web technology when administering exams. One of them is Iowa State University, which has created the WebCT. This smart system does not require any technical knowledge to use, and teachers can easily create and publish online courses and exams (Chapelle & Douglas, 2006, p. 63). Other online tools that can be utilized include Hot Potatoes, Discovery School Quiz Center, Blackboard, and Questionmark (Chapelle & Douglas, 2006, pp. 72-73).
1.3. Studies on comparability of reliability and validity by test mode
Over the last two decades a number of comparability studies have concentrated on the effects of the
test delivery mode on student performance, i.e., whether the test scores obtained from computer- and
paper-based tests are interchangeable; these are referred to as “mode effects” (Bennett, 2003; Choi,
Kim, & Boo, 2003; Dunkel, 1991; Paek, 2005; TEA, 2008; Wang, Jiao, Young, Brooks, & Olson,
2007). These studies often revealed mixed results regarding the comparability issues of CBT and PBT
in different content areas. Some studies show that CBTs are more challenging than PBTs (Creed,
Dennis, & Newstead, 1987; Laborda, 2010) or vice versa (Chin, 1990; Dillon, 1994; Yağcı, Ekiz &
Gelbal, 2011), whereas some studies conclude that CBTs and PBTs are comparable (Akdemir &
Oğuz, 2008; APA, 1986; Bugbee, 1996; Choi, et al., 2003; Choi & Tinkler, 2002, cited in Wang &
Shin, 2009; Higgins, Russell, & Hoffmann, 2005; Jeong, 2014; Kim & Huynh, 2007; Logan, 2015;
Muter, Latremouille, Treurniet, & Beam, 1982; Paek, 2005; Parshall & Kromrey, 1993; Retnawati,
2015; Russell, Goldberg, & O'Connor, 2003; Stevenson & Gross, 1991; Tsai & Shin, 2012; Wang et
al., 2007; Wang & Shin, 2009; Yaman & Çağıltay, 2010).
In her comprehensive review, Paek (2005, p. 17) concludes that overall CBT and PBT "versions of traditional multiple-choice tests are comparable across grades and academic content." Higgins et al. (2005) conducted a study with 219 fourth-grade students in an attempt to identify any score differences in reading comprehension between groups resulting from the test-mode effect; their research revealed no statistically significant differences. Similarly, in the study of Akdemir and Oğuz (2008), 47 prospective teachers in the departments of Primary School Teaching and Turkish Language and Literature took an achievement test consisting of thirty questions both on computer and on paper. No statistically significant difference was found between the test-takers' scores across the two administration modes, and the researchers concluded that "computer-based testing could be an alternative to paper-based testing" (p. 123). Hosseini, Abidin, and Baghdarnia (2014) compared a multiple-choice reading comprehension test administered on computer and on paper and, again, found no significant difference. Retnawati (2015) compared the scores of participants who took a paper-based Test of English Proficiency with those of participants who took the computer-based version of the test, and the results revealed that scores in both exam modes were quite similar. Lastly, Logan (2015) examined students' performance differences across exam administration modes in mathematics. In total, 807 sixth-grade Singaporean students took a 24-item mathematics test and a paper folding test either on computer or on paper, and the results showed no significant difference. In contrast, Choi et al. (2003) found that taking a listening test on computer offered an advantage to test-takers, who scored higher than on a paper-based listening test. Yağcı et al. (2011) carried out a similar study at a state university, this time with 75 vocational school students in the department of business administration. In order to reveal probable differences in academic success, the exam was administered in two ways (CBT versus PBT) and the participants' scores were compared. Students who had taken the computer-assisted exam outperformed those who had taken the paper-based version.
Hensley (2015) carried out a study with 142 students in the department of mathematics at the University of Iowa with the aim of comparing students' scores on paper-based and computer-based tests; a significant difference was found between the two test modes, so the scores were not considered comparable. A recent study by Hakim (2017) with 200 female students in Saudi Arabia whose English proficiency was at the B1 level likewise showed statistically significant differences between the CBT and PBT versions of the tests.
Although professional assessment standards attach great importance to the comparability of CBTs
and PBTs, there has been little empirical research that examines the impact of technology on the two
main aspects of the assessment, which include the concepts of validity and reliability (Al-Amri, 2008;
Chapelle, 1998; 1999; 2001; Chapelle & Douglas, 2006). For example, in a recent study, Chua (2012)
compared the reliabilities of CBTs and PBTs by using computer- and paper-based versions of the
multiple-choice Yanpiaw Creative-Critical Styles test (YBRAINS) and the Testing Motivation
Questionnaire (TMQ) with a five-point Likert scale. The findings revealed that the reliability values
were close to each other in CBTs and PBTs. However, Chua (2012) stated that the results might have
been different if achievement tests had been used in the study since the test takers’ motivation, desire
to achieve high scores, and the context of the test might affect the scores. Dermo (2009) also carried out a study with 130 undergraduate students who took online tests; the research examined six aspects: affective factors, validity, practical issues, reliability, security, and learning and teaching. According to the results, participants regarded taking online tests as practical and secure, and the validity and reliability of the online tests appeared appropriate and related to the curriculum. Al-Amri (2008) administered three tests, each of which every participant took once on computer and once on paper. In order to determine the effect of the
testing mode on reliability, he examined the internal consistency (Cronbach’s alpha) of CBTs and
PBTs and the results indicated that the internal reliability coefficients ranged between .57 and .70, not
as high as expected. In order to check concurrent validity of the tests, on the other hand, a correlational
analysis was conducted and the results indicated that each PBT significantly correlated with its
computerized version. Overall, there was no significant effect of the test administration mode on the overall reliability and validity of the tests. In another study (Boo, 1997, cited in Al-Amri, 2008), the test
administration mode did not have any impact on the reliability of tests. Utilizing an EFL test battery
entitled the Test of English Proficiency developed by Seoul National University (TEPS), Choi et al.
(2003) investigated the comparability between PBT and CBT based on content and construct
validation. Although they did not focus on the measurement of course learning outcomes in higher
education, their findings supported comparability between the CBT and PBT versions of the TEPS
subtests (listening comprehension, grammar, vocabulary, and reading comprehension) in question.
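The two checks Al-Amri (2008) reports, internal consistency within each mode and the correlation between each examinee's paper and computer scores, correspond to standard computations. A minimal sketch follows (hypothetical Python with randomly generated responses; the numbers it prints are meaningless placeholders, not results from any study discussed above):

import numpy as np
from scipy import stats

def cronbach_alpha(item_scores):
    # item_scores: 2-D array, rows = examinees, columns = items
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1)
    total_var = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
pbt_items = rng.integers(0, 2, size=(50, 30))  # 50 examinees x 30 dichotomous items (invented)
cbt_items = rng.integers(0, 2, size=(50, 30))

print("alpha (PBT):", round(cronbach_alpha(pbt_items), 2))
print("alpha (CBT):", round(cronbach_alpha(cbt_items), 2))

# Concurrent-validity check: correlate each examinee's total scores across the two modes.
r, p = stats.pearsonr(pbt_items.sum(axis=1), cbt_items.sum(axis=1))
print("PBT-CBT correlation:", round(r, 2), "p =", round(p, 3))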
On the other hand, Semerci and Bektaş (2005) conducted a survey about how to improve the
validity of web-based tests. In this regard, they collected data from four different state universities
(Anadolu, Sakarya, Fırat Universities and METU) in Turkey, where web-based tests were being
administered. The researchers sent emails to a total of 45 people at those universities as to collect data
for the study, and only 33 of them wrote back. After the data were analyzed, some ways to improve
the validity of web-based tests were defined: Digital identities like fingerprint and voice control should
be used; teachers should encourage learners to make projects and research; mini-quizzes and video-
conferencing can foster learning, so teachers should make use of them in their courses. Within a
similar vein, Delen (2015) aimed to focus on how to increase the validity and reliability of computer-
assisted assessment. In this sense, optimum item response time for each question was shown on the
screen when the participants were busy with answering the exam items, and the findings revealed that
. H. Öz, T. Özturan / Journal of Language and Linguistic Studies, 14(1) (2018) 67–85 73
if students were offered optimum item response time, more valid and reliable tests would be achieved
than paper-based tests.
Our review of the related literature indicates that although there have been numerous studies that
compare CBTs and PBTs in terms of mean scores, there is little research that specifically deals with
the criteria of adequate reliability and accuracy of measurement. Wang and Kolen (2001) developed a
framework of criteria for evaluating the comparability between CAT and PBT: (1) validity, (2)
psychometric/reliability, and (3) statistical assumption/test administration. We assume that these three
criteria can also be used to evaluate the comparability between the linear CBTs and PBTs.
1.4. Research questions
To the best of our knowledge, at a time when Turkish institutions of higher education are on the
eve of considering the computerized administration of assessments, there is not even a single study
that deals with the comparability of computer- and paper-based tests in English language teacher
education programs. Thus, the present research grew out of a desire to learn whether the validity and
reliability principles of assessment would be influenced by the test administration mode when pre-service English teachers took an achievement test for their pedagogical content knowledge course. Accordingly, the following research questions were formulated to guide the present study:
1. To what extent are the results of a paper-based test (PBT) comparable to those of its CBT version?
2. If the PBT in question has satisfied the criteria of adequate reliability and accuracy of
measurement, can its CBT version be considered to have equal reliability and accuracy of
measurement?
2. Method
The study employed a quantitative experimental model with a posttest-only design; no pretests were used. After the participants had been randomly assigned to two groups, the control group took the achievement test in the traditional paper-based way while the experimental group took the same exam through a computer-assisted system. When the exam was over, both groups were administered a questionnaire adapted to gather background information about the participants and their attitudes towards computer-assisted assessment.
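The analysis plan is not shown in this portion of the article, but a posttest-only design with two randomly assigned groups is commonly analyzed by directly comparing the groups' exam scores, for example with an independent-samples t test. A minimal sketch under that assumption (hypothetical Python; the score vectors are invented placeholders for the PBT and CBT groups, not the study's data):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
pbt_scores = rng.normal(loc=70, scale=10, size=50)  # control group: paper-based test (invented)
cbt_scores = rng.normal(loc=70, scale=10, size=50)  # experimental group: computer-based test (invented)

# Welch's independent-samples t test on the posttest scores
t, p = stats.ttest_ind(pbt_scores, cbt_scores, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3f}")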
2.1. Participants
The participants for this study consisted of a total of 100 student teachers enrolled in the Approaches to ELT course in the English language teaching (ELT) department at Hacettepe University. They had already been enrolled in three different sections of the course and were taking it from the same faculty member before the study started. They were randomly assigned to the …
Methodological and Statistical Advances in the Consideration of Cultural
Diversity in Assessment: A Critical Review of Group Classification and
Measurement Invariance Testing
Kyunghee Han, Stephen M. Colarelli, and Nathan C. Weed
Central Michigan University
One of the most important considerations in psychological and educational assessment is the extent to
which a test is free of bias and fair for groups with diverse backgrounds. Establishing measurement
invariance (MI) of a test or items is a prerequisite for meaningful comparisons across groups as it ensures
that test items do not function differently across groups. Demonstration of MI is particularly important
in assessment settings where test scores are used in decision making. In this review, we begin with an
overview of test bias and fairness, followed by a discussion of issues involving group classification,
focusing on categorizations of race/ethnicity and sex/gender. We then describe procedures used to
establish MI, detailing steps in the implementation of multigroup confirmatory factor analysis, and
discussing recent developments in alternative procedures for establishing MI, such as the alignment
method and moderated nonlinear factor analysis, which accommodate reconceptualization of group
categorizations. Lastly, we discuss a variety of important statistical and conceptual issues to be
considered in conducting multigroup confirmatory factor analysis and related methods and conclude with
some recommendations for applications of these procedures.
Public Significance Statement
This article highlights some important conceptual and statistical issues that researchers should
consider in research involving MI to maximize the meaningfulness of their results. Additionally, it
offers recommendations for conducting MI research with multigroup confirmatory factor analysis
and related procedures.
Keywords: test bias and fairness, categorizations of race/ethnicity and sex/gender, measurement
invariance, multigroup CFA
Supplemental materials: http://dx.doi.org/10.1037/pas0000731.supp
When psychological tests are used in diverse populations, it
is assumed that a given test score represents the same level of
the underlying construct across groups and predicts the same
outcome score. Suppose that two hypothetical examinees, a
middle-aged Mexican immigrant woman and a Jewish European
American male college student, each produced the same score
on a measure of depression. We would like to conclude that the
examinees exhibit the same severity and breadth of depression
symptoms and that their therapists would rate them similarly on
relevant behavioral and symptom measures. If empirical evi-
dence indicates otherwise, and such conclusions are not justi-
fied, scores on the measure are said to be biased.
Although it has been defined variously, a representative
definition refers to psychometric bias as “systematic error in
estimation of a value”). A biased test “is one that systematically
overestimates or underestimates the value of the variable it is
intended to assess” due to group membership, such as ethnicity
or gender (Reynolds & Suzuki, 2013, p. 83). The “value of the
variable it is intended to assess” can either be a “true score” (see
S1 in the online supplemental materials) on the latent construct
or a score on a specified criterion measure. The former appli-
cation concerns what is sometimes termed measurement bias, in
which the relationship between test scores and the latent attri-
bute that these test scores measure varies for different groups
(Borsboom, Romejin, & Wicherts, 2008; Millsap, 1997),
whereas the latter application concerns what is referred to as
predictive bias, which entails systematic inaccuracies in the
prediction of a criterion from a test depending upon group
membership (Cleary, 1968; Millsap, 1997).
Kyunghee Han, Stephen M. Colarelli, and Nathan C. Weed, Department
of Psychology, Central Michigan University.
This article has not been published elsewhere, nor has it been
submitted simultaneously for publication elsewhere. The author(s) de-
clared no potential conflicts of interest with respect to the research,
authorship, and/or publication of this article. The author(s) received no
funding for this study.
Correspondence concerning this article should be addressed to Kyung-
hee Han, Department of Psychology, Central Michigan University, Mount
Pleasant, MI 48859. E-mail: [email protected]
Psychological Assessment, 2019, Vol. 31, No. 12, 1481–1496. © 2019 American Psychological Association. http://dx.doi.org/10.1037/pas0000731
Test bias should not be confused with test fairness. Although the
two concepts have been used interchangeably at times (e.g., Hunter
& Schmidt, 1976), test fairness entails a broader and more sub-
jective evaluation of assessment outcomes from perspectives of
social justice (Kline, 2013), whereas test bias is an empirical
property of test scores, estimated statistically (Jensen, 1980). Ap-
praisals of test fairness include multifaceted aspects of the assess-
ment process, lack of test bias being only one facet (American
Educational Research Association, American Psychological Asso-
ciation [APA], & National Council on Measurement in Education,
2014; Society for Industrial Organizational Psychology, 2018; see
S2 in the online supplemental materials).
In the example above, the measure of depression may be unfair
for the Mexican female client if an English language version of the
measure was used without evaluating her English proficiency, if
her score was derived using American norms only, if computerized
administration was used, or if use of the test leads her to be less
likely than members of other groups to be hired for a job. Although
test bias is not a necessary condition for test unfairness to exist, it
may be a sufficient condition (Kline, 2013). Accordingly, it is
especially important to evaluate whether test scores are biased
against vulnerable groups.
The evaluation of test bias and test fairness each entails a
comparison of one group of people with another. While asking the
question, “Is a test biased?” we are also implicitly asking “against
or for which group?” Similarly, if we are concerned about using a
test fairly, we must ask: are the outcomes based on the results of
the test apportioned fairly to groups of people who have taken the
test? Thus, the categorization of people into distinct groups is a
sine qua non of many aspects of psychological assessment re-
search. Racial/ethnic and sex/gender categories are prominent fea-
tures of the social, cultural, and political landscapes in the United
States (e.g., Helms, 2006; Hyde, Bigler, Joel, Tate, & van Anders,
2019; Jensen, 1980; Newman, Hanges, & Outtz, 2007), and have
therefore been the most commonly studied group variables in bias
research (e.g., Warne, Yoon, & Price, 2014). Most of the initial
research on and debates about test bias and fairness in the United
States stemmed from political movements addressing race and sex
discrimination (e.g., Sackett & Wilk, 1994). In service of pressing
research on questions of discrimination and economic inequality, it
thus became commonplace among psychologists and social scien-
tists to categorize people crudely into groups (based primarily on
race, ethnicity, and sex/gender) without much thought to the mean-
ing and validity of those categorizations (e.g., Hyde et al., 2019;
Yee, 1983; Yee, Fairchild, Weizmann, & Wyatt, 1993). This has
changed somewhat over the past two decades as scholarship by
psychologists and others has increasingly focused on nuances of
identity, multiculturalism, intersectionality, and multiple position-
alities (Cole, 2009; Song, 2017). This scholarship has emphasized
that racial, ethnic, and gender classifications can be complex,
ambiguous, and debatable—and that identities are often self-
constructed and can be fluid (Helms, 2006; Hyde et al., 2019). The
first goal of this review, therefore, is to overview contemporary
issues involving race/ethnicity and sex/gender classifications in
bias research and to describe alternative approaches to the mea-
surement of these variables.
The psychometric methods used to examine test bias usually
depend on the definition of test bias operating for a given appli-
cation. Evaluating predictive bias (i.e., establishing predictive in-
variance) often involves regressing total scores from a criterion
measure onto total scores on the measure of interest, and compar-
ing regression slopes and intercepts across groups (Cleary, 1968).
Evaluating measurement bias (i.e., establishing measurement in-
variance [MI]) often necessitates more advanced quantitative meth-
ods, such as confirmatory factor analysis (CFA) or methods deriv-
ing from item response theory, to compare the properties of item
scores and scores on latent variables across different groups.
Multigroup confirmatory factor analysis (MGCFA) has been one
of the most commonly used techniques to examine MI (Davidov,
Meuleman, Cieciuch, Schmidt, & Billiet, 2014) because it pro-
vides a comprehensive framework for evaluating different forms
of MI. The second goal of this review is to provide a broad
overview of MGCFA and related procedures and their relevance to
psychological assessment.
Although MGCFA is a well-established procedure in the eval-
uation of MI, it has limitations. MGCFA is not an optimal method
for conducting MI tests when many groups are involved. More-
over, the grouping variable in MGCFA must be categorical, and
therefore does not permit MI testing with continuous grouping
variables (e.g., age). As modern research questions may require MI
testing across many groups, and with continuous reconceptualiza-
tions of some of the grouping variables (e.g., gender), more flex-
ible techniques are needed. Our third goal, therefore, is to describe
two recent alternative methods for MI testing, the alignment
method and moderated nonlinear factor analysis, that aim to over-
come these limitations. We conclude the review with a discussion
of some important statistical and conceptual issues to be consid-
ered when evaluating MI, and include a list of recommended
practices.
Group Classifications Used in Bias Research
Racial and Ethnic Classifications
Race and ethnicity (see S3 in the online supplemental materials)
are conceptually vague and empirically complex social constructs
that have been examined by numerous researchers across many
disciplines (Betancourt & López, 1993; Helms, Jernigan, &
Mascher, 2005; Yee et al., 1993). Consider race. As a biological
concept, it is essentially meaningless. In most cases, there is more
genetic variation within so-called racial groups than between racial
groups (Witherspoon et al., 2007). Even if we allow race to be
defined by a combination of specific morphological features and
ancestry, few “racial” populations are pure (Gibbons, 2017). Most
are mixed—like real numbers, with infinite gradations. For exam-
ple, although many African Americans trace their ancestry to West
Africa, about 20% to 30% of their genetic heritage is from Euro-
pean and American Indian ancestors (Parra et al., 1998), and racial
admixture continues as the frequency of interracial marriages
increases (Rosenfeld, 2006; U.S. Census Bureau, 2008). Even if
one were to accept race as a combination of biological features and
cultural and social identities (shared cultural heritage, hardships,
and discrimination), there is the problem of degree. For example,
while many Black Americans share social and cultural identities
based on roots in American slavery and racial discrimination, not
all do, such as recent Black immigrants from the Caribbean. Racial
and ethnic classifications are often conflated. In psychological
research, “Asian” is commonly used both as a cultural (Nisbett,
Peng, Choi, & Norenzayan, 2001) and racial category (Rushton,
1994). Yet it is a catch-all term based primarily on geography. It
typically refers to people from (or whose ancestors are from)
South, Southeast, and Eastern Asia. The term Hispanic often
conflates linguistic, cultural, and sometimes even morphological
features (Humes, Jones, & Ramirez, 2010).
In public policy, mixtures of racial (or ethnic) backgrounds have only recently begun to be addressed. The U.S. Census, for exam-
ple, did not include a multiracial category until 2000 (Nobles,
2000). We are only beginning to see assessment studies that parse
people from traditional broad groupings into smaller, more mean-
ingful and homogeneous groups. In one of the few studies that
identified different types of Asians, Appel, Huang, Ai, and Lin
(2011) found significant (and sometimes major) differences in
physical, behavioral, and mental health problems among Chinese,
Vietnamese, and Filipina women in the U.S. More recently, Tal-
helm et al. (2014) found important differences in culture and
thought patterns within only one Asian country, China. People in
northern China were significantly more individualistic than those
in southern China, who were more collectivistic. With current and
historical farming practices as their theoretical centerpiece, they
examined farming practices as causal factors. In northern China
wheat has been farmed as a staple crop for millennia, whereas in
southern China rice has been (and is) the staple crop. Talhelm et al.
argued that the farming practices required by these two crops
required different types of social organization that, over time,
influenced cultural values and cognition. The work by Talhelm and
colleagues is important because it is one of the first studies to
show—along with a powerful theoretical rationale—that there are
important cultural differences between people from what has typ-
ically been thought of as a relatively homogeneous racial and
cultural group.
In another seminal article, Gelfand and colleagues (2011) ex-
amined the looseness-tightness dimension of cultures in 33 coun-
tries. This dimension reflects the strength of norms and the toler-
ance of deviant behavior. Loose cultures have weaker norms and
are more tolerant of deviant behavior. While there was substantial
variation between countries, there was still considerable variation
among countries typically considered “Asian.” Hong Kong was the
loosest (6.3), while Malaysia was the tightest (11.8), with the
People’s Republic of China (7.9), Japan (8.6), South Korea (10),
and Singapore (10.4) in between. To say that all Asian countries
are culturally similar is untenable when, for example, Malaysian
culture is 88% tighter than Hong Kong culture.
Gender Classifications
Binary categories. Gender (see S4 in the online supplemental
materials) differences or similarities on psychological constructs
have been a widely researched topic (e.g., Feingold, 1994; Hyde,
2005) since the 1970s (Eagly & Riger, 2014), with many studies
assuming the existence of clear qualitative and quantitative differ-
ences between genders (Brizendine, 2006; Ruigrok et al., 2014).
These include numerous studies examining bias or MI across
gender (e.g., Baker & Mason, 2010; Linn & Kessel, 2010) in
which researchers have tended to employ a binary categorization
of gender. However, the binary gender categorization and the
presumption of qualitative gender difference have recently been
challenged (APA, 2015, 2017; Richards et al., 2016; Richards,
Bouman, & Barker, 2017; Schellenberg & Kaiser, 2018).
In a recent article, Hyde and colleagues (2019) reviewed em-
pirical findings from five disciplines and challenged the legitimacy
of binary gender classification in each: (a) neuroscience (sexual
dimorphism of the human brain), (b) behavioral neuroendocrinol-
ogy (the argument for genetically fixed, nonoverlapping, sexually
dimorphic hormonal systems), (c) research on psychological vari-
ables (the inference of clear gender differences on psychological
constructs), (d) research with transgender and nonbinary individ-
uals (the assumption that gender identities and experiences are
consistent with gender assigned at birth), and (e) research from
developmental psychology (arguing that binary gender categories
are culturally universal and unmalleable).
Nonbinary identities. There is a wide variety of nonbinary gen-
der identities (APA, 2015; Hyde et al., 2019; Richards et al., 2016,
2017). Intersex individuals have physical characteristics outside
the typical binary male-female categories, although most still
identify their gender within the binary system (Richards & Barker,
2013). More common are people who are not physiologically
intersex but who have nonbinary gender identities. Although terms
are often subsumed within the umbrella terms of nonbinary or
genderqueer identities, various more specific labels have been
used: androgynous, mixed gender, or pangender to indicate incor-
porating aspects of both male and female, but having a fixed
identity; bigender or gender fluid to indicate movement between
gender in a fluid way; trigender to indicate moving between more
than two genders; third gender or other gender to identify a
specific additional gender; gender queer to challenge the binary
gender system; agender, gender neutral, genderless, nongendered,
or neuter to indicate no gender (Richards et al., 2016). The
frequency of people with nonbinary gender identities, while small
compared to the population at large, is not trivial. For example, one
Dutch study found that 4.6% of people assigned male identities at birth and 3.2% assigned female identities at birth have ambivalent
gender identities (Kuyper & Wijsen, 2014). Others—people who
regard themselves as asexual—have what might be called no gender
identity (Carrugan, 2015).
The trend toward acceptance of diversity in gender classification
has been reflected in various professional and societal contexts.
Within the context of mental health care, the Diagnostic and
Statistical Manual, 5th edition (American Psychiatric Association,
2013) removed the Diagnostic and Statistical Manual–IV–TR
(American Psychiatric Association, 2000) diagnosis of gender
identity disorder and recognized nonbinary genders within the
diagnostic taxonomy. People with nonbinary gender identities
have become more politically active, with visible results in recent
years both in official documentation and in media coverage
(Scelfo, 2015). For example, New Zealand passport holders can
claim one of three gender identities: male, female, or X (other).
New York, New York, has recently joined Oregon, California, Wash-
ington, and New Jersey in offering a nonbinary gender marker (X) on
birth certificates for residents who do not identify as male or female
(Hafner, 2019). In an effort to foster an environment of inclusiveness
and supporting students’ preferred form of self-identification, many
universities in the United States allow students to choose preferred
gender pronouns (https://www.npr.org/2015/11/08/455202525/more-
universities-move-to-include-gender-neutral-pronouns). This move-
ment has also been reflected in product development and marketing;
numerous vendors specialize in gender neutral clothing or other items
(e.g., https://www.lgbtqnation.com/2017/07/target-hits-back-binary-
gender-neutral-clothing-line/; http://www.foxnews.com/us/2015/08/
13/target-going-gender-neutral-in-some-sections.html).
Emerging Recommendations Regarding Race/Ethnicity
and Sex/Gender Categorizations
Results of studies of test bias rest to a considerable degree on
how grouping variables such as race/ethnicity and sex/gender are
operationalized. Cultural researchers have recently proposed a
number of suggestions for reconceptualizing and operationalizing
these grouping variables. Hyde et al. (2019) and other researchers
(e.g., Bittner & Goodyear-Grant, 2017; Schellenberg & Kaiser,
2018; Tate, Ledbetter, & Youssef, 2013) warn about the costs of
binary sex/gender categorization in research and provide helpful
suggestions for reconceptualizing and measuring sex/gender: (a)
providing multiple categories of gender identity within available
response options (“female,” “male,” “transgender female,” “trans-
gender male,” “genderqueer” [click for more options: “agender,”
“bigender,” etc.], “intersex,” and other [specify]); (b) asking about
both birth-assigned and self-assigned gender/sex identities; (c)
using an open-ended response format (e.g., “what is your gen-
der?”); (d) treating gender/sex constructs as multidimensional (use
of multiple measures of gender/sex identities, stereotypes, or be-
haviors), dynamic, and continuous (e.g., “In the past 12 months,
have you thought of yourself as a man?” “In the past 12 months,
have you thought of yourself as a woman?” “How would you rate
yourself on the continuum from 0 to 100 regarding “maleness?”
“How would you rate yourself on the continuum from 0 to 100
regarding “femaleness?”); and (e) asking about gender identity at
the end of a study so that it does not influence responding.
Numerous researchers (e.g., Cole, 2009; Helms et al.,
2005; Yee, 1983) have proposed ways to reconceptualize race/
ethnicity in research. When researchers separate people into
groups and compare them, careful attention must be given to (a) what the categories mean, (b) efforts to ensure their homogeneity, (c) how the categories are theoretically related to the substantive constructs under study, (d) treating race/ethnicity constructs as multidimensional (use of multiple measures of race/ethnicity identities, stereotypes, socioeconomic status [SES], or racism), and (e) intersectional views of the categories (see below). Racial catego-
rization should be avoided in research without clear conceptual
reasons.
It is clear that people fall into multiple grouping categories—
each person is simultaneously a member of a gender, race, ethnic-
ity, class, and sexual orientation (APA, 2017). The concept of
“intersectionality” has been developed by feminist and critical race
theorists (Cole, 2009; Eagly & Riger, 2014) to encourage research-
ers to understand individuals from “a vast array of cultural, struc-
tural, sociobiological, economic, and social contexts by which
individuals are shaped and with which they identify” (APA, 2017,
p. 19). How might we conceptualize and examine test bias across
combinations of categories? Studies of psychometric bias are
typically conducted from the perspective of one group membership
at a time mainly due to methodological complications in imple-
menting multiple group categories into testable models. Research
teams such as Corral and Landrine (2010) and Else-Quest and
Hyde (2016) have provided recommendations for incorporating
intersectional approaches to numerous facets of research (i.e.,
theory, design, sampling techniques, measurement, data analytic
strategies, and interpretation and framing). More work building on
these recommendations will be needed to meet the research chal-
lenges of intersectionality in the context of bias testing.
Testing Bias
As mentioned earlier, the statistical methods used to examine
test bias usually depend on the definition of test bias operating for
a given application (see S5 in the online supplemental materials).
If test scores can be used to predict some future outcome (a
criterion), they are said to demonstrate predictive validity (Cleary,
1968). In general, a test is considered not biased (demonstrating
“predictive invariance”) if its scores predict future outcomes
equally well for individuals from different groups. However, if test
scores are better at predicting outcomes for some groups than for
others, the test scores are said to reflect differential predictive
validity or slope bias (Camilli, 2006). If test scores systematically
overpredict or underpredict criterion scores for one group relative
to another, the scores are said to reflect intercept bias (see S6 in the
online supplemental materials).
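In practice, the slope and intercept comparisons described above are often carried out with a moderated regression: the criterion is regressed on the test score, a group indicator, and their product; a significant interaction suggests slope bias, while a significant group effect in a common-slope model suggests intercept bias. The sketch below illustrates that generic procedure (hypothetical Python with simulated, unbiased data; it is not the analysis of any study cited here):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
group = rng.integers(0, 2, size=n)                 # 0 = reference group, 1 = focal group
test = rng.normal(50, 10, size=n)                  # predictor (test scores)
criterion = 0.6 * test + rng.normal(0, 5, size=n)  # criterion simulated without any bias
df = pd.DataFrame({"criterion": criterion, "test": test, "group": group})

# Slope bias: does the test-criterion slope differ across groups?
slope_model = smf.ols("criterion ~ test * group", data=df).fit()
print("interaction p =", round(slope_model.pvalues["test:group"], 3))

# Intercept bias: with a common slope, do the groups' intercepts differ?
intercept_model = smf.ols("criterion ~ test + group", data=df).fit()
print("group p =", round(intercept_model.pvalues["group"], 3))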
Relative to predictive invariance, MI has been investigated more
frequently and more rigorously in recent years due to advances in
statistical procedures, including the development of software that
allows researchers to carry out MI testing more easily. Researchers
have examined how predictive invariance is related to MI and
argued that evidence for one form of invariance is not evidence in
support of the other, but may in some cases serve as evidence
against the other (Borsboom et al., 2008; see Millsap, 1995, 1997,
2007, for a mathematical proof of this argument). Investigating
both forms of invariance simultaneously in the same study is ideal,
but seldom achieved (Millsap, 2007), with some exceptions (Cul-
hane, Morera, Watson, & Millsap, 2009; Wright, Kutschenko,
Bush, Hannum, & Braddy, 2015). Researchers tend to prefer to
evaluate one form over the other depending upon the context of
assessment. For example, in assessment research related to per-
sonnel selection, evaluating predictive invariance is favored over
evaluating MI (Borsboom, 2006; Society for Industrial Organiza-
tional Psychology, 2018). However, even in this context, item-
level MI analyses are recommended when conducting cross-
cultural research involving linguistically different populations.
The ease of evaluating predictive bias varies greatly depend-
ing upon the psychometric integrity of the criterion variable in
question. It is possible for differential predictive validity to
reflect statistical artifact due to measurement error in scores on
the criterion and predictor (Kane & Mroch, 2010; Warne et al.,
2014). Moreover, other confounding variables (e.g., time gap
between obtaining test and criterion data) need to be controlled
across groups when examining predictive invariance. These
complications (see Borsboom et al., 2008, for a comprehensive
discussion) are especially problematic in cross-cultural research
involving translated measures. Therefore, Borsboom and col-
leagues (2008) argued that psychologists should favor MI testing
over predictive invariance testing in defining and measuring
bias. For the rest of this article, we focus on procedures asso-
ciated with MI.
Testing Measurement Invariance: Multigroup
Confirmatory Factor Analysis
The popularity of research involving MI has increased exponen-
tially in recent years. An informal PsycINFO database search of
titles, keywords, and abstracts associated with peer reviewed jour-
nal articles from 1980 to January 31, 2018, on group and mea-
surement and invariance or equivalence located 55 unique articles
published from 1980 to 1989, 91 articles published from 1990 to
1999, 459 articles published from 2000 to 2009, and 1,504 articles
published from 2010 to 2017. The increased interest in this topic
may reflect globalization of the social sciences, increased empha-
sis on fair assessment for diverse populations, and statistical ad-
vances permitting rigorous testing of MI (Davidov et al., 2014;
Sass, 2011; Vandenberg, 2002).
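The article's detailed treatment of the MGCFA steps lies beyond this excerpt; conventionally, a configural model, a metric model (equal loadings), and a scalar model (equal intercepts) are fit in sequence across groups, and adjacent models are compared with a chi-square difference test. The sketch below shows only that comparison step, with invented fit statistics standing in for output that would come from SEM software (hypothetical Python):

from scipy import stats

def chi_square_difference(chi2_constrained, df_constrained, chi2_free, df_free):
    # A significant result means the added equality constraints worsen model fit.
    delta_chi2 = chi2_constrained - chi2_free
    delta_df = df_constrained - df_free
    return delta_chi2, delta_df, stats.chi2.sf(delta_chi2, delta_df)

# Invented fit statistics (chi-square, df) for a two-group model:
fits = {"configural": (132.4, 96), "metric": (141.0, 104), "scalar": (168.9, 112)}

for freer, stricter in [("configural", "metric"), ("metric", "scalar")]:
    d_chi2, d_df, p = chi_square_difference(*fits[stricter], *fits[freer])
    print(f"{freer} vs {stricter}: delta chi2 = {d_chi2:.1f}, delta df = {d_df}, p = {p:.3f}")

In this invented example the metric constraints do not significantly worsen fit, whereas the scalar constraints do, which would point toward noninvariant item intercepts.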
Since its development by Jöreskog (1971) and Sörbom (1974)
for use with continuous indicators, MI testing via MGCFA …
Public Personnel Management, 2021, Vol. 50(2), 232–257. © The Author(s) 2020. DOI: 10.1177/0091026020935582
A Critical Examination of
Content Validity Evidence
and Personality Testing for
Employee Selection
David M. Fisher1 , Christopher R. Milane2,
Sarah Sullivan3, and Robert P. Tett1
Abstract
Prominent standards/guidelines concerning test validation provide contradictory
information about whether content-based evidence should be used as a means
of validating personality test inferences for employee selection. This unresolved
discrepancy is problematic considering the prevalence of personality testing, the
importance of gathering sound validity evidence, and the deference given to these
standards/guidelines in contemporary employee selection practice. As a consequence,
test users and practitioners are likely to be reticent or uncertain about gathering
content-based evidence for personality measures, which, in turn, may cause such
evidence to be underutilized when personality testing is of interest. The current
investigation critically examines whether (and how) content validity evidence should
be used for measures of personality in relation to employee selection. The ensuing
discussion, which is especially relevant in highly litigious contexts such as personnel
selection in the public sector, sheds new light on test validation practices.
Keywords
test validation, content validity, personality testing, employee selection
1The University of Tulsa, OK, USA
2Qualtrics, Provo, UT, USA
3Rice University, Houston, TX, USA
Corresponding Author: David M. Fisher, Assistant Professor of Psychology, The University of Tulsa, 800 S. Tucker Drive, Tulsa, OK 74104, USA. Email: [email protected]
An essential consideration when using any test or measurement tool for employee selection is gathering and evaluating relevant validity evidence. In the contemporary employee selection context, validity evidence is generally understood to mean
evidence that substantiates inferences made from test scores. Various sources provide
standards and guidelines for gathering validity evidence, including the Uniform
Guidelines on Employee Selection Procedures (Equal Employment Opportunity
Commission, Civil Service Commission, Department of Labor, & Department of
Justice, 1978; hereafter, Uniform Guidelines, 1978), the Principles for the Validation
and Use of Personnel Selection Procedures (Society for Industrial and Organizational
Psychology [SIOP], 2003; hereafter, SIOP Principles, 2003), and Standards for
Educational and Psychological Testing (American Educational Research Association,
American Psychological Association, & National Council on Measurement in
Education, 1999/2014; hereafter, Joint Standards, 1999/2014), as well as the academic
literature (e.g., Aguinis et al., 2001). Having such a variety of sources available is
beneficial, but challenges arise when the various sources provide ambiguous or con-
tradictory information. Such ambiguity can be particularly troublesome in highly liti-
gious contexts, such as the public sector, where adherence to regulations governing
selection is of paramount importance.
The current investigation attempts to shed light on one such area of ambiguity—
whether evidence based on test content should be used as a means of validating per-
sonality test inferences for employee selection. Rothstein and Goffin (2006) noted, “It
has been estimated that personality testing is a $400 million industry in the United
States and it is growing at an average of 10% a year" (Hsu, 2004, p. 156). Given this
reality, it is important to carefully consider appropriate validation procedures for such
measures. However, the various sources mentioned above present conflicting direc-
tions on this issue, specifically in relation to content-based validity evidence. On one
hand, evidence based on test content is one of five potential sources of validity evi-
dence described by the Joint Standards (1999/2014), which is similarly endorsed by
the SIOP Principles (2003). This form of evidence has further been suggested by some
to be particularly relevant to personality tests (e.g., Murphy et al., 2009; O’Neill et al.,
2009), and especially under challenging validation conditions, such as small sample
sizes, test security concerns, or lack of a reliable criterion measure (Landy, 1986; Tan,
2009; Thornton, 2009). On the other hand, the Uniform Guidelines (1978) assert that
“. . . a content strategy is not appropriate for demonstrating the validity of selection
procedures which purport to measure traits or constructs, such as intelligence, apti-
tude, personality, commonsense, judgment, leadership, and spatial ability [emphasis
added]” (Section 14.C.1). Other sources similarly convey reticence toward content
validity for measures of traits or constructs (e.g., Goldstein et al., 1993; Lawshe, 1985;
Wollack, 1976). Thus, there appears to be conflicting guidance on the use of content
validity evidence to support personality measures.
In light of this discrepancy, the current investigation offers a critical examination of
content validity evidence and personality testing for employee selection. Such an
investigation is valuable for several reasons. First, an important consequence of the
inconsistency noted above is that content-based evidence may be overlooked as a
valuable approach to validation when personality testing is of interest. Evidence for
this can be seen in the fact that other approaches such as criterion-related validation
are sometimes viewed as the only option for personality measures (Biddle, 2011).
Similarly, prominent writings on personality testing in the workplace (e.g., Morgeson
et al., 2007b; O’Neill et al., 2013; Ones et al., 2007; Rothstein & Goffin, 2006; Tett &
Christiansen, 2007) have tended to ignore the applicability of content validation to
personality measures. Furthermore, considering the deference given to the various
standards and guidelines in contemporary employee selection practice (Schmit &
Ryan, 2013), those concerned about strict adherence to such standards/guidelines are
likely to be reticent or uncertain about gathering content-based evidence for personal-
ity measures—in no small part due to conflicting or ambiguous recommendations. The
above circumstances tend to relegate content-based evidence to be seen as less desir-
able or otherwise viewed as an afterthought. In turn, this represents a missed opportu-
nity for valuable insight into the use of personality measures.
Second, the neglect or underutilization of content-based evidence is, in many ways,
antithetical to the broader goal of developing a theory-based and scientifically
grounded understanding of tests and measures used for employee selection (Binning
& Barrett, 1989). For example, as elaborated below, there are various situations in
which content-based evidence may be more optimal than criterion-based evidence, not
the least of which includes an insufficient sample size for a criterion-based investiga-
tion (McDaniel et al., 2011). Similarly, an exclusive focus on empirical prediction
ignores the importance of underlying theory, which is critical for advancing employee
selection research. Of relevance, the examination of content validity evidence forces
one to carefully consider the correspondence between selection measures and underly-
ing construct domains, as informed by theoretical considerations. Evidence for the
value of content validity can also be found in trait activation theory (Tett & Burnett,
2003; Tett et al., 2013), which highlights the importance of a clear conceptual linkage
between the content of personality traits/constructs and the job domain in question.
Thus, content validity evidence should be of primary importance for personality test
validation.
Third, it is useful to acknowledge that the prohibition against content validity evi-
dence in relation to personality measures noted in the Uniform Guidelines (1978)
appears to be at odds with contemporary thinking on validation (Joint Standards,
1999/2014). The focal passage quoted above from the Uniform Guidelines has been
described as being “. . . as destructive to the interface of psychological theory and
practice as any that might have been conceived” (Landy, 1986, p. 1189). Although
there have been well-argued critiques of the Uniform Guidelines (e.g., McDaniel
et al., 2011), in addition to thoughtful elaboration of issues surrounding content valid-
ity (e.g., Binning & LeBreton, 2009), a direct attempt at resolving the noted contradic-
tion remains conspicuously absent from the literature. This contradiction, in
conjunction with the absence of a satisfactory explanation, is problematic given the
importance of gathering sound validity evidence pertaining to psychological test use.
As such, a critical examination of this issue is warranted.
Finally, the findings of the current investigation are likely to have broad applicabil-
ity. Namely, although focused on personality testing, the discussion below is relevant
to measures of other commonly assessed attributes classified under the Uniform
Guidelines (1978) as “traits or constructs” (Section 14.C.1). Similarly, while we
address the Uniform Guidelines—which some argue are outdated (e.g., Jeanneret &
Zedeck, 2010) and further limited by their applicability to employee selection in the
United States—we believe the value of this discussion extends far beyond these guide-
lines. It is important to carefully consider appropriate validation strategies in all cir-
cumstances where psychological tests are used. Hence, the discussion presented herein
is likely to be of relevance for content-based validation efforts in other areas beyond
employee selection in the United States (e.g., educational testing, clinical practice,
international employee selection efforts).
Following a brief overview of validity and content-based validation, our investiga-
tion is organized around three fundamental questions. Question 1 asks whether current
standards and guidelines support the use of content validity evidence for validation of
personality test inferences in an employee selection context. Based on the concerns
raised above, a preliminary answer to this question is that it is unclear. Question 2 then
asks about the underlying bases of the inconsistency. Building on the identified causes
of disagreement, Question 3 asks how one might actually gather evidence based on test
content for personality measures. Ultimately, our goal in this effort is to reduce ambigu-
ity and promote clarity regarding content-based validation of personality measures.
Overview of Validity and Evidence Based on Test
Content
Broadly speaking, validity in measurement refers to how well an assessment device
measures what it is supposed to (Schmitt, 2006). The focus of measurement is typi-
cally described as a construct (Joint Standards, 1999/2014), which represents a latent
attribute on which individuals can vary (e.g., cognitive ability, diligence, interpersonal
skill, knowledge, the capacity to complete a given task). Importantly, a person’s level
or relative standing with regard to the construct of interest is inferred from the test
scores (SIOP Principles, 2003). As such, the notion of validity addresses the simple yet
fundamental issue of whether test scores actually reflect the attribute or construct that
the test is intended to measure. However, this succinct characterization of validity also
belies the true complexity of this topic (Furr & Bacharach, 2014). Two particular com-
plexities bear discussion in light of our current aims.
First, contemporary thinking holds that validity is not the property of a test per se,
but rather of the inferences made from test scores (Binning & Barrett, 1989; Furr &
Bacharach, 2014; Joint Standards, 1999/2014; Landy, 1986; SIOP Principles, 2003).
The value of this approach can be seen when the same test is used for two different
purposes—for example, when an interpersonal skills test developed for the selection
of sales personnel is used for hiring both sales representatives and accountants.
Notably, the test itself does not change, but the inferences made from the test scores
regarding the job performance potential of the applicants may be more or less valid
given the focal job in question. In accord with this perspective, the Joint Standards
(1999/2014) describe validity as “the degree to which evidence and theory support the
interpretations of test scores for proposed uses of the test” (p. 11). Inherent in this view
is the idea that validity is difficult to fully assess without a clear explication of the
intended interpretation of scores and corresponding purpose of testing. Thus, substan-
tiating relevant inferences in terms of the intended purpose of the test is of primary
concern in the contemporary view of validity.
Second, validity has come to be understood as a unitary concept, as compared with
the dated notion of distinct types of validity (Binning & Barrett, 1989; Furr &
Bacharach, 2014; Joint Standards, 1999/2014; Landy, 1986; SIOP Principles, 2003).
The older trinitarian view (Guion, 1980) posits three different types of validity, includ-
ing criterion-related, content, and construct validity, each relevant for different test
applications (Lawshe, 1985). By contrast, the more recent unitarian perspective
(Landy, 1986) emphasizes that all measurement attempts are ultimately about assess-
ing a target construct, and validation entails the collection of evidence to support the
argument that test scores actually reflect the construct (and that the construct is rele-
vant to the intended use of the test). Consistent with this latter perspective, the Joint
Standards (1999/2014) espouse a unitary view of validity and identify five sources of
validity evidence, including evidence based on test content, response processes, inter-
nal structure, relations to other variables, and consequences of testing. In summary, the
contemporary view of validity suggests that measurement efforts ultimately implicate
constructs, and different sources of evidence can be marshaled to substantiate the
validity of inferences based on test scores.
Drawing on the above discussion, evidence based on test content represents one of
several potential sources of evidence for validity judgments. The collection of content-
based evidence has become well-established as an important and viable validation
strategy, as can be seen in the common discussion and endorsement of content validity
in the academic literature (e.g., Aguinis et al., 2001; Binning & Barrett, 1989; Furr &
Bacharach, 2014; Haynes et al., 1995; Landy, 1986) as well as in legal, professional,
and technical standards or guidelines (e.g., Joint Standards, 1999/2014; SIOP
Principles, 2003; Uniform Guidelines, 1978). The specific manner in which evidence
based on test content can substantiate the validity of test score inferences is via an
informed and judicious examination of the match between the content of an assess-
ment tool (e.g., test instructions, item wording, response format) and the target con-
struct in light of the assessment purpose (Haynes et al., 1995). For the sake of simplicity
and ease of exposition, throughout this article, we use various terms interchangeably
to represent the concept of evidence based on test content, such as content validity
evidence, content validation strategy, content-based strategy, or simply content valid-
ity. However, each reference to this concept is intended to reflect contemporary think-
ing regarding validity as described above—specifically, content validity evidence is
not a separate “type” of validity but rather, a category of evidence that can be used to
substantiate the validity of inferences regarding test scores.
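The "informed and judicious examination" described above is often operationalized through structured ratings from subject matter experts, which can then be summarized numerically. One widely used index of this kind is Lawshe's content validity ratio (CVR); the formula and the worked numbers below are offered only as an illustration and are not drawn from this article:

\[
\mathrm{CVR} = \frac{n_e - N/2}{N/2}
\]

where \(N\) is the number of expert judges and \(n_e\) is the number who rate a given item as essential to the target construct or job domain. For example, if 8 of 10 judges rated a (hypothetical) conscientiousness item as essential, \(\mathrm{CVR} = (8 - 5)/5 = .60\). Values approach +1 as agreement that the item is essential increases, and fall to 0 or below when half or fewer of the judges endorse the item.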
Do Current Standards Support Content Validity for Personality?
Having introduced the concepts of validity and evidence based on test content, we now
turn to our primary purpose of discussing whether a content validation strategy should
be used as a means of validating personality test inferences for employee selection
purposes. In doing so, a preliminary question becomes whether current standards and
guidelines support this practice. The following four sources of guidance are considered: (a) the Uniform Guidelines (1978), (b) the SIOP Principles (2003), (c) the Joint Standards (1999/2014), and (d) a general review of relevant academic literature. A summary of
information derived from these sources is shown in Table 1.
The Uniform Guidelines (1978)
The Uniform Guidelines (1978) are federally endorsed standards pertaining to
employee selection procedures, which were jointly developed by the Equal
Employment Opportunity Commission, the Civil Service Commission, the Department
of Labor, and the Department of Justice in the United States. Regarding content vali-
dation, the guidelines state that,
Evidence of the validity of a test or other selection procedure by a content validity study
should consist of data showing that the content of the selection procedure is representative
of important aspects of performance on the job for which the candidates are to be
evaluated. (Section 5.B)
The guidelines go on to describe specific technical standards and requirements for
content validity studies. For example, a content validity study should include a review
of information about the job under consideration (Section 14.A; Section 14.C.2).
Furthermore, when the selection procedure focuses on work tasks or behaviors, it must
be shown that the selection procedure includes a representative sample of on-the-job
behaviors or work products (Section 14.C.1; Section 14.C.4). Conversely, under cer-
tain circumstances, the guidelines also permit content validation where the selection
procedure focuses on worker requirements or attributes, including knowledge, skills,
or abilities (KSAs). In such cases, beyond showing that the selection procedure reflects
a representative sample of the implicated KSA, it must additionally be documented
that the KSA is needed to perform important work tasks (Section 14.C.1; Section
14.C.4), and the KSA must be operationally defined in terms of observable work
behaviors (Section 14.C.4).
The above notwithstanding, the Uniform Guidelines (1978) explicitly prohibit con-
tent validity for tests focusing on traits or constructs, including personality (Section
14.C.1). The logic underlying this restriction appears to be based on the seemingly
reasonable notion that content-based validation becomes increasingly difficult as the
focus of the selection test is farther removed from actual work behaviors (Section
14.C.4; Landy, 1986; Lawshe, 1985). This logic was confirmed in a subsequent
“Questions and Answers” document, where it is stated that,
The Guidelines emphasize the importance of a close approximation between the content of
the selection procedure and the observable behaviors or products of the job, so as to minimize
the inferential leap between performance on the selection procedure and job performance
[emphasis added]. (See http://www.uniformguidelines.com/questionandanswers.html)
Table 1. Review of Various Sources Regarding Content Validity and Personality Testing.

Uniform Guidelines (1978)
Description of content validity: "Evidence of the validity of a test or other selection procedure by a content validity study should consist of data showing that the content of the selection procedure is representative of important aspects of performance on the job for which the candidates are to be evaluated" (Section 5.B).
Position on personality measures: Explicit prohibition related to the use of content validity for tests that focus on traits or constructs, such as personality: ". . . a content strategy is not appropriate for demonstrating the validity of selection procedures which purport to measure traits or constructs, such as intelligence, aptitude, personality, commonsense, judgment, leadership, and spatial ability" (Section 14.C.1).

SIOP Principles (2003)
Description of content validity: "Evidence for validity based on content typically consists of a demonstration of a strong linkage between the content of the selection procedure and important work behaviors, activities, worker requirements, or outcomes on the job" (p. 21).
Position on personality measures: Approval of a content validity approach for personality measures can be inferred from [1] the absence of an explicit prohibition against the use of content validity evidence for tests that focus on traits or constructs and [2] the stated scope of applicability for content-based evidence, which includes tests that focus on knowledge, skills, abilities, and other personal characteristics.

Joint Standards (1999/2014)
Description of content validity: "Important validity evidence can be obtained from an analysis of the relationship between the content of a test and the construct it is intended to measure" (p. 14).
Position on personality measures: Approval of a content validity approach for personality measures can be inferred from [1] the absence of an explicit prohibition against the use of content validity evidence for tests that focus on traits or constructs, [2] the explicit description of content validity as pertaining to "the relationship between the content of a test and the construct it is intended to measure" (p. 14), and [3] the broad definition of the term construct (see p. 217), which makes it clear that personality variables would fall under the definition of a construct.

General review of academic literature
Description of content validity: Most, if not all, descriptions of content validity found in the literature embody the core notion of documenting the linkage between the content of a test and a particular domain that represents the target of measurement and/or purpose of testing (Haynes et al., 1995).
Position on personality measures: The sources that specifically discuss this issue collectively indicate mixed opinions; while some authors have expressed reticence toward the use of content-based evidence for measures of personality (e.g., Goldstein et al., 1993; Lawshe, 1985; Wollack, 1976), others consider this restriction to be problematic (e.g., Landy, 1986; McDaniel et al., 2011) or view content validity as particularly relevant to personality testing (e.g., Murphy et al., 2009; O’Neill et al., 2009).

Note. SIOP = Society for Industrial and Organizational Psychology.
Interestingly, in an apparent application of this logic, the guidelines permit content
validation for selection procedures focusing on KSAs (as noted in the preceding para-
graph). In such cases, the inferential leap necessary to link KSAs to job performance
is ostensibly greater than if the selection procedures were to focus directly on work
behaviors, which explains why the guidelines include additional requirements related
to the content validation of tests focusing on these worker attributes (see Sections
14.C.1 and 14.C.4). Presumably, these additional requirements serve to bridge the
larger inferential leap made when the test does not directly focus on work behaviors.
Thus, the Uniform Guidelines do not limit the use of content validity to actual samples
of work behavior, but additional evidence is needed to help bridge the larger inferen-
tial leap made when selection tests target worker attributes (i.e., KSAs)—yet this same
reasoning is not extended to what the guidelines characterize as traits or constructs.
The SIOP Principles (2003)
The SIOP Principles (2003) embody the formal pronouncements of the Society for
Industrial and Organizational Psychology pertaining to appropriate validation and use
of employee selection procedures. For content validation, the principles state that,
“Evidence for validity based on content typically consists of a demonstration of a
strong linkage between the content of the selection procedure and important work
behaviors, activities, worker requirements, or outcomes on the job” (p. 21). Like the
Uniform Guidelines (1978), the SIOP Principles stress the importance of capturing a
representative sample of the target of measurement and further establishing a close
correspondence between the selection procedure and the work domain. The principles
also acknowledge that content validity evidence can be either “logical or empirical”
(p. 6), highlighting the role of job analysis and expert judgment in generating content-
based evidence. However, unlike the Uniform Guidelines, the SIOP Principles do not
make a substantive distinction between work tasks/behaviors and worker require-
ments/attributes in relation to content-based evidence but rather, collectively, consider
selection procedures that focus on “work behaviors, activities, and/or worker KSAOs”
(p. 21). Importantly, the addition of “O” to the KSA acronym represents “other per-
sonal characteristics,” which are generally understood to include “interests, prefer-
ences, temperament, and personality characteristics [emphasis added]” (Brannick
et al., 2007, p. 62). Accordingly, although not explicitly stated, the use of content
validity evidence as a means of validating personality test inferences for employee
selection purposes appears to be consistent with the SIOP Principles.
The Joint Standards (1999/2014)
The Joint Standards (1999/2014) are a set of guidelines for test development and valida-
tion in the areas of psychological and educational testing, which were developed by a
joint committee including representatives from the American Educational Research
Association, the American Psychological Association, and the National Council of
Measurement in Education. According to the standards, content validity is examined by
specifying the content domain to be measured and then conducting “logical or empirical
analyses of the adequacy with which the test content represents the content domain and of
the relevance of the content domain to the proposed interpretation of test scores” (p. 14).
In other words, content validity is described as pertaining to “the relationship between the
content of a test and the construct it is intended to measure” (p. 14), where “construct” is
defined as “The concept or characteristic that a test is designed to measure” (p. 217).
Because personality traits are easily understood as constructs, the Joint Standards suggest
that personality test inferences may be subject to content-based validation.
Academic Literature
It is also informative to examine the academic literature regarding validation and per-
sonality testing. In doing so, several general observations can be made. First, most if
not all definitions of content validity share the core notion of documenting the linkage
between the content of a test and a particular domain that represents the target of mea-
surement and/or purpose of testing (e.g., Aguinis et al., 2001; Goldstein et al., 1993;
Haynes et al., 1995; Sireci, 1998). Second, as noted previously, prominent writings on
personality testing in the workplace (e.g., Morgeson et al., 2007b; O’Neill et al., 2013;
Ones et al., 2007; Rothstein & Goffin, 2006; Tett & Christiansen, 2007) have tended
to ignore the applicability of content validation to personality measures. Third, the
sources that do specifically address this issue present mixed opinions. While some
have expressed reticence about content-based evidence for measures of personality
(e.g., Goldstein et al., 1993; Lawshe, 1985; Wollack, 1976), others consider this
restriction to be problematic (e.g., Landy, 1986; McDaniel et al., 2011) or view content
validity as particularly relevant to personality testing (e.g., Murphy et al., 2009;
O’Neill et al., 2009). Thus, as with the technical standards and guidelines discussed
above, those turning to the academic literature for guidance might similarly come
away uncertain regarding the use of content validity evidence to support personality
measures in an employee selection context.
What Are the Bases of Inconsistency?
This section attempts to identify the conceptual issues that form the bases for disagree-
ment/misunderstanding regarding the use of content validity evidence for personality
measures. Making these underlying matters explicit will help to identify some com-
mon ground and the potential for a way forward. Based on the review of documents
and literature above, the primary areas to be addressed include (a) vestiges of the trini-
tarian view of validity, (b) the focus of the content match, and (c) a clear understanding
of the inferences to be substantiated.
Vestiges of the Trinitarian View of Validity
Although it is now well-established …