Assessment Guide - Psychology
PLEASE READ THE INSTRUCTIONS ATTACHED FOR THE GUIDE.

Psychological assessment guides are created by psychology professionals to provide the public with accurate and authoritative information appropriate for their current needs. Information available to the public about psychological testing and assessment varies widely depending on the professional creating it, the purpose of the assessment, and the intended audience. When professionals effectively educate the public on the how, what, and why behind assessments, and on the strengths and limitations of commonly used instruments, potential clients are in a better position to be informed users of assessment products and services. The Assessment Guides developed in this course are designed to provide the lay public with accurate and culturally relevant information to aid them in making informed decisions about psychological testing. Students will develop their Guides with the goal of educating readers to be informed participants in the assessment process.

There is no required template for the development of the Assessment Guide. Students are encouraged to be creative while maintaining the professional appearance of the work. Although based on scholarly information, the Guide should not read like a research paper; it should be written like a brochure a professional might give a patient or client who is being referred for testing. The Guide must be reader-friendly (sixth- to ninth-grade reading level) and easy to navigate, and it must include a combination of text, images, and graphics to engage readers in the information provided. Throughout the Guide, provide useful examples and definitions as well as questions readers should ask their practitioners. To ensure accuracy, use only scholarly and peer-reviewed sources for the information in the development of the Guide.

Begin the Guide with a general overview of assessment, reasons for assessment referrals, and the importance of each individual's role in the process. Within each of the remaining sections, describe the types of assessments readers may encounter, the purposes of each type of assessment, the different skills and abilities the instruments measure, the most valid and reliable uses of the measures, and the limitations of the measures. Include a brief section describing the assessment process, the types of professionals who conduct the assessments, and what to expect during the assessment meetings.

The Assessment Guide must include the following sections:

Table of Contents (Portrait orientation must be used for the page layout of this section.) In this one-page section, list the following subsections and categories of assessments:
• Introduction and Overview
• Tests of Intelligence
• Tests of Achievement
• Tests of Ability
• Neuropsychological Testing
• Personality Testing
• Industrial, Occupational, and Career Assessment
• Forensic Assessment
• Special Topics (specify the student's choice from the Special Topics list)
• References

Section 1: Introduction and Overview (Portrait or landscape orientation may be used for the page layout of this section.) Begin the Guide with a general overview of assessment. In this two-page section, students will briefly address the major aspects of the assessment process. Students are encouraged to develop creative titles for these topics that effectively communicate their meanings to the intended audience.
• Definition of a Test (e.g., What Is a Test?): Briefly define psychological assessment.
• Types of Tests: Identify the major categories of psychological assessment.
• Reliability and Validity: Briefly define the concepts of reliability and validity as they apply to psychological assessment.
• Role of Testing and Assessment in the Diagnostic Process: Briefly explain the role of assessment in diagnosis.
• Professionals Who Administer Tests: Briefly describe the types of professionals involved in various assessment processes.
• Culture and Testing: Briefly describe issues of cultural diversity as they apply to psychological assessment.

Categories of Assessment (Portrait or landscape orientation may be used for the page layout of this section.) For each of the following, students will create a two-page information sheet or pamphlet to be included in the Assessment Guide. For each category of assessment, students will include the required content listed in the PSY640 Content for Testing Pamphlets and Information Sheets. Be sure to review the content requirements before completing each of the information sheets on the following categories of assessment.
• Tests of Intelligence
• Tests of Achievement
• Tests of Ability
• Neuropsychological Testing
• Personality Testing
• Industrial, Occupational, and Career Assessment
• Forensic Assessment
• Special Topics (Students will specify which topic they selected for this pamphlet or information sheet. Additional instructions are noted below.)

Special Topics (Student's Choice) In addition to the seven required categories of assessment listed above, students will develop an eighth information sheet or pamphlet that includes information targeted either at a specific population or at a specific issue related to psychological assessment not covered in one of the previous sections. Students may choose from one of the following categories:
• Testing Adolescents
• Testing Geriatric Patients
• Testing First-Generation Immigrants
• Testing in Rural Communities
• Testing English Language Learners
• Testing Individuals Who Are (Select one: Deaf, Blind, Quadriplegic)
• Testing Individuals Who Are Incarcerated
• Testing for Competency to Stand Trial

References (Portrait orientation must be used for the page layout of this section.)
https://content.bridgepointeducation.com/curriculum/file/7aa7d868-2fb0-4b7f-a892-e61ee1bbb9cd/1/PSY640%20Content%20for%20Testing%20Pamphlets%20and%20Information%20Sheets.pdf

Include a separate reference section that is formatted according to APA Style (7th edition). The reference list must consist entirely of scholarly sources. For the purposes of this assignment, assessment manuals, chapters from graduate-level textbooks, chapters from professional books, and peer-reviewed journal articles may be used as resource material. A minimum of 16 unique scholarly sources published within the last 10 years must be used within the Assessment Guide. The bulleted list of credible professional and/or educational online resources required for each assessment area will not count toward these totals.

The Assessment Guide
• Must be 18 pages in length (not including title and reference pages) and formatted according to APA Style (7th edition).
• Must include a separate title page with the following:
  ◦ Title of guide
  ◦ Student's name
  ◦ Course name and number
  ◦ Instructor's name
  ◦ Date submitted
• Must use at least 16 scholarly sources.
• Must document all sources in APA Style (7th edition).
• Must include a separate reference page that is formatted according to APA Style (7th edition).
• Must incorporate at least three different methods of presenting information (e.g., text, graphics, images, original cartoons).

Kumar, S., Kartikey, D., & Singh, T. (2021). Intelligence tests for different age groups and intellectual disability: A brief overview. Journal of Psychosocial Research, 16(1), 199–209. https://doi-org.proxy-library.ashford.edu/10.32381/JPR.2021.16.01.18 (For the Tests of Intelligence attachment)

Zucchella, C., Federico, A., Martini, A., Tinazzi, M., Bartolo, M., & Tamburin, S. (2018). Neuropsychological testing. Practical Neurology, 18(3), 227–237. (For the Neuropsychological Testing attachment)

Archer, R. P., Wheeler, E. M. A., & Vauter, R. A. (2016). Empirically supported forensic assessment. Clinical Psychology: Science and Practice, 23(4), 348–364. https://doi-org.proxy-library.ashford.edu/10.1111/cpsp.12171 (For the Forensic Assessment attachment)

Fisher, D. M., Milane, C. R., Sullivan, S., & Tett, R. P. (2021). A critical examination of content validity evidence and personality testing for employee selection. Public Personnel Management, 50(2), 232–257. https://doi-org.proxy-library.ashford.edu/10.1177/0091026020935582 (For the Personality Testing attachment)

Öz, H., & Özturan, T. (2018). Computer-based and paper-based testing: Does the test administration mode influence the reliability and validity of achievement tests? Journal of Language and Linguistic Studies, 14(1), 67–85. (For the Tests of Achievement attachment)

Han, K., Colarelli, S. M., & Weed, N. C. (2019).
Methodological and statistical advances in the consideration of cultural diversity in assessment: A critical review of group classification and measurement invariance testing. Psychological Assessment, 31(12), 1481–1496. https://doi-org.proxy-library.ashford.edu/10.1037/pas0000731 (For issues of cultural diversity as they apply to psychological assessment)

PSY640 Content for Testing Pamphlets and Information Sheets

For each category of assessment listed in the assignment, students will create two pages of information. The intended layout is either a two-page information sheet (front and back) or a two-sided tri-fold pamphlet of the kind that might be found in the office of a mental health professional. The presentation of the information within each pamphlet or brochure must incorporate at least three different visual representations of the information (e.g., text, graphics, images, original cartoons). For each pamphlet or information sheet, a minimum of three scholarly sources must be used, at least two of which must be peer-reviewed journal articles published within the last 10 years and obtained from the Ashford University Library. Some sources may be relevant for more than one category of assessment; therefore, it is acceptable to use relevant sources in more than one category. Remember that the language of each information sheet should be at the sixth- to ninth-grade reading level to allow a broad audience at various ages and levels of education to better understand each category of assessment.

For each category of assessment:
• Introduce and offer a brief, easy-to-understand definition of the broad assessment category being measured (e.g., What is intelligence? What is achievement? What is personality? What does "neuropsychological" mean? What does "forensic" mean?).
• Provide a brief overview of the types of tests commonly used within the category of assessment and explain what they measure. Compare the commonly used assessment instruments within the category.
• Describe appropriate and inappropriate uses of tests within the category of assessment. Explain why some tests are more appropriate for specific populations and purposes and which tests may be inappropriate. Analyze and describe the challenges related to assessing individuals from diverse social and cultural backgrounds. Evaluate the ethical interpretation of testing and assessment data as it relates to the test types within the category. Describe major debates in the field regarding different assessment approaches within the category. (For example, intellectual disability, formerly known as "mental retardation," cannot be determined by a single test; thus, an inappropriate use of an intelligence test would be to use it as the sole instrument to diagnose an intellectual disability.)
• Describe the format in which assessment results may be expected. Evaluate and explain the professional interpretation of testing and assessment data. Analyze the psychometric methodologies typically employed in the validation of the types of psychological testing within the category. Include information about the types of scores used to communicate assessment results, consistent with the tests being discussed (e.g., scaled scores, percentile rank, grade equivalent, age equivalent, standard age score, confidence interval).
• Explain the common terminology used in assessment in a manner that demystifies the professional jargon (e.g., in the course of discussing intelligence testing, students would define concepts such as IQ, categories of intelligence, and the classification labels used to describe persons with intellectual disabilities).
• Include a bulleted list of at least three credible professional and/or educational online resources where the reader can obtain more information about the various types of testing, in order to aid him or her in the evaluation and interpretation of testing and assessment data. No commercial websites may be used. Include the name of the organization that authored the web page, the title of the web page and/or document, and the URL. (These websites will not count toward the 12 scholarly resources required for the assignment.)

Intelligence Tests for Different Age Groups and Intellectual Disability: A Brief Overview
Subodh Kumar, Divye Kartikey and Tara Singh
Journal of Psychosocial Research, 16(1), 199–209. https://doi.org/10.32381/JPR.2021.16.01.18

ABSTRACT
From an evolutionary point of view, the one factor that helped humanity thrive and survive against all odds was the ability to use intelligence. Intelligence is what makes us unique among all the species in the world. The aim of this review paper was to discuss the role of intelligence tests in measuring intelligence in different age groups and in diagnosing intellectual disability. The reviewed papers reveal that intelligence is not a construct that can be measured only in adults; it can also be measured in newborns. Although IQ tests are used prominently in judging school performance, job performance, intellectual disability, and overall well-being, their measurement is affected by emotions, genetics, cultural background, and environmental factors. To improve the validity and accuracy of intelligence tests, it is important to take these factors into account.
Keywords: Intelligence, IQ, IQ tests, Intellectual Disability.

INTRODUCTION
Intelligence is the general cognitive ability to use attention and memory for learning and for developing ideas to solve problems. Modern science defines intelligence as the ability to think abstractly and critically, plan strategies, and solve problems. However, intelligence is a wide concept that also includes the ability to comprehend complex ideas, integrate information, adapt to situations, choose appropriate responses to stimuli, learn from experience, change the environment and one's own behaviour, overcome obstacles in life, and more. People differ from one another in intelligence and in their ability to carry out cognitive tasks, which can be due to various factors such as genetics, personality, or the complexity of the task (Colom et al., 2010).

Studying intelligence is important because it helps us understand the strengths, weaknesses, and unique abilities of an individual. Since intelligence is treated as a measurable quantity, there are currently many standardized tests in use that can measure intelligence with considerable accuracy and consistency and also predict the future performance of individuals of all ages.
The measured quantity of intelligence is called the IQ, or intelligence quotient, and tests that measure IQ are called IQ tests. The ratio IQ can be calculated with the formula IQ = (mental age / chronological age) × 100; for example, a child with a mental age of 12 and a chronological age of 10 would have an IQ of (12/10) × 100 = 120. Intelligence tests can be age scale or point scale. Age scale intelligence tests are based on the concept of calculating a mental age for diagnosing intellectual disability; the Seguin Form Board test is an example of an age scale intelligence test. Point scale intelligence tests are based on the concept of calculating the total points scored in the test. Most intelligence tests have both verbal and non-verbal questions; separate scores are generated for the verbal and non-verbal parts, and the combined score takes both into account (APA Dictionary of Psychology, n.d.).

The classification of intelligence is expanding, and new ways of analyzing it are coming forward. According to the traditional "investment" theory, intelligence can be classified into two main categories: fluid and crystallized. Fluid intelligence is the ability to use reasoning to solve novel problems in the absence of any prior specific knowledge. As we grow older, fluid intelligence tends to decrease, beginning in the late twenties; it is also influenced by genetics. Crystallized intelligence is the ability to use previously learned knowledge to solve problems, and it tends to increase as we grow older (Kaufman, 2013).

Intellectual disability: IQ tests are used not only to measure intelligence but also to diagnose intellectual disability. Intellectual disability is characterized by significant problems in an individual's cognitive functioning and social skills. In terms of IQ score, generally a score of less than 70 together with an inability to carry out age-appropriate day-to-day tasks amounts to intellectual disability (American Psychiatric Association, n.d.; Schalock et al., 2010).

RESEARCH OBJECTIVES
The aim of this study is to discuss the role of intelligence tests in measuring intelligence in different age groups and in diagnosing intellectual disability.

METHODOLOGY
Online databases (NCBI, PubMed, PsycINFO, PsycNET, Frontiers in Psychology, Google Scholar, ResearchGate) and websites were searched for papers published in English related to intelligence, intelligence tests, and intellectual disability. Twenty-seven articles (1994-2020) were identified and reviewed.

INTELLIGENCE TESTS

APGAR (Appearance, Pulse, Grimace, Activity, and Respiration)
The APGAR test is a rapid test to evaluate the physiological condition of neonates at the time of birth and to determine the level of medical attention needed in case resuscitation is required. The test evaluates a newborn on five parameters: appearance, pulse, grimace, activity, and respiration. Low scores of 1 or 0 in each category are given if the newborn has a pale or blue appearance, a heart rate of less than 100 beats per minute or no heartbeat, no grimace, cough, or crying on stimulation, reduced or absent muscle activity, and slow, irregular, or absent respiration. High scores of 2 in each category are given if the newborn has a mostly pink appearance, a heart rate of more than 100 beats per minute, grimacing, crying, and coughing on stimulation, active muscle movement, and healthy, normal breathing. The final score out of 10 is recorded at 1 minute and at 5 minutes after birth.
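To make the scoring arithmetic concrete, here is a minimal sketch (not part of the original article; the function name and example values are illustrative only) of how the five component ratings described above combine into the total out of 10, using the cut-off labels given in the next sentence.

```python
# Minimal illustrative sketch, not from the article: summing the five APGAR
# component ratings (each 0, 1, or 2) into a total out of 10 and labelling it
# with the cut-offs cited in the text (<6 low, >=7 good; a total of 6 is not
# classified in the text). Names and example values are hypothetical.

def apgar_total(appearance, pulse, grimace, activity, respiration):
    """Return the APGAR total (0-10) from five component ratings of 0-2."""
    components = [appearance, pulse, grimace, activity, respiration]
    if any(rating not in (0, 1, 2) for rating in components):
        raise ValueError("Each APGAR component is rated 0, 1, or 2.")
    return sum(components)

# Example: ratings recorded for one newborn at 1 minute after birth.
total = apgar_total(appearance=2, pulse=2, grimace=1, activity=2, respiration=2)
if total >= 7:
    label = "good"
elif total < 6:
    label = "low"
else:
    label = "borderline (not classified in the text)"
print(total, label)  # 9 good
```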
Scores of less than 6 are considered low, whereas scores of 7 or higher are considered good (Simon, 2021). The APGAR score is not only an indication of a newborn's health at birth but can also predict IQ at later stages of life. A study in the UK found that infants with low APGAR scores at birth seem to have a higher risk of having a low IQ by the age of 18 (Odd, 2008).

Stanford-Binet Intelligence Test (SB)
The Stanford-Binet intelligence scale is used to measure both intelligence and intellectual disability. Its 5th edition is currently in use and can be administered to people aged 2 to 89. Cognitive abilities such as fluid reasoning, knowledge, quantitative reasoning, visual/spatial reasoning, and working memory are measured in both verbal and non-verbal formats. In total there are 10 tests: for every cognitive ability there is a verbal and a non-verbal test (Marom, 2018).

Wechsler Preschool and Primary Scale of Intelligence (WPPSI)
The fourth edition of the test, WPPSI-IV, is currently in use. It contains 14 subtests administered to children. Children aged 2 years 6 months to 3 years 11 months are given the following subtests: block design (constructing a model with coloured blocks), information (answering general questions), object assembly (fitting pieces of a puzzle into a standard arrangement), picture naming (naming a picture shown), and receptive vocabulary (pointing out the correct picture based on a word spoken aloud). Children aged 4 years to 7 years 7 months are administered the five subtests above along with subtests such as animal coding, comprehension, matrix reasoning, picture completion, picture concepts, word reasoning, vocabulary, symbol search, and similarities (Slattery, 2015).

Wechsler Intelligence Scale for Children (WISC)
The 5th edition of the WISC, WISC-V, is currently in use. It measures five indexes: the visual spatial index, which measures a child's ability to process visual and spatial information such as geometrical figures, the fluid reasoning index, the working memory index, the processing speed index, and the verbal comprehension index. The individual scores from all of these indexes are combined to form a full scale intelligence quotient (Flanagan et al., 2010).

Kaufman Assessment Battery for Children (K-ABC)
The Kaufman intelligence test was developed in 1983 by Alan S. Kaufman and Nadeen L. Kaufman. The test was based on Luria's neuropsychological theory of sequential and simultaneous cognitive processing and contains four scales: (1) the sequential processing scale, (2) the simultaneous processing scale, (3) the achievement scale, and (4) the mental processing scale. The sequential processing scale measures short-term memory through problem solving related to sequential or order placement. The simultaneous processing scale measures the ability to solve problems by processing several pieces of information simultaneously. The achievement scale measures the application of learned skills to practical problems, and the mental processing scale measures the ability to solve problems using both sequential and simultaneous processing (Marom, 2021).

Kaufman Adolescent and Adult Intelligence Test (KAIT)
This test measures fluid intelligence and crystallized intelligence. It can be administered to people from age 11 to 85. KAIT has a core battery and an extended battery.
The core battery has six subtests and takes a total of 65 minutes; it measures parameters such as crystallized intelligence (Gc), fluid intelligence (Gf), and composite intelligence. The extended battery has four additional subtests beyond those of the core battery and takes 90 minutes to complete (Fahmy, 2021).

Wechsler Adult Intelligence Scale (WAIS)
The 4th edition of the WAIS, WAIS-IV, is currently in use. In this test, four major components of intelligence are measured:
1) Verbal Comprehension Index: subtests include Similarities, Vocabulary, Information, and Comprehension.
2) Perceptual Reasoning Index: subtests include Block Design, Matrix Reasoning, Visual Puzzles, Picture Completion, and Figure Weights.
3) Working Memory Index: subtests include Digit Span, Arithmetic, and Letter-Number Sequencing.
4) Processing Speed Index: subtests include Symbol Search, Coding, and Cancellation.
The scores from all the subtests of all the indexes give the full scale IQ score, whereas the scores of the Similarities, Vocabulary, and Information subtests from the Verbal Comprehension Index, together with the scores of the Block Design, Matrix Reasoning, and Visual Puzzles subtests from the Perceptual Reasoning Index, give the score on the General Ability Index (Cherry, 2020).

Woodcock-Johnson Tests of Cognitive Abilities
The Woodcock-Johnson test of cognitive abilities was developed in 1977 by Richard Woodcock and Mary E. Bonner Johnson. Its 4th version is currently in use, and it can be administered from a child of age two to an adult of age 90. The test is based on the Cattell-Horn-Carroll theory of intelligence, which focuses on nine main cognitive abilities: comprehension-knowledge, fluid reasoning, quantitative knowledge, reading and writing ability, short-term memory, long-term storage and retrieval, visual processing, auditory processing, and processing speed (Hamour et al., 2012).

Raven's Progressive Matrices Test
Raven's progressive matrices test was developed by John C. Raven in 1939. It is designed to measure the reasoning ability of children aged 5.5 to 11.5 years, adults, and senior citizens. The questions are presented as matrices in which a pattern has to be worked out in order to identify the missing element. The difficulty level increases as the test progresses. There are three types: the Standard, Coloured, and Advanced Raven's progressive matrices tests. The standard questions are presented in a black and white pattern, and the advanced version is suitable for adults who have high intelligence (Thiel, 2020).

Bhatia Battery of Performance Tests of Intelligence
This test was developed by C. M. Bhatia in 1955. The battery is applicable to illiterate as well as literate groups, with separate norms provided for each group. While the test was originally developed for the 11-16 years age group, its use with adults beyond 16 years of age is based on the assumption that intelligence does not increase beyond 16 years of age, which was set as the upper limit of the test (Barnabas & Rao, 1994). The test contains five subtests: Koh's block design test, the Alexander pass-along test, the pattern drawing test, the immediate memory test, and the picture construction test (Roopesh, 2020).
Binet Kamat Test of Intelligence
This test is the Indian adaptation of the Stanford-Binet intelligence test, developed by Kamat for the age group of 3 years to 22 years. It is used widely in diagnosing intellectual disability. The test is administered by increasing the difficulty level of the questions until the child fails to solve them; questions below the child's chronological age are also asked until the child solves all of them successfully. The test contains questions related to vocabulary, language, analogies, social reasoning, similarities and differences, and visual motor ability (Roopesh, 2020).

Seguin Form Board Test
The Seguin form board test was developed by Seguin in 1856 for the age group of 3 years to 15 years. The test measures parameters such as visual ability, eye-hand coordination, non-verbal ability, psychomotor ability, and spatial ability. There is a form board with 10 slots of different shapes, and the participant has to place every block in the right slot in the best possible time. The test usually takes about 10 minutes to administer. The participant gets three trials, and the best time among them is noted to calculate the mental age (Koshy, 2017).

DISCUSSION
IQ tests and overall well-being: Studies have pointed out that there are gaps in the interpretation of IQ results. IQ tests focus only on selected mental abilities, which are useful mainly for school admission, college admission, and job roles. Research has found a weak correlation between IQ tests and overall individual well-being. The solution is not to limit the importance of IQ tests but rather to define intelligence in an inclusive way and not to inflate the interpretation of IQ tests (Ganuthula, 2019).

IQ tests and emotions: Many studies have shown that motivation and emotions determine an individual's performance on an IQ test. A test taker's motivation is affected by positive and negative emotions. Positive emotions such as curiosity, resilience, and courage boost the performance of the test taker, whereas negative emotions such as fear and test anxiety negatively affect performance on the IQ test (Ganuthula, 2019).

Intellectual disability and IQ tests: A study was done on cognitive decline among children suffering from medulloblastoma, a nervous system tumour. The disease is known to cause a decline in information processing speed and visual motor tasks. Many patients answered logically correctly on the tests, but because they were slow in processing information, their IQ score on the Wechsler scale, which takes such factors into account when generating the IQ score, was low. Hence it is important that a proper neuropsychological assessment is done rather than relying on a single IQ score (Wegenschimmel, 2017). Colmar et al. (2006) found that there are psychometric issues in IQ tests and that IQ tests have limitations in diagnosing intellectual disability. Most IQ tests are designed for typically developing people, and the tasks in these tests may not be age appropriate for the intellectually disabled.

IQ tests and school performance: IQ tests are used in schools for the intellectual assessment of students. However, IQ scores do not explain the reason a student is lagging in a specific academic task. For that, specific cognitive abilities need to be judged along with IQ scores to come to the right conclusion (Ortiz & Lella, 2010).
IQ tests and job performance: The use of IQ tests in predicting job performance is often cited as evidence of their usefulness. However, the problem with using IQ tests to predict job performance is that job appraisals are often subject to biases such as halo bias and may not correlate with IQ score (Richardson, 2015).

IQ tests and genetics: Studies on twins have found that IQ is heritable and that genes influence intelligence. Like all other traits, intelligence is inherited, but never 100%. The extent to which intelligence is inherited depends on the environment; an unfavourable environment will never let genes fully express the qualities related to intelligence. Various cross-sectional and longitudinal studies have shown that the influence of genes on intelligence increases gradually from infancy to adulthood (Plomin, 2015).

CONCLUSION
As the world becomes more knowledge oriented and skill driven, intelligence and its measurement are becoming an important field of study. IQ tests are the most common method of intellectual assessment for all ages, and for diagnosing intellectual disability they are the most objective tool available. Today, intelligence tests play a vital role in determining an individual's future. Their role in diagnosing intellectual disability is also important for choosing the right therapeutic approaches. Since their first use, IQ tests have gone through many modifications, updates, and adaptations, and more will be added as knowledge of intelligence and intellectual disability increases. The papers we reviewed show that intelligence can be measured at all ages, from childhood to old age. Intelligence tests such as the APGAR can measure the intelligence of a newborn and also predict the future course of intellectual development in later life. Measuring intelligence and intellectual disability is a complicated task, and although IQ tests are comprehensive, they cannot throw light on every aspect of one's intelligence. The roles of genetics, environment, culture, and emotions in intelligence may not be reflected in IQ tests. It is important that the validity of IQ tests be improved so that interpretation and prediction of performance can be made more accurate. The more knowledge we have about IQ tests and their limitations, the more they can be improved.

Implications
On the basis of our study, we find that the focus should be on filling the gaps in the interpretation of IQ scores so that more realistic predictions of future performance can be made. These gaps in the interpretation of IQ scores can result in errors of judgement. For example, a high score on an IQ test may not necessarily translate into good performance in areas of life that demand high emotional regulation for achieving desirable results. Since IQ scores depend on many factors that can affect cognition during the test, such as motivation, emotional state, and environmental factors like family stress, it is important that these factors be incorporated when interpreting performance on IQ tests. Our study has also found that, for diagnosing intellectual disability, it is important to focus on adaptive skills rather than just IQ scores.

Contribution to the Existing Literature
Our review article comprehensively covers important intelligence tests for all age groups, which we did not find in any other paper.
Our paper will be a ready reference for counsellors, teachers, and professionals who want a comprehensive overview of intelligence tests for different age groups, starting from the birth of a child.

REFERENCES
American Psychiatric Association. (n.d.). What is intellectual disability? https://www.psychiatry.org/patients-families/intellectual-disability/what-is-intellectual-disability
APA Dictionary of Psychology. (n.d.). Point scale. https://dictionary.apa.org/point-scale
Barnabas, I. P., & Rao, S. (1994). Comparison of two short forms of the Bhatia's Test of Intelligence. NIMHANS Journal, 12(1), 75–77.
Boat, T. F., & Wu, J. T. (2015). Mental disorders and disabilities among low-income children: Clinical characteristics of intellectual disabilities. Washington, DC: National Academies Press (US). Retrieved April 8, 2021, from https://www.ncbi.nlm.nih.gov/books/NBK332877/
Cherry, K. (2020). The Wechsler Adult Intelligence Scale. Verywell Mind. Retrieved April 12, 2021, from https://www.verywellmind.com/the-wechsler-adult-intelligence-scale-2795283
Colmar, S., Maxwell, A., & Miller, L. (2006). Assessing intellectual disability in children: Are IQ measures sufficient, or even necessary? Australian Journal of Guidance and Counselling, 16(2), 177–188. doi:10.1375/ajgc.16.2.177
Colom, R., Karama, S., Jung, R. E., & Haier, R. J. (2010). Human intelligence and brain networks. Dialogues in Clinical Neuroscience, 12(4), 489–501. https://doi.org/10.31887/DCNS.2010.12.4/rcolom
Fahmy, A. (2021). Kaufman Adolescent and Adult Intelligence Test. The Gale Encyclopedia of Mental Health. Retrieved April 13, 2021, from Encyclopedia.com: https://www.encyclopedia.com/medicine/encyclopedias-almanacs-transcripts-and-maps/kaufman-adolescent-and-adult-intelligence-test
Flanagan, D. P., Alfonso, V. C., Mascolo, J., & Hale, J. (2010). The Wechsler Intelligence Scale for Children – Fourth Edition in neuropsychological practice.
Ganuthula, V., & Sinha, S. (2019). The looking glass for intelligence quotient tests: The interplay of motivation, cognitive functioning, and affect. Frontiers in Psychology, 10, 2857. https://doi.org/10.3389/fpsyg.2019.02857
Girimaji, S. C., & Pradeep, A. J. (2018). Intellectual disability in International Classification of Diseases-11: A developmental perspective. Indian Journal of Social Psychiatry, 34(Suppl. 1), 68–74. https://www.indjsp.org/article.asp?issn=0971-9962;year=2018;volume=34;issue=5;spage=68;epage=74;aulast=Girimaji
Hamour, B., Hmouz, H., Mattar, J., & Muhaidat, M. (2012). The use of Woodcock-Johnson tests for identifying students with special needs: A comprehensive literature review. Procedia - Social and Behavioral Sciences. https://doi.org/10.1016/j.sbspro.2012.06.714
Kaufman, S. B. (2013). The heritability of intelligence: Not what you think. Beautiful Minds, Scientific American. Retrieved April 8, 2021, from https://blogs.scientificamerican.com/beautiful-minds/the-heritability-of-intelligence-not-what-you-think/
Koshy, B., Thomas, H., Samuel, P., Sarkar, R., Kendall, S., & Kang, G. (2017). Seguin Form Board as an intelligence tool for young children in an Indian urban slum. Family Medicine and Community Health, 5. https://doi.org/10.15212/FMCH.2017.0118
Marom, J. P. (2021).
Kaufman Assessment Battery for Children. The Gale Encyclopedia of Mental Health. Retrieved April 10, 2021, from Encyclopedia.com: https://www.encyclopedia.com/medicine/encyclopedias-almanacs-transcripts-and-maps/kaufman-assessment-battery-children
Marom, J. P. Stanford-Binet Intelligence Scale. Gale Encyclopedia of Mental Disorders. Retrieved April 10, 2021, from Encyclopedia.com: https://www.encyclopedia.com/psychology/encyclopedias-almanacs-transcripts-and-maps/stanford-binet-intelligence-scale
Odd, D. E., Rasmussen, F., Gunnell, D., & Lewis, G. (2008). A cohort study of low Apgar scores and cognitive outcomes. Archives of Disease in Childhood - Fetal and Neonatal Edition, 93, F115–F120. http://dx.doi.org/10.1136/adc.2007.123745
Ortiz, S., & Lella, S. (2010). Intellectual ability and assessment: A primer for parents and educators. National Association of School Psychologists. Retrieved April 8, 2021, from https://apps.nasponline.org/search-results.aspx?q=intellectual+assessment
Plomin, R., & Deary, I. (2015). Genetics and intelligence differences: Five special findings. Molecular Psychiatry, 20, 98–108. https://doi.org/10.1038/mp.2014.105
Richardson, K., & Norgate, S. H. (2015). Does IQ really predict job performance? Applied Developmental Science, 19(3), 153–169. https://doi.org/10.1080/10888691.2014.983635
Roopesh, B. (2020). Bhatia's Battery of Performance Tests of Intelligence: A critical appraisal. Indian Journal of Mental Health, 7, 289–306.
Roopesh, B. (2020). Binet Kamat Test of Intelligence: Administration, scoring and interpretation – an in-depth appraisal. Indian Journal of Mental Health, 7, 180–201.
Ruhl, C. (2020, July 16). Intelligence: Definition, theories and testing. Simply Psychology. Retrieved April 9, 2021, from https://www.simplypsychology.org/intelligence.html
Schalock, R. L., Borthwick-Duffy, S. A., Bradley, V. J., Buntinx, W. H. E., Coulter, D. I., & Craig, E. M. (2010). Intellectual disability: Definition, classification, and systems of support. Washington, DC: AAIDD.
Silverman, W., Miezejeski, C., Ryan, R., Zigman, W., Krinsky-McHale, S., & Urv, T. (2010). Stanford-Binet and WAIS IQ differences and their implications for adults with intellectual disability (aka mental retardation). Intelligence, 38(2), 242–248. https://doi.org/10.1016/j.intell.2009.12.005
Simon, L. V., Hashmi, M. F., & Bragg, B. N. (2021). APGAR score. StatPearls Publishing. Retrieved April 9, 2021, from https://www.ncbi.nlm.nih.gov/books/NBK470569/
Slattery, V. (2015). Private school testing: What's the WPPSI-IV? How do schools use it? Retrieved April 12, 2021, from https://blog.lowellschool.org/blog/private-school-testing-what-is-it-how-do-schools-use-it
Thiel, E. (2020). Raven's progressive matrices test. 123test. Retrieved April 10, 2021, from https://www.123test.com/raven-s-progressive-matrices-test/
Wegenschimmel, B., Leiss, U., Veigl, M., Rosenmayr, V., Formann, A., Slavc, I., & Pletschko, T. (2017). Do we still need IQ-scores? Misleading interpretations of neurocognitive outcome in pediatric patients with medulloblastoma: A retrospective study. Journal of Neuro-Oncology, 135(2), 361–369. https://doi.org/10.1007/s11060-017-2582-x

ABOUT THE AUTHORS
Subodh Kumar, Research Scholar – Department of Psychology, Banaras Hindu University, Varanasi, U.P., India.
Divye Kartikey, Student – Discipline of Psychology, IGNOU, New Delhi, India.
Tara Singh, Professor – Department of Psychology, Banaras Hindu University, Varanasi, U.P., India.

Neuropsychological Testing
Chiara Zucchella, Angela Federico, Alice Martini, Michele Tinazzi, Michelangelo Bartolo, Stefano Tamburin
Neurology Unit, Verona University Hospital, Verona, Italy; Department of Neurosciences, Biomedicine and Movement Sciences, University of Verona, Verona, Italy; School of Psychology, Keele University, Staffordshire, United Kingdom; Department of Rehabilitation, Neurorehabilitation Unit, Habilita, Zingonia (BG), Italy
Corresponding author: Stefano Tamburin, MD, PhD, Department of Neurosciences, Biomedicine and Movement Sciences, University of Verona, Piazzale Scuro 10, I-37134 Verona, Italy. Email: [email protected]

Abstract
Neuropsychological testing is a key diagnostic tool for assessing people with dementia and mild cognitive impairment, but can also help in other neurological conditions such as Parkinson's disease, stroke, multiple sclerosis, traumatic brain injury, and epilepsy. While cognitive screening tests offer gross information, detailed neuropsychological evaluation can provide data on different cognitive domains (visuo-spatial function, memory, attention, executive function, language, praxis) as well as neuropsychiatric and behavioural features. We should regard neuropsychological testing as an extension of the neurological examination applied to higher-order cortical function, since each cognitive domain has an anatomical substrate. Ideally, neurologists should discuss the indications and results of neuropsychological assessment with a clinical neuropsychologist. This paper summarises the rationale, indications, main features, most common tests, and pitfalls in neuropsychological evaluation.

Neuropsychological testing explores cognitive functions to obtain information on the structural and functional integrity of the brain, and to score the severity of cognitive damage and its impact on daily life activities. It is a core diagnostic tool for assessing people with mild cognitive impairment, dementia and Alzheimer's disease,[1] but is also relevant in other neurological diseases such as Parkinson's disease,[2] stroke,[3,4] multiple sclerosis,[5] traumatic brain injury,[6] and epilepsy.[7] Given the relevance and extensive use of neuropsychological testing, it is important that neurologists know when to request a neuropsychological evaluation and how to understand the results. Neurologists and clinical neuropsychologists in tertiary centres often discuss complex cases, but in smaller hospitals and in private practice this may be more difficult. This paper presents information on neuropsychological testing in adult patients, and highlights common pitfalls in its interpretation.
A very recent paper published in the February 2018 issue of Practical Neurology focused on neuropsychological assessment in epilepsy.[7]

NEUROPSYCHOLOGICAL TESTING AND ITS CLINICAL ROLE
Why is neuropsychological testing important? From early in their training, neurologists are taught to collect information on a patient's symptoms, and to perform a neurological examination to identify clinical signs. They then collate symptoms and signs into a syndrome, to identify a lesion in a specific site of the nervous system, and this guides further investigations. Since cognitive symptoms and signs suggest damage to specific brain areas, comprehensive cognitive assessment should also be part of the neurological examination. Neuropsychological testing may be difficult to perform during office practice or at the bedside, but the data obtained nevertheless can clearly complement the neurological examination.

When is neuropsychological testing indicated and useful? Neuropsychological assessment is indicated when detailed information about cognitive function will aid clinical management:
• to assess the presence or absence of deficits and to delineate their pattern and severity
• to help to establish a diagnosis (e.g., Alzheimer's disease or fronto-temporal dementia) or to distinguish a neurodegenerative condition from a mood disorder (e.g., depression or anxiety)
• to clarify the cognitive effects of a known neurological condition (multiple sclerosis, stroke or brain injury).

Neuropsychological testing may address questions about cognition to help guide a (differential) diagnosis, obtain prognostic information, monitor cognitive decline, control the regression of cognitive–behavioural impairment in reversible diseases, guide the prescription of a medication, measure the treatment response or adverse effects of a treatment, define a baseline value to plan cognitive rehabilitation, or provide objective data for medico-legal situations (Box 1). When requesting a neuropsychological assessment, neurologists should mention any previous testing and attach relevant reports, so that the neuropsychologist has all the available relevant information.

Conversely, there are situations when cognitive evaluation should not be routinely recommended, e.g., when the patient is too severely affected, the diagnosis is already clear, testing may cause the patient distress and/or anxiety, the patient has only recently undergone neuropsychological assessment, there is only a low likelihood of an abnormality (though the test may still bring reassurance), and when there are neuropsychiatric symptoms (Table 1). Neuropsychological assessment is time-consuming (1–2 hours) and demanding for the patient, and so neurologists must carefully select subjects for referral.

How is neuropsychological testing done? Neuropsychological evaluation requires a neurologist or a psychologist with documented experience in cognitive evaluation (i.e., a neuropsychologist). The clinician starts with a structured interview, then administers tests and questionnaires (Table 2), and then scores and interprets the results.
• The interview aims to gather information about the medical and psychological history, the severity and progression of cognitive symptoms, their impact on daily life, the patient's awareness of their problem, and their attitude, mood, spontaneous speech, and behaviour.
• Neuropsychological tests are typically presented as 'pencil and paper' tasks; they are intrinsically performance based, since patients have to prove their cognitive abilities in the presence of the examiner. The tests are standardised, and so the procedures, materials, and scoring are consistent. Therefore, different examiners can use the same methods at different times and places, and still reach the same outcomes.
• The scoring and analysis of the test results allow the clinician to identify any defective functions, and to draw a coherent cognitive picture. The clinician should note any associations and dissociations in the outcomes, and compare these with data derived from the interview (including observation of the patient), the neuroanatomical evidence, and theoretical models, to identify a precise cognitive syndrome.

What information can neuropsychological testing offer? Neuropsychological assessment provides general and specific information about cognitive performance.

Brief cognitive screening tools, such as the Mini-Mental State Examination (MMSE), the Montreal Cognitive Assessment (MoCA), and the Addenbrooke's Cognitive Examination (ACE-R), provide a quick and easy global, although rough, measure of a person's cognitive function,[8,9] when more comprehensive testing is not practical or available. Table 3 gives the most common cognitive screening tests, along with scales for measuring neuropsychiatric and behavioural problems and their impact on daily life. This type of screening test may suffice in some cases, e.g., when the score is low and the patient's history strongly suggests dementia, or for staging and following up cognitive impairment with repeated testing. However, neurologists should be aware of the limitations of such cognitive screening tools. Their omission of some subdomains may result in poor sensitivity; e.g., the MMSE may give false negative findings in Parkinson's disease-related mild cognitive impairment because it does not sufficiently explore the executive functions that are the first cognitive subdomains to be involved in Parkinson's disease. The MMSE is particularly feeble in assessing patients with fronto-temporal dementia, many of whom score within the 'normal' range on the test, yet cannot function in social or work situations.[10] Also, young patients with a high level of education may have normal screening tests because these are too easy and poorly sensitive to mild cognitive alterations. Such patients therefore need a thorough assessment.

A comprehensive neuropsychological evaluation explores several cognitive domains (perception, memory, attention, executive function, language, motor and visuo-motor function). The areas and subdomains addressed in the neuropsychological examination and the tests chosen depend upon the referral clinical question, the patient's and caregiver's complaints and symptoms, and the information collected during the interview. Observations made during test administration may guide further exploration of some domains and subdomains. Failure in a single test does not imply the presence of cognitive impairment, since it may have several causes (e.g., reduced attention in patients with depression). Also, single tests are designed to explore a specific domain or subdomain preferentially, but most of them examine multiple cognitive functions (e.g., the clock drawing test, Table 4).
For these reasons, neuropsychological assessment is performed as a battery, with more than one test for each cognitive domain.

The main cognitive domains with their anatomical bases are reviewed below; Table 4 summarises the most widely used cognitive tests for each domain. The neuropsychologist chooses the most reliable and valid test according to the clinical question, the neurological condition, the age, and other specific factors.

Parallel forms (alternative versions using similar material) may reduce the learning effect from repeated evaluations. They may help to track cognitive disorders over time, to stage disease severity, and to measure the effect of pharmacological or rehabilitative treatment.

MAIN COGNITIVE DOMAINS AND THEIR ANATOMICAL BASES
Most cognitive functions involve networks of brain areas.[11] Our summary below is not intended as an old-fashioned or phrenological view of cognition, but rather to provide rough clues on where the brain lesion or disease may be.

Perception. This process allows recognition and interpretation of sensory stimuli. Perception is based on the integration of processing from peripheral receptors to cortical areas ('bottom-up'), and a control ('top-down') to modulate and gate afferent information based on previous experiences and expectations. According to a traditional model, visual perception involves a ventral temporo-occipital pathway for object and face recognition, and a dorsal parieto-occipital pathway for perception and movement in space.[12] Acoustic perception involves temporal areas.

Motor control. The classical neurological examination involves evaluation of strength, coordination, and dexterity. Neuropsychological assessment explores other motor features ranging from speed to planning. Visuo-motor ability requires integration of visual perception and motor skills and is usually tested by asking the subject to copy figures or perform an action. Apraxia is a higher-order disorder of voluntary motor control, planning and execution characterised by difficulty in performing tasks or movements when asked, and not due to paralysis, dystonia, dyskinesia, or ataxia. The traditional model divides apraxia into ideomotor (i.e., the patient can explain how to perform an action, but cannot imagine it or perform it when required) and ideational (i.e., the patient cannot conceptualise an action, or complete the correct motor sequence).[13] However, in clinical practice, there is limited practical value in distinguishing ideomotor from ideational apraxia; see the recent review in this journal.[14,15] Apraxia can be explored during routine neurological examination, but neuropsychological assessment may offer a more detailed assessment. Motor control of goal-orientated voluntary tasks depends on the interplay of the limbic and associative cortices, basal ganglia, cerebellum, and motor cortices.

Memory. Memory and learning are closely related. Learning involves acquiring new information, while memory involves retrieving this information for later use. An item to be remembered must first be encoded, then stored, and finally retrieved. There are several types of memory. Sensory memory, the ability briefly to retain impressions of sensory information after the stimulus has ended, is the fastest memory process.
It represents an essential step for storing information in short-term memory, which lasts for a few minutes without being placed into permanent memory stores. Working memory allows information to be temporarily stored and managed when performing complex cognitive tasks such as learning and reasoning. Therefore, short-term memory involves only storage of the information, whilst working memory allows actual manipulation of the stored information. Finally, long-term memory, the storage of information over an extended period of time, can be subdivided into implicit memory (unconscious/procedural; e.g., how to drive a car) and explicit memory (intentional recollection; e.g., a pet's name). Within explicit memory, episodic memory refers to past experiences that took place at a specific time and place, and can be accessed by recall or by recognition. Recall implies retrieving previously stored information, even if it is not currently present. Recognition refers to the judgment that a stimulus presented has previously occurred.

The neuroanatomical bases of memory are complex.[16] The initial sensory memory includes the areas of the brain that receive visual (occipital cortex), auditory (temporal cortex), and tactile or kinesthetic (parietal cortex) information. Working memory links to the dorsolateral prefrontal cortex (involved in monitoring information) and the ventrolateral prefrontal cortex (involved in maintaining the information). Long-term memory requires a consolidation of information through a chemical process that allows the formation of neural traces for later retrieval. The hippocampus is responsible for early storage of explicit memory; the information is then transmitted to a larger number of brain areas.

Attention. Attention includes the ability to respond discretely to specific stimuli (focused attention), to maintain concentration over time during continuous and repetitive tasks (sustained attention), to attend selectively to a specific stimulus while filtering out irrelevant information (selective attention), to shift the focus between two or more tasks with different cognitive requirements (alternating attention), and to perform multiple tasks simultaneously (divided attention). Spatial neglect refers to failure to control the spatial orientation of attention, and consequently the inability to respond to stimuli.[17] The occipital lobe is responsible for visual attention, while visuo-spatial analysis involves both the occipital and parietal lobes. Attention to auditory stimuli requires functioning of the temporal lobes, especially the dominant (usually left) one for speech. Complex features of attention require the anterior cingulate and frontal cortices, the basal ganglia and the thalamus.

Executive functions. Executive functions include complex cognitive skills, such as the ability to inhibit or resist an impulse, to shift from one activity or mental set to another, to solve problems or to regulate emotional responses, to begin a task or activity, to hold information in mind while completing a task, to plan and organise current and future tasks, and to monitor one's own performance.[18] Taken together, these skills are part of a supervisory or meta-cognitive system to control behaviour that allows us to engage in goal-directed behaviour, prioritise tasks, develop appropriate strategies and solutions, and be cognitively flexible.
These executive functions require normal functioning of the frontal lobe, anterior cingulate cortex, basal ganglia, and many inward and outward connections to cortical and subcortical areas.
Language. Language includes several cognitive abilities that are crucial for understanding and producing spoken and written language, as well as for naming. Given its complexity, we usually explore language with batteries of tests that use different tasks to investigate its specific aspects (Table 4). According to the traditional neuroanatomical view, language relies primarily on the dominant hemisphere: comprehension relies on the superior temporal lobe, language production on the frontal regions and fronto-parietal/temporal circuits, and conceptual–semantic processing on a network that includes the middle temporal gyrus, the posterior middle temporal regions, and the superior temporal and inferior frontal lobes.[19] However, recent data from stroke patients do not support this model, but instead indicate that language impairments result from disrupted connectivity within the left hemisphere and within bilaterally distributed supporting processes, which include auditory processing, visual attention, and motor planning.[11]
Intellectual ability. Regardless of the theoretical model, there is agreement that intellectual ability—or intelligence quotient (IQ)—is a multi-dimensional construct. This construct includes intellectual and adaptive functioning, communication, caring for one's own person, family life, social and interpersonal skills, community resource use, self-determination, school, work, leisure, and health and safety skills. The Wechsler adult intelligence scale revised (WAIS-R) is the best-known intelligence test used to measure adult IQ. The WAIS-R comprises 11 subtests grouped into verbal and performance scales (Table 4). Any mismatch between verbal and performance scores might suggest a different pattern of impairment, e.g., memory and language vs. visuo-spatial and executive deficits.

COMPARING TO NORMATIVE VALUES
A person's performance on a cognitive test is interpreted by comparing it to that of a group of healthy individuals with similar demographic characteristics. Thus, the raw score is generally corrected for age, education, and sex, and the corrected score is rated as normal or abnormal. However, not all neuropsychologists use the same normative values. Furthermore, there are no clear guidelines or criteria for judging the normality of cognitive testing. For example, the diagnostic guidelines for mild cognitive impairment in Parkinson's disease stipulate a performance on neuropsychological tests that is 1–2 standard deviations (SDs) below appropriate norms, whereas for IQ, a performance that is significantly below average is defined as ≤ 70, i.e., 2 SDs below the average score of 100.[2] Sometimes, the neuropsychological outcome is reported as an equivalent score, indicating a level of performance (Figure 1). Understanding how normality is defined—how many SDs below normal values, and the meaning of an equivalent score—is crucial for interpreting neuropsychological results correctly, and for comparing the outcomes of evaluations performed in different clinical settings; a worked example of this comparison is sketched below. Furthermore, estimating the premorbid cognitive level, e.g., using the National Adult Reading Test (Table 3), helps to interpret the patient's score.
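To make the arithmetic of normative comparison concrete, here is a short Python sketch (our illustration, not part of the original article) that converts a demographically corrected score into a z-score against a normative sample and classifies it with a cut-off. The normative mean, SD, and the 1.5 SD cut-off are invented example values, not figures from any test manual.

```python
# Illustrative sketch: comparing a corrected test score with normative values.
# The normative mean, SD, and cut-off below are made-up example numbers,
# not values from any published test manual.

def z_score(corrected_score: float, norm_mean: float, norm_sd: float) -> float:
    """Standardise a demographically corrected score against the normative group."""
    return (corrected_score - norm_mean) / norm_sd

def classify(z: float, cutoff_sd: float = -1.5) -> str:
    """Label performance relative to a chosen cut-off (e.g., 1.5 SD below the norm)."""
    return "below expected range" if z <= cutoff_sd else "within expected range"

# Example: a memory test with a normative mean of 50 and an SD of 10.
z = z_score(corrected_score=32.0, norm_mean=50.0, norm_sd=10.0)
print(f"z = {z:.1f} -> {classify(z)}")   # z = -1.8 -> below expected range

# For IQ-style scores (mean 100, SD 15), a score of 70 corresponds to z = -2.
print(z_score(70.0, 100.0, 15.0))        # -2.0
```

The cut-off chosen here is arbitrary; as the text notes, different diagnostic guidelines place the threshold at different numbers of SDs below the norm.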
'Crystallised intelligence' refers to consolidated abilities that are generally preserved until late age, compared with other abilities, such as reasoning, which show earlier decline. In people with low crystallised intelligence—and consequently a low premorbid cognitive level—a low-average neuropsychological assessment score may not represent a significant cognitive decline. Conversely, for people with a high premorbid cognitive level, a low-average score might suggest a significant drop in cognitive functioning.

REACHING A DIAGNOSIS THROUGH NEUROPSYCHOLOGICAL TESTING
Although the score on a single test is important, it is only the performance on the whole neuropsychological test battery that allows clinicians to identify a person's patterns of cognitive strengths and weaknesses; together with motor and behavioural abnormalities, these may fit into known diagnostic categories (Tables 5, 6).
The neuropsychologist reports the information collected through the neuropsychological evaluation in a written clinical report that usually includes the scores of each test administered. The conclusions of the neuropsychological report are important to guide further diagnostic workup, to predict functionality and/or recovery, to measure treatment response, and to verify correlations with neuroimaging and laboratory findings.
As well as these quantified scores, it is critically important to have the patient's self-report of functioning, plus qualitative data, including observation of how the patient behaved during the test.
Psychiatric confounders require particular attention. Neuropsychologists apply scales for depression (e.g., Beck's depression inventory, geriatric depression scale) or anxiety (e.g., state–trait anxiety inventory) during testing; these may offer information on how coexisting conditions may influence cognition through changes in mood or motivational state. For example, it may be difficult to distinguish between dementia and depressive pseudo-dementia, because depression and dementia are intimately related.[20] Table 7 shows some of the features that may help. Note that antidepressants may ameliorate cognitive deficits, particularly attention and memory, and that opioids may worsen cognitive symptoms.
Knowing that there are other potential factors that may influence neuropsychological testing (usually by worsening performance) should help clinicians to avoid misinterpreting the results (Table 8). For example, in Parkinson's disease, it is important to pay particular attention to motor fluctuations, neuropsychiatric symptoms, pain, and drug side effects that can worsen cognitive performance.[21] Conversely, patients with long-lasting psychiatric disease, such as bipolar disorder or schizophrenia, are often referred for neurological and cognitive assessment when they begin to perform worse in daily activities. Frontal changes are common in bipolar disorders, so finding prefrontal dysfunction in such patients should not, by itself, lead clinicians to suspect an ongoing neurological disorder.
Discussion with the clinical neuropsychologist and the psychiatrist may help in understanding potential drug side effects and, possibly, in revising treatment.

Key points
• For many neurological diseases, neuropsychological testing offers relevant clinical information that complements the neurological examination.
• Neuropsychological tests can identify patterns of cognitive strengths and weaknesses that are specific to particular diagnostic categories.
• Neuropsychological testing involves tests that investigate different cognitive functions in a standardised way, so the procedures, materials, and scoring are consistent; it also involves an anamnestic interview, scoring and interpreting the results, and comparing these with other clinical data, to build a diagnostic hypothesis.
• Neuropsychological evaluation must be interpreted in the light of coexisting conditions, in particular sensory, motor, and psychiatric disturbances as well as drug side effects, to avoid misinterpreting the results.

Provenance and peer review. Commissioned. Externally peer reviewed. This paper was reviewed by Nick Fox, London, UK.
Acknowledgments. None.
Competing interests. None.
Funding. None.
Contributorship statement. CZ, AF, AM, and ST designed the article, collected and interpreted the data, drafted the manuscript, and revised it. MT and MB designed the article, collected and interpreted the data, and revised the manuscript for important intellectual content. All authors approved the final version of the article. CZ and ST take full responsibility for the content of this review.

Table 1.
Conditions in which neuropsychological testing is usually not recommended
• Patient too severely affected: the assessment would be uninformative or only slightly informative, and the cost in terms of burden for the patient (i.e., fatigue, anxiety, feeling of failure) may exceed the benefit of gaining information from the assessment.
• Clear diagnosis: if the diagnosis is clear and neuropsychological testing would serve diagnostic purposes only, it should not be routinely prescribed.
• Distress and/or anxiety might be produced: the diagnosis has already been defined and it is clear that the patient will fail in testing.
• Recent (<6 months) neuropsychological assessment: significant cognitive decline is unlikely over a short time, unless a neurological event has occurred or the patient is affected by rapidly progressive dementia; short-interval repeated evaluation may also be biased by learning effects, except when parallel versions of tests are used.
• The a priori likelihood of an abnormality is low: neuropsychological testing should not be routinely performed when the clinical history and examination exclude a neurological or cognitive condition; consider prescribing neuropsychological testing if it is the only way to provide reassurance when a healthy individual is concerned about cognitive decline.
• Confusion or psychosis: neuropsychological assessment is not reliable in this state and could exacerbate confusion and/or abnormal behaviour.

Table 2. Structure of the neuropsychological evaluation
• Interview with the patient, relative, or caregiver: reason for referral (i.e., what the physician and patient want to know); medical history, including family history; lifestyle and personal history (e.g., employment, education, hobbies); premorbid personality; symptom onset and evolution; previous examinations (e.g., CT scan, MR scan, electroencephalography, positron-emission tomography scan); sensory deficits (loss of vision or hearing).
• Qualitative assessment of cognition, mood and behaviour: mood and motivation (i.e., depression, mania, anxiety, apathy); self-control or disinhibition; subjective description and awareness of cognitive disorders, and their impact on the activities of daily life; expectations and beliefs about the disease; verbal (fluency, articulation, semantic content) and non-verbal (eye contact, tone of voice, posture) communication; clothing and personal care; interview with the relative/caregiver to confirm the patient's information, provide explanations, and acquire information on how the patient behaves in daily life.
• Test administration: standardised administration of validated tests.
• Final report: Personal …

Empirically Supported Forensic Assessment
Robert P. Archer, Eastern Virginia Medical School
Elizabeth M. A. Wheeler and Rebecca A. Vauter, Central State Hospital
The field of Forensic Psychology has greatly expanded over the past several decades, including the use of psychological assessment in addressing forensic issues. A number of surveys have been conducted regarding the tests used commonly by forensic psychologists. These surveys show that while tests specifically designed to address forensic issues have proliferated, traditional clinical assessment tests continue to play a crucial role in many forensic evaluations. The current article identifies some of the most salient characteristics of empirically supported forensic tests and provides examples of tests felt to meet each of these five criteria.
These criteria include adequate standardization, acceptable reliability and validity, general acceptance within the community of forensic evaluators, availability of test data from cross-cultural and cross-ethnic samples, and comparison data relevant to specific forensic populations. Although the guidelines provided in this article offer a helpful framework for evaluating the usefulness of forensic tests, the establishment of a national review panel or workgroup to address this issue would be highly useful, particularly for the potentially controversial task of identifying those tests that meet reasonable guidelines to be identified as empirically supported forensic assessment instruments.
Key words: forensic, forensic assessment, survey, testing. [Clin Psychol Sci Prac 23: 348–364, 2016]

BRIEF REVIEW OF THE HISTORY AND DEFINITION OF FORENSIC PSYCHOLOGY
The genesis of what would later be termed Forensic Psychology can be traced back to the early 20th century, when psychologists first became involved in the attempt to understand the limitations of eyewitness testimony and, in particular, to Hugo Münsterberg's advocacy for an increased role for psychologists within the legal system (Vaccaro & Hogan, 2004). Indeed, the scope of the psychologist's role in addressing psycholegal issues both inside and outside of the courtroom has expanded greatly as a result of several court rulings. For example, the landmark 1962 ruling in Jenkins v. United States helped to facilitate psychologists' ability to testify inside the courtroom as expert witnesses, by asserting that psychologists could be qualified to give expert testimony on issues of mental health. Over the past several decades, Forensic Psychology has established recognized training models, board certification requirements, and an American Psychological Association (APA) division focused on this area of practice (Division 41, American Psychology-Law Society).
For many laypeople, the term Forensic Psychology may invoke dramatic images from CSI (Crime Scene Investigation). For many psychologists not actively involved in Forensic Psychology, the idea of testifying in court may invoke a strong anxiety response. While inaccurate views of the role of forensic psychologists can easily be dismissed as a product of the popular media, and the testimony anxiety of those not actively involved in Forensic Psychology can be effectively addressed by training and experience in the courtroom, there remains a need for a clear and concise definition of the term Forensic Psychology (Huss, 2009). In this regard, relatively recent guidelines by the APA have helped to clarify the term and define when psychologists are working within the scope of Forensic Psychology.
The most recent edition of the APA's Specialty Guidelines for Forensic Psychology (APA, 2011) defines Forensic Psychology as "professional practice by any psychologist working with any sub-discipline of psychology (e.g., clinical, developmental, social, cognitive) when applying the scientific, technical, or specialized knowledge of psychology to the law to assist in addressing legal, contractual, and administrative matters" (p. 1). Thus, Forensic Psychology can encompass not only direct legal issues (e.g., competency and sanity evaluations) but also administrative/contractual issues such as fitness-for-duty and disability evaluations.
The APA Guidelines also helped to define when a psychologist is serving as a "forensic practitioner" in the following statement:
Forensic practitioner refers to a psychologist when engaged in the practice of Forensic Psychology. . . . Such professional conduct is considered forensic from the time the practitioner reasonably expects to, agrees to, or is legally mandated to, provide expertise on an explicitly psycholegal issue. (2011, p. 1)
The APA Guidelines do not rely on individuals' typical area of practice in order to determine whether they are providing forensic services, but instead focus on what they are doing in the specific case. The Guidelines state that simply being involved in a court-related matter does not make one a forensic psychologist (e.g., someone testifying only on his or her treatment of an individual would not be considered to be engaged in forensic practice). In contrast, a neuropsychologist retained to assess and subsequently testify regarding a psycholegal issue (e.g., whether an individual suffered a brain injury as the result of a car accident for which he or she was seeking compensation) would be practicing as a forensic psychologist regardless of her activities in other areas of her practice (i.e., even if the rest of her caseload was not court-involved individuals). Thus, anyone providing a psycholegal opinion in a legally mandated matter is providing services as a forensic psychologist, regardless of the nature of his or her typical practice area.

DIFFERENCE BETWEEN OBJECTIVES OF CLINICAL ASSESSMENT AND FORENSIC ASSESSMENT
There are some important differences between assessments undertaken in the forensic context and clinical assessment. Ackerman (1999), in his book Essentials of Forensic Psychological Assessment, provided a summary table that presents some of the salient differences between clinical and forensic relationships. The key components from that table will be reviewed here.
The first difference between clinical and forensic assessments has to do with who is identified as the client. In clinical assessment, the individual being assessed is typically the identified client, whereas in a forensic setting, an attorney or the court is usually the client of record. This is a crucial distinction, and one from which many of the other differences between clinical and forensic evaluations are derived.
A second major difference between the two types of assessment concerns the rules that govern the disclosure of information. In clinical assessments, patient–therapist privilege and the Health Insurance Portability and Accountability Act (HIPAA) provide the guidelines that cover disclosures.
In forensic assessment, the scope of disclosures may either be mandated by statute (e.g., in competency and sanity evaluations, the state or federal statutes dictate who has access to the report) or covered under attorney–client privilege (e.g., in immigration cases or privately retained personal injury evaluations in which the psychologist is typically retained by one side). Overall, it is likely that the report produced at the conclusion of a forensic evaluation will be more widely distributed than a clinical assessment report, and in some cases the report can become part of a case file that could be open to the public (e.g., in a sanity case in a court of record such as a circuit court). Further, a forensic psychologist may find parts of their report quoted in case law, or even in local or national newspapers, an event much less likely to occur when practicing clinical assessment.
A third significant difference is the stance the psychologist takes toward the client/examinee in the evaluation process. In a clinical assessment, the psychologist typically assumes an empathetic, supportive, and non-judgmental stance. In contrast, in forensic assessment, the evaluator is often advised to present in a more neutral and objective manner. Indeed, presenting oneself as empathetic or supportive during a forensic evaluation may be viewed as potentially misleading and inappropriate. At the heart of the differences between clinical and forensic assessment is the adversarial nature of the legal system. While in clinical assessment a relationship with a client would rarely be adversarial, it is much more frequently the case in forensic assessment that at least part of the process may be viewed as adversarial by the examinee.
A fourth area of difference is that forensic psychologists are often skeptical of the accuracy of the examinee's self-report. Thus, the forensic psychologist is more likely to rely on a variety of sources of information to confirm the examinee's self-report, while a clinical psychologist conducting an assessment is more likely to rely more heavily on the subject's self-report. This difference is related to the level of scrutiny and corroboration applied to the self-report of the examinee. Individuals participating in forensic evaluations may have a wide variety of reasons to provide inaccurate information during the evaluation. For example, examinees may over-report their psychological stability and parenting skills in order to obtain custody of their children, while other examinees may exaggerate or malinger mental health problems in order to avoid criminal responsibility or to obtain monetary gain in personal injury cases. Further, threats to validity in forensic assessment may also come from a lack of client self-awareness rather than a conscious intent to be dishonest or malinger.
Thus, psychologists conducting forensic assessments often rely on collateral sources of information, including medical records; police reports; victim statements; reports from employers, family members, or friends; and even taped phone calls, e-mails, or text messages to confirm or disconfirm the self-report information provided by the examinee.
Finally, the goals of clinical and forensic assessments are typically quite different, and may even be in conflict. Specifically, the goal of clinical assessment is to assist the client (patient). This is typically done by answering broader diagnostic questions to assist clients in understanding more about themselves and to facilitate treatment planning. In contrast, the goal of the forensic assessment is to assist the court, or the retaining party such as an attorney, by providing opinions regarding a psycholegal question. Thus, forensic assessment may or may not be helpful to the individual being assessed, and may not even provide a diagnosis (e.g., it may not be necessary to provide a diagnosis in a competency evaluation if it does not impact the defendant's abilities in court). Indeed, it is possible that the outcome of a forensic assessment could be harmful to an individual's case (e.g., an individual wants to plead not guilty by reason of insanity, but the evaluator does not find evidence to support such a plea).

BROADNESS OF THE FIELD OF FORENSIC ASSESSMENT
As previously noted, Forensic Psychology encompasses a large practice area. Huss (2009) reviewed the major areas of Forensic Psychology and included various topics, such as risk assessment at the time of sentencing, insanity (criminal responsibility), competency to stand trial, sex offender evaluations, juvenile transfers to adult court, child custody, civil commitment, personal injury cases, worker's compensation, and competency to make medical decisions. We would also add to this list the additional practice areas of fitness-for-duty evaluations, capital sentencing mitigation, immigration evaluations, jury selection consultation, guardianship/conservatorship evaluations, and juvenile evaluations conducted at the court's request, although it is important to note that this is not an all-inclusive list. It is quite clear that two psychologists may report that they both practice Forensic Psychology without engaging in similar evaluations (e.g., one might conduct child custody and parenting evaluations while another might build a practice on competency and sanity evaluations). Further, being competent in one area of forensic practice does not make a practitioner competent in other forensic practice areas.
Most practice areas in Forensic Psychology fall under two broad categories: civil and criminal. Civil areas of practice typically involve the relationship between members of the community, and the goal is not to punish a wrongdoer but to prevent or compensate for a wrong. Examples of civil cases would be custody evaluations and personal injury evaluations.
In contrast, criminal practice typically involves cases in which it is alleged that criminal laws have been broken, and the goal is to appropriately punish the wrongdoer. Examples of Forensic Psychology involvement in criminal law would include competency to stand trial evaluations and sanity evaluations, as well as capital sentencing mitigation.

APA GUIDELINES (DIVISION 41 GUIDELINES)
Because Forensic Psychology encompasses such a vast array of practice areas, there are a variety of specific pathways through which psychologists can become competent and proficient in their specialty area. The Specialty Guidelines for Forensic Psychology (APA, 2011) provide some general guidance regarding the relevant issues in conducting forensic assessments. The following is a brief summary of these guidelines:
• Forensic psychologists seek to focus on the legally relevant factors (i.e., they understand and are guided by relevant legal statutes and case law).
• Forensic practitioners seek the appropriate use of assessment procedures (i.e., they understand the strengths and limitations of the tests they select as applied to the relevant forensic issues and populations). These guidelines state that forensic practitioners also seek to consider the strengths and limitations of employing traditional assessment procedures in forensic examinations.
• Given the stakes involved in forensic contexts, forensic practitioners recognize the need to take special care to ensure the integrity and security of test materials and results.
The guidelines also state that when test score validity has not been firmly established in a forensic context, the practitioner should seek to describe test score strengths and limitations and to explain these issues in the forensic context. This explanation may include the observation that the context in which the assessments have been given may warrant adjustments in test score interpretation.
The guidelines also suggest that forensic psychologists should
• take into account individual examinee differences, including "situational, personal, linguistic, and cultural differences that might affect their judgements or reduce the accuracy of their interpretations" (APA, 2011, p. 15);
• take reasonable steps to explain, in an understandable manner, the results of the test to either the examinee or his or her representative. If this feedback is not possible, this restriction should be explained in advance;
• document all the data sources they considered as part of the evaluation. Further, this documentation should be made available based on proper subpoenas or legal consent; and
• maintain careful and detailed records of the evaluation process.

IMPACT OF THE DAUBERT SUPREME COURT DECISION
Psychologists' involvement in court-related matters has long been governed, at least in part, by rules regarding the admissibility of evidence. For over 50 years, this involvement in federal courts and in many state courts was governed primarily by the standard developed out of the Frye v. United States (1923) ruling. This ruling established that evidence could be admissible if it were based on a technique or method generally accepted in the field. Thus, for example, psychologists' testimony may be found to be admissible if based on a practice, test, or technique generally accepted by other psychologists within that field.
For obvious reasons, however, determining what method is "generally accepted" may be difficult when applied to a specific test used within a specific context.
In 1993, the U.S. Supreme Court modified the admissibility standard in its ruling in the case of Daubert v. Merrell Dow Pharmaceuticals. The Daubert case involved the admissibility of testimony based on technical or scientific data. The Supreme Court heard the case because it felt that there were differences regarding how lower courts had been determining the proper standard for admitting expert testimony. The Court noted that the Frye standard had been superseded by the Federal Rules of Evidence. Indeed, the Federal Rules of Evidence have one section for the "ordinary witness" (701) and one for expert witnesses (702). The Court noted that the additional requirements in 702 were not surprising given that "an expert is permitted wide latitude in offering opinions, including those that are not based on firsthand knowledge or observation." They observed that this relaxation of the requirement for firsthand knowledge, a central part of common law, is "premised on an assumption that the expert's opinion will have a reliable basis in the knowledge and experience of his discipline." Thus, the Court then outlined the following guidelines for Rule 702 expert witness testimony:
. . . If scientific, technical, or other specialized knowledge will assist the trier of fact to understand the evidence or to determine a fact in issue, a witness qualified as an expert by knowledge, skill, experience, training, or education, may testify thereto in the form of an opinion or otherwise.
The Court also noted that nothing in this rule specifically required "general acceptance" for admissibility. Therefore, the Court stated, "That austere standard [Frye], absent from and incompatible with the Federal Rules of Evidence, should not be applied in federal trials." The Court went on to outline the limits put into place by the Rules of Evidence, including that "any and all scientific testimony or evidence admitted is not only relevant, but reliable."
The justices noted that many factors could beneficially influence a judge's decision-making process, and therefore they were not attempting to create a rigidly defined checklist or test, but they did provide the following considerations in evaluating admissibility issues:
• Is the evidence/opinion based on a technique or method that has been tested and has established standards controlling its use and operation?
• Has the theory or technique undergone peer review and publication?
• Does the technique have a known and established error rate?
• Finally, "the general acceptance of a technique" still has a bearing on the admissibility of testing, but no longer serves as the sole or exclusive criterion.
Functionally, the Court developed a flexible standard to determine admissibility of expert testimony and opinions. Two crucial Supreme Court cases followed Daubert. In the 1997 ruling in General Electric Company v. Joiner and the 1999 ruling in Kumho Tire Company v.
Carmichael, the Supreme Court clarified aspects of the Daubert ruling regarding the admission of expert witnesses and testimony. Generally, the criteria for admissibility under this series of cases support the view that testimony based on psychometrically reliable and valid instruments, with test conclusions based on empirical support, is more likely to meet admissibility guidelines in federal courts and in the numerous states that follow Daubert criteria.
Goodman-Delahunty (1997) argued that the effect of Daubert has been to cause forensic psychologists "to be more explicit about the scientific foundations of their opinions" (p. 121), and Underwager and Wakefield (1993) discussed the influence of the Daubert decision on psychological testimony in an article entitled "A Paradigm Shift for Expert Witnesses." A growing number of articles have reviewed specific psychological tests such as the MMPI-2-RF (Ben-Porath, 2012; Sellbom, 2012) and MCMI (Rogers, Salekin, & Sewell, 1999) in terms of their ability to meet Daubert criteria, and there seems little doubt that any test used to form the basis of an expert's opinion in court is potentially subject to review based on Daubert factors.
Many states have adopted the Daubert standard, but several continue to use Frye or other admissibility criteria. It is incumbent on forensic practitioners to know which standard their state uses and how it might impact the admissibility of psychological evidence. For purposes of the current article, however, it is most important to note that various psychological tests, or components of tests, have been submitted to a "Daubert Challenge" or "Daubert Motion," that is, a hearing conducted before a judge in which the validity and admissibility of expert testimony (in this case based on a specific test or tests) are challenged by opposing counsel as failing to meet Daubert standards of methodology, reliability, validity, and general acceptance.

WHAT DO WE KNOW ABOUT THE "GENERAL ACCEPTANCE" OF TESTS USED IN FORENSIC EVALUATIONS?
A growing body of research literature has addressed the issue of the types of assessment instruments typically used in a variety of forensic settings. This research is an important aspect of determining which psychological tests are generally accepted by practitioners to address varied forensic issues, in turn contributing to identifying tests that are generally accepted within the field of psychology. We shall review some of the more prominent of these studies that have focused on the tests used across a variety of forensic settings, and then briefly review some of the surveys of assessment instruments used for specific forensic purposes, such as custody evaluations, violence risk assessment, and criminal forensic evaluations.

SURVEYS OF TEST USAGE IN GENERAL FORENSIC SETTINGS
Boccaccini and Brodsky (1999) noted that psychologists have become increasingly involved in forensic assessments.
The authors conducted a survey among 80 members of APA Divisions 12 (Clinical) and 40 (Neuropsychology) to evaluate what instruments these practitioners used in emotional injury assessments and to see whether practitioners used these tests in a manner consistent with the Daubert (1993) Supreme Court ruling on the admissibility of expert witness testimony. The 80 psychologists who completed surveys in this study had conducted over 10,500 emotional injury evaluations during their careers, and listed a total of 67 different assessment instruments they had employed in emotional injury evaluations during the past year. However, only 11 of these tests were used by five or more survey respondents, and no practitioners used exactly the same combination of tests in their standard battery for injury evaluations. The five most frequently employed tests reported in this study included the Minnesota Multiphasic Personality Inventory (MMPI) or MMPI-2 (94%), the Wechsler Adult Intelligence Scale–Revised (WAIS-R) or WAIS-III (54%), the Millon Clinical Multiaxial Inventory-II (MCMI-II) or MCMI-III (Millon, 1994; 50%), the Rorschach Inkblot Technique (41%), and the Beck Depression Inventory (31%). The authors noted that respondents indicated that Daubert-related criteria, such as general acceptance of the test within the field and the presence of independent research validation, played an important role in their selection of instruments. However, Boccaccini and Brodsky (1999) reported that respondents also cited test selection factors unrelated to the Daubert standard, such as their personal clinical experience with a particular test, as popular reasons for test selection. Thus, in response to the question concerning whether psychologists selected tests in forensic evaluations based on criteria outlined in the Daubert decision, the authors concluded, "Our findings indicate that the answer is yes and no" (p. 257).
Since the survey by Boccaccini and Brodsky (1999), several other surveys have been conducted on test usage by psychologists in forensic settings. Table 1 provides a summary of the results from these studies.

Table 1. Result summary of test usage in general forensic settings (study; sample; most frequently used/recommended tests)
• Boccaccini and Brodsky (1999); 80 APA Division 12 or Division 40 members: MMPI/MMPI-2 (94%), WAIS-R/WAIS-III (54%), MCMI-II/MCMI-III (50%), Rorschach (41%), Beck Depression Inventory (31%)
• Lally (2003); 64 forensic psychology diplomates: MMPI-2, WAIS-III, PCL-R, Luria-Nebraska, Halstead-Reitan
• Archer et al. (2006); 152 APA Division 41 members or AAFP diplomates: Wechsler Intelligence Scales (cognitive assessment), Beck Depression Inventory and Beck Anxiety Inventory (single-scale tests), MMPI-2 (multiscale tests), MMPI-A (children/adolescents), Rorschach (projective tests)
• Viljoen et al. (2010); 215 psychologists who perform forensic evaluations of adults or children: Wechsler Intelligence Scales (75.3%), MMPI-2/MMPI-A (66%), Structured Assessment of Violence Risk in Youth (SAVRY) (Borum, Bartel, & Forth, 2003; 35.1%), MCMI-III or MACI (31.2%), PCL-R or PCL:YV (24.7%)
• Austin and Wygant (2012); 284 psychologists conducting forensic evaluations: MMPI-2 (59%), PAI (38%), MMPI-2-RF (29%), MCMI-III (26%), Rorschach (22%)

Lally (2003), for example, surveyed 64 diplomates in Forensic Psychology concerning the frequency with which they used various tests, and their opinions concerning the acceptability of a wide variety of specific psychological tests, in six areas of forensic practice, including mental state at the time of an offense, risk for future violence, risk for future sexual violence,
competency to stand trial, competency to waive Miranda rights, and evaluation of …

Computer-based and paper-based testing: Does the test administration mode influence the reliability and validity of achievement tests?
Hüseyin Öz, Department of Foreign Language Education, Hacettepe University, Ankara 06800, Turkey
Tuba Özturan, School of Foreign Languages, Erzincan University, Erzincan 24000, Turkey
APA Citation: Öz, H., & Özturan, T. (2018). Computer-based and paper-based testing: Does the test administration mode influence the reliability and validity of achievement tests? Journal of Language and Linguistic Studies, 14(1), 67-85.

Abstract
This article reports the findings of a study that sought to investigate whether computer-based vs. paper-based test-delivery mode has an impact on the reliability and validity of an achievement test for a pedagogical content knowledge course in an English teacher education program. A total of 97 university students enrolled in the English as a foreign language (EFL) teacher education program were randomly assigned to the experimental group that took the computer-based achievement test online and the control group that took the same test in paper-and-pencil format. Results of Spearman rank-order and Mann-Whitney U tests indicated that test-delivery mode did not have any impact on the reliability and validity of the tests administered in either way. Findings also demonstrated that there was no significant difference in test scores between participants who took the computer-based test and those who took the paper-based test. Findings were discussed in terms of the idea that computer technology could be integrated into the curriculum not only for instructional practices but also for assessment purposes.
Keywords: computer-based testing; paper-based testing; reliability; validity; English teacher education

1. Introduction
With the introduction of the digital revolution, educators have begun to benefit from modern computer technology to carry out accurate and efficient assessment of learning outcomes both in primary/secondary and higher education. In recent years, Turkish institutions of higher education have also started integrating e-learning and assessment initiatives into their undergraduate programs. It is assumed that Turkish educational institutions will gradually move components of their assessment systems to online delivery or computerized mode.
There are several reasons for implementing computerized assessments in education. Computerized delivery can reduce the "lag time" in reporting scores, increase the efficiency of assessment, provide flexibility in terms of time and place, give immediate feedback and announce students' scores immediately, allow analysis of student performance that cannot be obtained from paper-based tests by implementing individualized assessments customized to student needs, and minimize paper consumption and the cost of duplicating or mailing test materials (Alderson, 2000; Bennett, 2003; Noyes & Garland, 2008; Paek, 2005; Roever, 2001).
This paper reports on the findings of a study that investigated whether computer-based and paper-based tests as test-delivery modes would influence the reliability and validity of the achievement test for a pedagogical content knowledge course in an English as a foreign language (EFL) teacher education program.

1.1. Reliability and validity criteria of tests
Defining the aims of tests and choosing the most suitable test type should be done before administering a test. However, these steps are not enough to produce an effective test. Educators first have to consider some specific principles, foremost among which are validity and reliability. As the most essential criterion for the quality of any assessment, validity is the relation between the aim and the form of the assessment and refers to whether a test truly measures what we claim it to measure. In other words, a test measures what it is supposed to measure once the test is valid (Fulcher & Davidson, 2007; Stobart, 2012). As validity is a crucial criterion for conducting tests, the following question lingers: how can instructors create valid tests or increase the validity of tests? Some tips, documented in the academic literature, are available to them. Firstly, direct testing should be done whenever feasible, and explanations should be made clear. Secondly, scoring should relate directly to the targets of the tests. Lastly, reliability has to be satisfied; otherwise, validity cannot be assured (Hughes, 2003).
Reliability, on the other hand, is the degree to which a test measures a skill and/or knowledge consistently (Scheerens, Glas, & Thomas, 2005, p. 93). Therefore, similar scores are commonly achieved on a reliable test when the same exam is administered on two different days or in two different but parallel formats. It is important to note that Brown and Abeywickrama (2010) and Hughes (2003) both emphasize that the interval between the administrations of two tests should be neither too long, as students might learn new things, nor too short, as students might still remember the exam questions. Once a test is reliable, the test-takers will get more or less the same score no matter when the test is administered, whether on a certain day or on coming days, and teachers have to prepare and administer reliable tests so as to obtain similar results from the same students at a different time (Hughes, 2003, p. 36).
Reliable tests give predictions about the extent to which measurement-related factors may have an impact on test scores. These factors can be grouped into the following categories: test factors, which refer to the clarity of the instructions, the items, the paper layout, and the length of the test; situational factors, which refer to the conditions of the room; and individual factors, which cover the physical and psychological state of the test-taker. All these factors should be considered while interpreting the reliability of any test scores.

1.2. Computer-based testing alternatives
Computers are undoubtedly part of our daily lives; they play an active part in many different walks of life. This role change in computer applications goes back to the late 1970s. Since then, computers have had a vital place in the world, especially for educational purposes. In addition to the widespread use of the web and computers as teaching resources both inside and outside the class (especially for distance education), computers have come to offer testing alternatives for teachers as well. Today, it is estimated that nearly 1000 computer-assisted assessments are carried out each day in the UK (Lilley, Barker, & Britton, 2004). These assessment models do not refer only to traditional tests that are administered on computers in class under the supervision of proctors. There are different sorts of alternatives, known as computer-based testing (CBT), web-based testing (WBT), and computer-adaptive testing (CAT). These are briefly introduced below.
Computer-based testing roughly refers to making use of computers while preparing questions, administering exams, and scoring them (Chapelle, 2001). With the advent of computers as testing devices since the 1980s, a different point of view has been gained, so that a more authentic, cost-saving, scrutinized, and controlled testing environment can be achieved compared with the traditional paper-and-pencil one (Jeong, 2014; Linden, 2002; Parshall, Spray, Kalohn, & Davey, 2002; Wang, 2010; Ward, 2002). Computer-based testing, which started in the late 1970s or early 1980s, was always thought of as an alternative to paper-based testing (Folk & Smith, 2002), because a "one size fits all" solution across testing programs was not desired at all (Ward, 2002, p. 37).
Computers have brought many advantages. First of all, they have the potential to offer realistic test items involving media, graphics, pictures, video, and sound (Chapelle & Douglas, 2006, p. 9; Linden, 2002, p. 9). Therefore, students can be involved in a realistic testing environment with many integrated activities. In other words, students can respond to computers orally, draw on the screen while answering a question, see and interpret graphics or tables for an open-ended question, and so on, and test-takers with disabilities can take the exams on computer with greater ease. CBT also supplies immediate feedback and scoring (Chapelle & Douglas, 2006; Parshall et al., 2002), which has a significant impact on pedagogy (test-takers can grasp their mistakes when immediate feedback is offered upon completion of the test) and eases teachers' workload of scoring all the papers: teachers may spend much time scoring exam papers, and generally they cannot give enough feedback about each student's mistakes; even if they provide feedback, it may come so late that students no longer remember the questions or their answers.
Another issue that should be mentioned here is that, especially for open-ended questions, subjective scoring may come into play. However, thanks to computer technology, objective scoring can be achieved, and problems caused by handwriting disappear too. The last important feature of CBT, or Computer-Assisted Assessment (henceforth CAA), is that examiners can collect data about the exam, such as how many questions have been answered correctly, how many have been omitted, and how many minutes have been spent on each question, which is called response latency (Parshall et al., 2002, p. 2).
Since the beginning of the use of computers as testing tools, many different computer-based test delivery modes have come onto the scene: computer-adaptive testing (CAT), linear-on-the-fly testing (LOFT) or computerized fixed tests (CFT), computerized mastery testing (CMT) (Ward, 2002, p. 38), and automated test assembly (ATA) (Parshall et al., 2002, p. 11). CAT is totally performance- or individual-based testing: the more questions a candidate answers correctly, the more challenging the questions that appear on the screen, and vice versa. On the contrary, LOFT or CFT has a fixed time and test length for all test-takers; exam security is the main goal in LOFT, rather than the psychometric values pursued in CAT (Parshall et al., 2002). As for CMT, it aims to divide test-takers into mastery and non-mastery groups (Folk and Smith, 2002, pp. 49-50). Lastly, ATA chooses items from an item pool according to the test plan and makes use of content and statistical knowledge. This kind of test has a fixed time and is not adaptive (Parshall et al., 2002, p. 10).
Kearsley (1996) emphasized the importance of the web and its future potential as an educational tool many years ago. Not only is the web a means of delivering information, material, news, and so on from one part of the world to the rest, but it is also the most commonly used and most significant resource for teachers for a variety of purposes, such as searching for different types of materials, teaching in distance education, presenting, and preparing and delivering tests. The reason behind this change is that, since the 1990s, international connectivity has no longer been limited to teaching staff at universities and their use of networks in computer labs, and without any doubt, it has brought many changes. As for testing applications, universal access to computer-assisted assessment has been introduced, and a wealth of opportunities for autonomous learning and self-assessment has spread all around the world, as have computer-based applications. Today, thanks to web-based applications, students and teachers can be universally in touch (Chapelle, 2001, p. 23).
As an alternative form of CAA, web-based testing is specifically driven and delivered by means of the web, which means that the tests can be taken anywhere and anytime; this constitutes a great advantage over traditional paper-based and computer-based tests (Roever, 2001). Moreover, the web system also makes it possible to create unique exams, and it rests on an important mathematical foundation (McGough, Mortensen, Johnson, & Fadali, 2001). Roever (2001, pp. 90-91) describes WBT as threefold, comprising low-stakes assessment, medium-stakes assessment, and high-stakes assessment, which can address different needs: low-stakes tests are used to give feedback about examinees' performance on a certain subject or skill.
The examinees can take these tests wherever they want. Medium-stakes assessment, on the other hand, covers midterm and final exams done in classes, placement tests, or any tests that have an impact on the examinees' lives; these kinds of tests are carried out under proctors in a lab. Lastly, high-stakes assessment is the kind whose results may greatly affect the examinee's life, such as admission to a university, certification programs, citizenship tests, and so on. Among these three types, WBT is most useful when it is done for low-stakes assessment.
A question can be created on the web in three phases (preparation, delivery, and assessment). Accordingly, an item begins to be created at authoring time. Teachers can prepare questions and store them in an item bank by using web tools. Then, questions or items are selected in order to construct the test. The selection of the items is done either statically by teachers themselves or dynamically by the system at run time (Brusikolovsky & Miller, 1999, p. 2). After the items are delivered and the exam is conducted, examinees' answers are assessed as correct, incorrect, or partially correct. In web technology, preparing, delivering, and assessing questions are based on HTML code (Brusikolovsky & Miller, 1999, pp. 2-3).
The last mode of CAA, computer-adaptive testing (CAT), which is based on each student's performance during the exam, has been utilized for many years. The cycle of CAT begins with a question that is neither very easy nor very difficult. According to each test-taker's answer to the item, the next question to be asked from the item pool is decided. More clearly, if a test-taker answers a question correctly, the next one will be harder or of equal difficulty; on the contrary, if a test-taker answers a question incorrectly, the next one will be easier. Hence, CAT is said to be performance based (Chapelle, 2001; Flaugher, 2000; Guzman & Conejo, 2005; Lilley et al., 2004), and this individualized exam model (Wainer & Eignor, 2000, p. 1) offers a more confidential testing atmosphere for both teachers and students (Guzman & Conejo, 2005; Linden & Glas, 2002). Students see one item on the screen at a time, and they cannot skip questions. While the test-takers are busy with each question, the system calculates the scores and decides which question will come next in relation to the previous answers given by the test-takers (Brown, 2004; Hughes, 2003); a minimal sketch of this adaptive loop is given below. This measurement model in CATs is known as Item Response Theory (IRT) or Latent Trait Theory, the mathematical bases of which were outlined by Lord and Novick around the 1970s (Stevenson and Gross, 1991, p. 224; Tung, 1986, pp. 4-5). The idea behind IRT goes back to the psychological measurement model put forward by Alfred Binet and today known as the Stanford-Binet IQ test (Linden & Glas, 2002). Binet's idea of measuring each test-taker separately and according to their performance while they are taking the test was accepted as the only adaptive testing approach for more than fifty years (Cisar, Radosav, Markoski, Pinter, & Cisar, 2010), but one drawback of this smart system was noted: despite its truly adaptive nature, experienced and skilled teachers (examiners) might be needed in order to administer large-scale tests. Therefore, it was practical only for small-scale tests (Madsen, 1991). Today, CAT is used not only for small-scale exams but also for large-scale, high-stakes exams.
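The adaptive cycle described above can be sketched in a few lines of Python. The fragment below is our illustration, not taken from the cited sources: it uses a simple Rasch-style (1PL) rule in which, after each response, the ability estimate moves up or down by a fixed step and the next item is the unused one whose difficulty is closest to the current estimate. The item bank, step size, and five-item test length are illustrative assumptions only.

```python
import math
import random

# Minimal computer-adaptive testing (CAT) sketch under a Rasch-style (1PL) model.
# The item difficulties, step size, and fixed five-item length are illustrative
# assumptions, not parameters of any real test.

ITEM_BANK = {"q1": -1.5, "q2": -0.5, "q3": 0.0, "q4": 0.5, "q5": 1.0, "q6": 1.8}

def p_correct(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the 1PL (Rasch) model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def run_cat(answer_item, n_items: int = 5, step: float = 0.5) -> float:
    """Administer items adaptively; answer_item(item_id) returns True/False."""
    ability = 0.0
    remaining = dict(ITEM_BANK)
    for _ in range(n_items):
        # Choose the unused item whose difficulty is closest to the current estimate.
        item = min(remaining, key=lambda q: abs(remaining[q] - ability))
        correct = answer_item(item)
        # Move the estimate up after a correct answer, down after an incorrect one.
        ability += step if correct else -step
        del remaining[item]
    return ability

# Example: simulate a test-taker whose true ability is 0.8.
random.seed(0)
estimate = run_cat(lambda q: random.random() < p_correct(0.8, ITEM_BANK[q]))
print(f"Estimated ability: {estimate:.1f}")
```

Operational CAT systems use maximum-likelihood or Bayesian ability estimation rather than a fixed step, but the selection logic (match item difficulty to the current estimate) is the same idea in miniature.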
For example, the Graduate Management Admission Test, the Microsoft Certified Professional exams, and the Test of English as a Foreign Language have been administered in CAT mode (Lilley et al., 2004, p. 110), and SIETTE is a web-based CAT system used in Spain (Guzman & Conejo, 2005, p. 688). Many schools and universities have started to benefit from web technology when administering exams. One of them is Iowa State University, which has created WebCT. This smart system does not require any technical knowledge to use, and teachers can easily create and publish online courses and exams (Chapelle & Douglas, 2006, p. 63). Among other online tools to be utilized are Hot Potatoes, Discovery School Quiz Center, Blackboard, and Questionmark (Chapelle & Douglas, 2006, pp. 72-73).

1.3. Studies on comparability of reliability and validity by test mode
Over the last two decades a number of comparability studies have concentrated on the effects of the test delivery mode on student performance, i.e., whether the test scores obtained from computer- and paper-based tests are interchangeable; these are referred to as "mode effects" (Bennett, 2003; Choi, Kim, & Boo, 2003; Dunkel, 1991; Paek, 2005; TEA, 2008; Wang, Jiao, Young, Brooks, & Olson, 2007). These studies often revealed mixed results regarding the comparability of CBT and PBT in different content areas. Some studies show that CBTs are more challenging than PBTs (Creed, Dennis, & Newstead, 1987; Laborda, 2010) or vice versa (Chin, 1990; Dillon, 1994; Yağcı, Ekiz & Gelbal, 2011), whereas other studies conclude that CBTs and PBTs are comparable (Akdemir & Oğuz, 2008; APA, 1986; Bugbee, 1996; Choi et al., 2003; Choi & Tinkler, 2002, cited in Wang & Shin, 2009; Higgings, Russell, & Hoffmann, 2005; Jeong, 2014; Kim & Hyunh, 2007; Logan, 2015; Muter, Latremouille, Treurniet, & Beam, 1982; Paek, 2005; Parshall & Kromrey, 1993; Retnawati, 2015; Russell, Goldberg, & O'conner, 2003; Stevenson & Gross, 1991; Tsai & Shin, 2012; Wang et al., 2007; Wang & Shin, 2009; Yaman & Çağıltay, 2010). In her comprehensive review, Paek (2005, p. 17) concludes that overall CBT and PBT "versions of traditional multiple-choice tests are comparable across grades and academic content."
Higgings et al. (2005) conducted a survey with 219 4th-grade students in an attempt to identify any probable score differences in reading comprehension between groups resulting from the test-mode effect; their research revealed no statistically significant differences. Similarly, in the study of Akdemir and Oğuz (2008), 47 prospective teachers in the departments of Primary School Teaching and Turkish Language and Literature took an achievement test, consisting of thirty questions, both on computer and on paper. At the end of the study, it was revealed that there was no statistical difference between the test-takers' scores in relation to the test administration mode. Hence, the researchers mentioned that "computer-based testing could be an alternative to paper-based testing" (p. 123). Hosseini, Abidin, and Baghdarnia (2014) compared a reading comprehension test with multiple-choice items administered on computer and on paper; at the end of the study, no significant difference was found.
Retnawati (2015) compared the scores of participants who took a paper-based Test of English Proficiency with those of participants who took the computer-based version, and the results revealed that scores in both exam modes were quite similar. Lastly, Logan (2015) examined students' performance differences across exam administration modes in a mathematics course. In total, 807 sixth-grade Singaporean students took a 24-item mathematics test and a paper folding test either on computer or on paper. The results showed no significant difference. In contrast, Choi et al. (2003) found that taking a listening test on computer offered an advantage for test-takers, since they obtained higher scores than on a paper-based listening test. Yağcı et al. (2011) carried out a similar study at a state university. This time the participants were 75 vocational school students in the department of business administration. In order to reveal any differences in academic success among participants, the exam was administered in two ways (CBT versus PBT), and the participants' scores were then compared. It was found that students who had taken the computer-assisted exam outperformed those who had taken the paper-based exam. Hensley (2015) carried out a study with 142 students in the department of mathematics at the University of Iowa with the aim of comparing students' scores on paper-based and computer-based tests; it was found that the scores from the two modes were not comparable, as there was a significant difference between them. A recent study by Hakim (2017) with 200 female students in Saudi Arabia whose English language proficiency was at B1 level showed that the two versions of the tests, CBT versus PBT, produced statistically significant differences. Although professional assessment standards attach great importance to the comparability of CBTs and PBTs, there has been little empirical research that examines the impact of technology on the two main aspects of assessment, validity and reliability (Al-Amri, 2008; Chapelle, 1998, 1999, 2001; Chapelle & Douglas, 2006). For example, in a recent study, Chua (2012) compared the reliabilities of CBTs and PBTs by using computer- and paper-based versions of the multiple-choice Yanpiaw Creative-Critical Styles test (YBRAINS) and the Testing Motivation Questionnaire (TMQ) with a five-point Likert scale. The findings revealed that the reliability values of the CBTs and PBTs were close to each other. However, Chua (2012) noted that the results might have been different if achievement tests had been used, since test-takers' motivation, desire to achieve high scores, and the context of the test might affect the scores. Dermo (2009) also carried out a study with 130 undergraduate students who took online tests. The research examined six perspectives: affective factors, validity, practical issues, reliability, security, and learning and teaching. According to the results, the participants regarded taking online tests as practical and secure, and both the validity and reliability of the online tests appeared appropriate and related to the curriculum. Al-Amri (2008) administered three tests to each participant, with each test taken once on computer and once on paper.
In order to determine the effect of the testing mode on reliability, he examined the internal consistency (Cronbach's alpha) of the CBTs and PBTs, and the results indicated that the internal reliability coefficients ranged between .57 and .70, not as high as expected. To check the concurrent validity of the tests, a correlational analysis was conducted, and the results indicated that each PBT significantly correlated with its computerized version. Overall, the test administration mode had no significant effect on the overall reliability and validity of the tests. In another study (Boo, 1997, cited in Al-Amri, 2008), the test administration mode likewise had no impact on the reliability of the tests. Utilizing an EFL test battery entitled the Test of English Proficiency developed by Seoul National University (TEPS), Choi et al. (2003) investigated the comparability between PBT and CBT based on content and construct validation. Although they did not focus on the measurement of course learning outcomes in higher education, their findings supported comparability between the CBT and PBT versions of the TEPS subtests in question (listening comprehension, grammar, vocabulary, and reading comprehension). Semerci and Bektaş (2005), on the other hand, conducted a survey on how to improve the validity of web-based tests. They collected data from four different state universities in Turkey (Anadolu, Sakarya, and Fırat Universities and METU) where web-based tests were being administered. The researchers sent emails to a total of 45 people at those universities to collect data for the study, and only 33 of them wrote back. After the data were analyzed, several ways to improve the validity of web-based tests were identified: digital identities such as fingerprint and voice control should be used; teachers should encourage learners to carry out projects and research; and mini-quizzes and video-conferencing can foster learning, so teachers should make use of them in their courses. In a similar vein, Delen (2015) focused on how to increase the validity and reliability of computer-assisted assessment. In that study, an optimum item response time for each question was shown on the screen while the participants were answering the exam items, and the findings revealed that if students were offered an optimum item response time, more valid and reliable tests would be achieved than with paper-based tests. Our review of the related literature indicates that although numerous studies have compared CBTs and PBTs in terms of mean scores, there is little research that specifically deals with the criteria of adequate reliability and accuracy of measurement. Wang and Kolen (2001) developed a framework of criteria for evaluating the comparability between CAT and PBT: (1) validity, (2) psychometric properties/reliability, and (3) statistical assumptions/test administration. We assume that these three criteria can also be used to evaluate the comparability between linear CBTs and PBTs.
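The reliability and validity checks described in these studies (internal consistency within each mode, plus the correlation between paper and computer scores) are straightforward to compute. The sketch below is a minimal illustration with invented 0/1 item responses; the sample size, item count, and data are assumptions, not values from Al-Amri (2008) or any other study cited above.

```python
import numpy as np
from scipy import stats

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a (respondents x items) score matrix."""
    item_scores = np.asarray(item_scores, dtype=float)
    n_items = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1).sum()
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (n_items / (n_items - 1)) * (1 - item_variances / total_variance)

# Hypothetical dichotomous (0/1) item responses from the same examinees on a
# paper-based and a computer-based form; values are invented for illustration.
rng = np.random.default_rng(1)
pbt_items = rng.integers(0, 2, size=(60, 30))
cbt_items = rng.integers(0, 2, size=(60, 30))

alpha_pbt = cronbach_alpha(pbt_items)
alpha_cbt = cronbach_alpha(cbt_items)

# Concurrent validity check: correlation between total scores on the two modes.
r, p = stats.pearsonr(pbt_items.sum(axis=1), cbt_items.sum(axis=1))

print(f"alpha (PBT) = {alpha_pbt:.2f}, alpha (CBT) = {alpha_cbt:.2f}, r = {r:.2f} (p = {p:.3f})")
```

With real data, the two alpha coefficients would be compared with each other and with conventional benchmarks, and the paper-computer correlation would be read as concurrent-validity evidence.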
1.4. Research questions

To the best of our knowledge, at a time when Turkish institutions of higher education are on the eve of considering the computerized administration of assessments, there is not a single study that deals with the comparability of computer- and paper-based tests in English language teacher education programs. The present research therefore grew out of a desire to learn whether the validity and reliability principles of assessment would be influenced by the test administration mode when pre-service English teachers took an achievement test for their pedagogical content knowledge course. Accordingly, the following research questions were formulated to guide the present study:

1. To what extent are the results of a paper-based test (PBT) comparable to those of its CBT version?
2. If the PBT in question has satisfied the criteria of adequate reliability and accuracy of measurement, can its CBT version be considered to have equal reliability and accuracy of measurement?

2. Method

The study follows a quantitative experimental model with a posttest-only design; no pretests were used, only posttests. After the participants had been randomly assigned to two groups, the control group took the achievement test in the traditional way while the experimental group took the same exam through a computer-assisted system. When the exam was over, both groups were administered a questionnaire adapted to gather background information about the participants and their attitudes toward computer-assisted assessment.

2.1. Participants

The participants in this study were a total of 100 student teachers enrolled in the Approaches to ELT course in the English language teaching (ELT) department at Hacettepe University. They had already been enrolled in three different sections of the course, taught by the same faculty member, before the study started. They were randomly assigned to the …

Methodological and Statistical Advances in the Consideration of Cultural Diversity in Assessment: A Critical Review of Group Classification and Measurement Invariance Testing

Kyunghee Han, Stephen M. Colarelli, and Nathan C. Weed
Central Michigan University

One of the most important considerations in psychological and educational assessment is the extent to which a test is free of bias and fair for groups with diverse backgrounds. Establishing measurement invariance (MI) of a test or items is a prerequisite for meaningful comparisons across groups, as it ensures that test items do not function differently across groups. Demonstration of MI is particularly important in assessment settings where test scores are used in decision making. In this review, we begin with an overview of test bias and fairness, followed by a discussion of issues involving group classification, focusing on categorizations of race/ethnicity and sex/gender. We then describe procedures used to establish MI, detailing steps in the implementation of multigroup confirmatory factor analysis, and discussing recent developments in alternative procedures for establishing MI, such as the alignment method and moderated nonlinear factor analysis, which accommodate reconceptualization of group categorizations. Lastly, we discuss a variety of important statistical and conceptual issues to be considered in conducting multigroup confirmatory factor analysis and related methods and conclude with some recommendations for applications of these procedures.

Public Significance Statement
This article highlights some important conceptual and statistical issues that researchers should consider in research involving MI to maximize the meaningfulness of their results. Additionally, it offers recommendations for conducting MI research with multigroup confirmatory factor analysis and related procedures.
Keywords: test bias and fairness, categorizations of race/ethnicity and sex/gender, measurement invariance, multigroup CFA

Supplemental materials: http://dx.doi.org/10.1037/pas0000731.supp

When psychological tests are used in diverse populations, it is assumed that a given test score represents the same level of the underlying construct across groups and predicts the same outcome score. Suppose that two hypothetical examinees, a middle-aged Mexican immigrant woman and a Jewish European American male college student, each produced the same score on a measure of depression. We would like to conclude that the examinees exhibit the same severity and breadth of depression symptoms and that their therapists would rate them similarly on relevant behavioral and symptom measures. If empirical evidence indicates otherwise, and such conclusions are not justified, scores on the measure are said to be biased. Although it has been defined variously, a representative definition refers to psychometric bias as "systematic error in estimation of a value." A biased test "is one that systematically overestimates or underestimates the value of the variable it is intended to assess" due to group membership, such as ethnicity or gender (Reynolds & Suzuki, 2013, p. 83). The "value of the variable it is intended to assess" can either be a "true score" (see S1 in the online supplemental materials) on the latent construct or a score on a specified criterion measure. The former application concerns what is sometimes termed measurement bias, in which the relationship between test scores and the latent attribute that these test scores measure varies for different groups (Borsboom, Romeijn, & Wicherts, 2008; Millsap, 1997), whereas the latter application concerns what is referred to as predictive bias, which entails systematic inaccuracies in the prediction of a criterion from a test depending upon group membership (Cleary, 1968; Millsap, 1997). Test bias should not be confused with test fairness. Although the two concepts have been used interchangeably at times (e.g., Hunter & Schmidt, 1976), test fairness entails a broader and more subjective evaluation of assessment outcomes from perspectives of social justice (Kline, 2013), whereas test bias is an empirical property of test scores, estimated statistically (Jensen, 1980).
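The distinction between measurement bias and predictive bias can be stated compactly. The formalization below is a standard one consistent with the sources cited above (e.g., Millsap, 1997); the notation is ours and is not quoted from the article.

```latex
% X = observed test score, \xi = latent attribute the test measures,
% Y = criterion score, g = group membership.
\begin{align*}
  \text{Measurement invariance:} \quad & P(X \mid \xi, g) = P(X \mid \xi) \quad \text{for all groups } g, \\
  \text{Predictive invariance:}  \quad & E(Y \mid X, g) = E(Y \mid X) \quad \text{for all groups } g.
\end{align*}
```

Measurement bias is a violation of the first condition; predictive bias (slope or intercept bias) is a violation of the second.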
Appraisals of test fairness include multifaceted aspects of the assessment process, lack of test bias being only one facet (American Educational Research Association, American Psychological Association [APA], & National Council on Measurement in Education, 2014; Society for Industrial Organizational Psychology, 2018; see S2 in the online supplemental materials). In the example above, the measure of depression may be unfair for the Mexican female client if an English language version of the measure was used without evaluating her English proficiency, if her score was derived using American norms only, if computerized administration was used, or if use of the test leads her to be less likely than members of other groups to be hired for a job. Although test bias is not a necessary condition for test unfairness to exist, it may be a sufficient condition (Kline, 2013). Accordingly, it is especially important to evaluate whether test scores are biased against vulnerable groups. The evaluation of test bias and test fairness each entails a comparison of one group of people with another. While asking the question, "Is a test biased?" we are also implicitly asking "against or for which group?" Similarly, if we are concerned about using a test fairly, we must ask: are the outcomes based on the results of the test apportioned fairly to groups of people who have taken the test? Thus, the categorization of people into distinct groups is a sine qua non of many aspects of psychological assessment research. Racial/ethnic and sex/gender categories are prominent features of the social, cultural, and political landscapes in the United States (e.g., Helms, 2006; Hyde, Bigler, Joel, Tate, & van Anders, 2019; Jensen, 1980; Newman, Hanges, & Outtz, 2007), and have therefore been the most commonly studied group variables in bias research (e.g., Warne, Yoon, & Price, 2014). Most of the initial research on and debates about test bias and fairness in the United States stemmed from political movements addressing race and sex discrimination (e.g., Sackett & Wilk, 1994). In service of pressing research on questions of discrimination and economic inequality, it thus became commonplace among psychologists and social scientists to categorize people crudely into groups (based primarily on race, ethnicity, and sex/gender) without much thought to the meaning and validity of those categorizations (e.g., Hyde et al., 2019; Yee, 1983; Yee, Fairchild, Weizmann, & Wyatt, 1993). This has changed somewhat over the past two decades as scholarship by psychologists and others has increasingly focused on nuances of identity, multiculturalism, intersectionality, and multiple positionalities (Cole, 2009; Song, 2017). This scholarship has emphasized that racial, ethnic, and gender classifications can be complex, ambiguous, and debatable—and that identities are often self-constructed and can be fluid (Helms, 2006; Hyde et al., 2019). The first goal of this review, therefore, is to overview contemporary issues involving race/ethnicity and sex/gender classifications in bias research and to describe alternative approaches to the measurement of these variables. The psychometric methods used to examine test bias usually depend on the definition of test bias operating for a given application.
Evaluating predictive bias (i.e., establishing predictive invariance) often involves regressing total scores from a criterion measure onto total scores on the measure of interest, and comparing regression slopes and intercepts across groups (Cleary, 1968). Evaluating measurement bias (i.e., establishing measurement invariance [MI]) often necessitates more advanced quantitative methods, such as confirmatory factor analysis (CFA) or methods deriving from item response theory, to compare the properties of item scores and scores on latent variables across different groups. Multigroup confirmatory factor analysis (MGCFA) has been one of the most commonly used techniques to examine MI (Davidov, Meuleman, Cieciuch, Schmidt, & Billiet, 2014) because it provides a comprehensive framework for evaluating different forms of MI. The second goal of this review is to provide a broad overview of MGCFA and related procedures and their relevance to psychological assessment. Although MGCFA is a well-established procedure in the evaluation of MI, it has limitations. MGCFA is not an optimal method for conducting MI tests when many groups are involved. Moreover, the grouping variable in MGCFA must be categorical, and therefore does not permit MI testing with continuous grouping variables (e.g., age). As modern research questions may require MI testing across many groups, and with continuous reconceptualizations of some of the grouping variables (e.g., gender), more flexible techniques are needed. Our third goal, therefore, is to describe two recent alternative methods for MI testing, the alignment method and moderated nonlinear factor analysis, that aim to overcome these limitations. We conclude the review with a discussion of some important statistical and conceptual issues to be considered when evaluating MI, and include a list of recommended practices.

Group Classifications Used in Bias Research

Racial and Ethnic Classifications

Race and ethnicity (see S3 in the online supplemental materials) are conceptually vague and empirically complex social constructs that have been examined by numerous researchers across many disciplines (Betancourt & López, 1993; Helms, Jernigan, & Mascher, 2005; Yee et al., 1993). Consider race. As a biological concept, it is essentially meaningless. In most cases, there is more genetic variation within so-called racial groups than between racial groups (Witherspoon et al., 2007). Even if we allow race to be defined by a combination of specific morphological features and ancestry, few "racial" populations are pure (Gibbons, 2017). Most are mixed—like real numbers, with infinite gradations. For example, although many African Americans trace their ancestry to West Africa, about 20% to 30% of their genetic heritage is from European and American Indian ancestors (Parra et al., 1998), and racial admixture continues as the frequency of interracial marriages increases (Rosenfeld, 2006; U.S. Census Bureau, 2008). Even if one were to accept race as a combination of biological features and cultural and social identities (shared cultural heritage, hardships, and discrimination), there is the problem of degree. For example, while many Black Americans share social and cultural identities based on roots in American slavery and racial discrimination, not all do, such as recent Black immigrants from the Caribbean. Racial and ethnic classifications are often conflated.
In psychological research, "Asian" is commonly used both as a cultural (Nisbett, Peng, Choi, & Norenzayan, 2001) and a racial category (Rushton, 1994). Yet it is a catch-all term based primarily on geography. It typically refers to people from (or whose ancestors are from) South, Southeast, and Eastern Asia. The term Hispanic often conflates linguistic, cultural, and sometimes even morphological features (Humes, Jones, & Ramirez, 2010). In public policy, mixtures of racial (or ethnic) background have only recently begun to be addressed. The U.S. Census, for example, did not include a multiracial category until 2000 (Nobles, 2000). We are only beginning to see assessment studies that parse people from traditional broad groupings into smaller, more meaningful and homogeneous groups. In one of the few studies that identified different types of Asians, Appel, Huang, Ai, and Lin (2011) found significant (and sometimes major) differences in physical, behavioral, and mental health problems among Chinese, Vietnamese, and Filipina women in the U.S. More recently, Talhelm et al. (2014) found important differences in culture and thought patterns within only one Asian country, China. People in northern China were significantly more individualistic than those in southern China, who were more collectivistic. With current and historical farming practices as their theoretical centerpiece, they examined farming practices as causal factors. In northern China wheat has been farmed as a staple crop for millennia, whereas in southern China rice has been (and is) the staple crop. Talhelm et al. argued that the farming practices required by these two crops required different types of social organization that, over time, influenced cultural values and cognition. The work by Talhelm and colleagues is important because it is one of the first studies to show—along with a powerful theoretical rationale—that there are important cultural differences between people from what has typically been thought of as a relatively homogeneous racial and cultural group. In another seminal article, Gelfand and colleagues (2011) examined the looseness-tightness dimension of cultures in 33 countries. This dimension reflects the strength of norms and the tolerance of deviant behavior. Loose cultures have weaker norms and are more tolerant of deviant behavior. While there was substantial variation between countries, there was still considerable variation among countries typically considered "Asian." Hong Kong was the loosest (6.3), while Malaysia was the tightest (11.8), with the People's Republic of China (7.9), Japan (8.6), South Korea (10), and Singapore (10.4) in between. To say that all Asian countries are culturally similar is untenable when, for example, Malaysian culture is 88% tighter than Hong Kong culture.

Gender Classifications

Binary categories.
Gender (see S4 in the online supplemental materials) differences or similarities on psychological constructs have been a widely researched topic (e.g., Feingold, 1994; Hyde, 2005) since the 1970s (Eagly & Riger, 2014), with many studies assuming the existence of clear qualitative and quantitative differences between genders (Brizendine, 2006; Ruigrok et al., 2014). These include numerous studies examining bias or MI across gender (e.g., Baker & Mason, 2010; Linn & Kessel, 2010) in which researchers have tended to employ a binary categorization of gender. However, the binary gender categorization and the presumption of qualitative gender difference have recently been challenged (APA, 2015, 2017; Richards et al., 2016; Richards, Bouman, & Barker, 2017; Schellenberg & Kaiser, 2018). In a recent article, Hyde and colleagues (2019) reviewed empirical findings from five disciplines and challenged the legitimacy of binary gender classification in each: (a) neuroscience (sexual dimorphism of the human brain), (b) behavioral neuroendocrinology (the argument for genetically fixed, nonoverlapping, sexually dimorphic hormonal systems), (c) research on psychological variables (the inference of clear gender differences on psychological constructs), (d) research with transgender and nonbinary individuals (the assumption that gender identities and experiences are consistent with gender assigned at birth), and (e) research from developmental psychology (arguing that binary gender categories are culturally universal and unmalleable).

Nonbinary identities.

There is a wide variety of nonbinary gender identities (APA, 2015; Hyde et al., 2019; Richards et al., 2016, 2017). Intersex individuals have physical characteristics outside the typical binary male-female categories, although most still identify their gender within the binary system (Richards & Barker, 2013). More common are people who are not physiologically intersex but who have nonbinary gender identities. Although these terms are often subsumed within the umbrella terms of nonbinary or genderqueer identities, various more specific labels have been used: androgynous, mixed gender, or pangender to indicate incorporating aspects of both male and female, but having a fixed identity; bigender or gender fluid to indicate movement between genders in a fluid way; trigender to indicate moving between more than two genders; third gender or other gender to identify a specific additional gender; gender queer to challenge the binary gender system; agender, gender neutral, genderless, nongendered, or neuter to indicate no gender (Richards et al., 2016). The frequency of people with nonbinary gender identities, while small compared to the population at large, is not trivial. For example, one Dutch study found that 4.6% of people assigned male identities at birth and 3.2% assigned female identities at birth have ambivalent gender identities (Kuyper & Wijsen, 2014). Others—people who regard themselves as asexual—have what might be called no gender identity (Carrugan, 2015). The trend toward acceptance of diversity in gender classification has been reflected in various professional and societal contexts. Within the context of mental health care, the Diagnostic and Statistical Manual, 5th edition (American Psychiatric Association, 2013) removed the Diagnostic and Statistical Manual–IV–TR (American Psychiatric Association, 2000) diagnosis of gender identity disorder and recognized nonbinary genders within the diagnostic taxonomy.
People with nonbinary gender identities have become more politically active, with visible results in recent years both in official documentation and in media coverage (Scelfo, 2015). For example, New Zealand passport holders can claim one of three gender identities: male, female, or X (other). New York, New York, has recently joined Oregon, California, Washington, and New Jersey in offering a nonbinary gender marker (X) on birth certificates for residents who do not identify as male or female (Hafner, 2019). In an effort to foster an environment of inclusiveness and support students' preferred form of self-identification, many universities in the United States allow students to choose preferred gender pronouns (https://www.npr.org/2015/11/08/455202525/more-universities-move-to-include-gender-neutral-pronouns). This movement has also been reflected in product development and marketing; numerous vendors specialize in gender neutral clothing or other items (e.g., https://www.lgbtqnation.com/2017/07/target-hits-back-binary-gender-neutral-clothing-line/; http://www.foxnews.com/us/2015/08/13/target-going-gender-neutral-in-some-sections.html).

Emerging Recommendations Regarding Race/Ethnicity and Sex/Gender Categorizations

Results of studies of test bias rest to a considerable degree on how grouping variables such as race/ethnicity and sex/gender are operationalized. Cultural researchers have recently proposed a number of suggestions for reconceptualizing and operationalizing these grouping variables. Hyde et al. (2019) and other researchers (e.g., Bittner & Goodyear-Grant, 2017; Schellenberg & Kaiser, 2018; Tate, Ledbetter, & Youssef, 2013) warn about the costs of binary sex/gender categorization in research and provide helpful suggestions for reconceptualizing and measuring sex/gender: (a) providing multiple categories of gender identity within available response options ("female," "male," "transgender female," "transgender male," "genderqueer" [click for more options: "agender," "bigender," etc.], "intersex," and other [specify]); (b) asking about both birth-assigned and self-assigned gender/sex identities; (c) using an open-ended response format (e.g., "what is your gender?"); (d) treating gender/sex constructs as multidimensional (use of multiple measures of gender/sex identities, stereotypes, or behaviors), dynamic, and continuous (e.g., "In the past 12 months, have you thought of yourself as a man?" "In the past 12 months, have you thought of yourself as a woman?" "How would you rate yourself on the continuum from 0 to 100 regarding 'maleness'?" "How would you rate yourself on the continuum from 0 to 100 regarding 'femaleness'?"); and (e) asking about gender identity at the end of a study so that it does not influence responding. Numerous researchers (e.g., Cole, 2009; Helms et al., 2005; Yee, 1983) have proposed ways to reconceptualize race/ethnicity in research.
When researchers separate people into groups and compare them, careful attention must be given to (a) what the categories mean, (b) how to ensure their homogeneity, (c) how the categories are theoretically related to the substantive constructs under study, (d) treating race/ethnicity constructs as multidimensional (use of multiple measures of race/ethnicity identities, stereotypes, socioeconomic status [SES], or racism), and (e) intersectional views of the categories (see below). Racial categorization should be avoided in research without clear conceptual reasons. It is clear that people fall into multiple grouping categories—each person is simultaneously a member of a gender, race, ethnicity, class, and sexual orientation (APA, 2017). The concept of "intersectionality" has been developed by feminist and critical race theorists (Cole, 2009; Eagly & Riger, 2014) to encourage researchers to understand individuals from "a vast array of cultural, structural, sociobiological, economic, and social contexts by which individuals are shaped and with which they identify" (APA, 2017, p. 19). How might we conceptualize and examine test bias across combinations of categories? Studies of psychometric bias are typically conducted from the perspective of one group membership at a time, mainly due to methodological complications in implementing multiple group categories into testable models. Research teams such as Corral and Landrine (2010) and Else-Quest and Hyde (2016) have provided recommendations for incorporating intersectional approaches to numerous facets of research (i.e., theory, design, sampling techniques, measurement, data analytic strategies, and interpretation and framing). More work building on these recommendations will be needed to meet the research challenges of intersectionality in the context of bias testing.

Testing Bias

As mentioned earlier, the statistical methods used to examine test bias usually depend on the definition of test bias operating for a given application (see S5 in the online supplemental materials). If test scores can be used to predict some future outcome (a criterion), they are said to demonstrate predictive validity (Cleary, 1968). In general, a test is considered not biased (demonstrating "predictive invariance") if its scores predict future outcomes equally well for individuals from different groups. However, if test scores are better at predicting outcomes for some groups than for others, the test scores are said to reflect differential predictive validity or slope bias (Camilli, 2006). If test scores systematically overpredict or underpredict criterion scores for one group relative to another, the scores are said to reflect intercept bias (see S6 in the online supplemental materials). Relative to predictive invariance, MI has been investigated more frequently and more rigorously in recent years due to advances in statistical procedures, including the development of software that allows researchers to carry out MI testing more easily. Researchers have examined how predictive invariance is related to MI and argued that evidence for one form of invariance is not evidence in support of the other, but may in some cases serve as evidence against the other (Borsboom et al., 2008; see Millsap, 1995, 1997, 2007, for a mathematical proof of this argument).
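In practice, the slope- and intercept-bias checks described above are often run as a moderated regression of the criterion on the test score, group membership, and their interaction. The sketch below is a minimal illustration on invented data; the variable names, group labels, and effect sizes are assumptions, not results from any study discussed here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented example data: test scores, a criterion (e.g., a supervisor rating),
# and a two-level grouping variable. None of this comes from the article.
rng = np.random.default_rng(7)
n = 200
group = rng.choice(["A", "B"], size=n)
test_score = rng.normal(50, 10, size=n)
criterion = 0.6 * test_score + rng.normal(0, 8, size=n)
df = pd.DataFrame({"criterion": criterion, "test_score": test_score, "group": group})

# Moderated regression: a significant test_score:group term indicates slope bias;
# a significant group term (with the interaction retained) suggests intercept bias.
model = smf.ols("criterion ~ test_score * group", data=df).fit()
print(model.summary())
```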
Investigating both forms of invariance simultaneously in the same study is ideal, but seldom achieved (Millsap, 2007), with some exceptions (Culhane, Morera, Watson, & Millsap, 2009; Wright, Kutschenko, Bush, Hannum, & Braddy, 2015). Researchers tend to prefer to evaluate one form over the other depending upon the context of assessment. For example, in assessment research related to personnel selection, evaluating predictive invariance is favored over evaluating MI (Borsboom, 2006; Society for Industrial Organizational Psychology, 2018). However, even in this context, item-level MI analyses are recommended when conducting cross-cultural research involving linguistically different populations. The ease of evaluating predictive bias varies greatly depending upon the psychometric integrity of the criterion variable in question. It is possible for differential predictive validity to reflect statistical artifact due to measurement error in scores on the criterion and predictor (Kane & Mroch, 2010; Warne et al., 2014). Moreover, other confounding variables (e.g., the time gap between obtaining test and criterion data) need to be controlled across groups when examining predictive invariance. These complications (see Borsboom et al., 2008, for a comprehensive discussion) are especially problematic in cross-cultural research involving translated measures. Therefore, Borsboom and colleagues (2008) argued that psychologists should favor MI testing over predictive invariance testing in defining and measuring bias. For the rest of this article, we focus on procedures associated with MI.

Testing Measurement Invariance: Multigroup Confirmatory Factor Analysis

The popularity of research involving MI has increased exponentially in recent years. An informal PsycINFO database search of titles, keywords, and abstracts associated with peer-reviewed journal articles from 1980 to January 31, 2018, on group and measurement and invariance or equivalence located 55 unique articles published from 1980 to 1989, 91 articles published from 1990 to 1999, 459 articles published from 2000 to 2009, and 1,504 articles published from 2010 to 2017. The increased interest in this topic may reflect globalization of the social sciences, increased emphasis on fair assessment for diverse populations, and statistical advances permitting rigorous testing of MI (Davidov et al., 2014; Sass, 2011; Vandenberg, 2002). Since its development by Jöreskog (1971) and Sörbom (1974) for use with continuous indicators, MI testing via MGCFA …
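The excerpt breaks off just as MGCFA is introduced. As a brief, hedged illustration of the kind of model it refers to (the invariance levels named below are standard in the MI literature and are not quoted from this article), the nested models usually compared can be written as:

```latex
% Illustrative MGCFA measurement model and nested invariance constraints.
% x_{ig}: observed item scores for person i in group g; \xi_{ig}: latent factor(s);
% \tau_g: item intercepts; \Lambda_g: factor loadings; \delta_{ig}: residuals.
\begin{align*}
  x_{ig} &= \tau_g + \Lambda_g \, \xi_{ig} + \delta_{ig} \\
  \text{Configural invariance:} \quad & \text{same factor pattern in every group} \\
  \text{Metric (weak) invariance:} \quad & \Lambda_1 = \Lambda_2 = \dots = \Lambda_G \\
  \text{Scalar (strong) invariance:} \quad & \tau_1 = \tau_2 = \dots = \tau_G
    \quad \text{(in addition to equal loadings)}
\end{align*}
```

Each more constrained model is compared with the previous one (e.g., via chi-square difference tests or changes in fit indices); equality of loadings and intercepts is generally treated as the prerequisite for comparing latent means across groups.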
Public Personnel Management, 2021, Vol. 50(2), 232–257. © The Author(s) 2020. DOI: 10.1177/0091026020935582

A Critical Examination of Content Validity Evidence and Personality Testing for Employee Selection

David M. Fisher (The University of Tulsa), Christopher R. Milane (Qualtrics), Sarah Sullivan (Rice University), and Robert P. Tett (The University of Tulsa)

Abstract
Prominent standards/guidelines concerning test validation provide contradictory information about whether content-based evidence should be used as a means of validating personality test inferences for employee selection. This unresolved discrepancy is problematic considering the prevalence of personality testing, the importance of gathering sound validity evidence, and the deference given to these standards/guidelines in contemporary employee selection practice. As a consequence, test users and practitioners are likely to be reticent or uncertain about gathering content-based evidence for personality measures, which, in turn, may cause such evidence to be underutilized when personality testing is of interest. The current investigation critically examines whether (and how) content validity evidence should be used for measures of personality in relation to employee selection. The ensuing discussion, which is especially relevant in highly litigious contexts such as personnel selection in the public sector, sheds new light on test validation practices.

Keywords: test validation, content validity, personality testing, employee selection

An essential consideration when using any test or measurement tool for employee selection is gathering and evaluating relevant validity evidence. In the contemporary employee selection context, validity evidence is generally understood to mean evidence that substantiates inferences made from test scores. Various sources provide standards and guidelines for gathering validity evidence, including the Uniform Guidelines on Employee Selection Procedures (Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, & Department of Justice, 1978; hereafter, Uniform Guidelines, 1978), the Principles for the Validation and Use of Personnel Selection Procedures (Society for Industrial and Organizational Psychology [SIOP], 2003; hereafter, SIOP Principles, 2003), and the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999/2014; hereafter, Joint Standards, 1999/2014), as well as the academic literature (e.g., Aguinis et al., 2001). Having such a variety of sources available is beneficial, but challenges arise when the various sources provide ambiguous or contradictory information. Such ambiguity can be particularly troublesome in highly litigious contexts, such as the public sector, where adherence to regulations governing selection is of paramount importance.
The current investigation attempts to shed light on one such area of ambiguity—whether evidence based on test content should be used as a means of validating personality test inferences for employee selection. Rothstein and Goffin (2006) noted, "It has been estimated that personality testing is a $400 million industry in the United States and it is growing at an average of 10% a year" (Hsu, 2004, p. 156). Given this reality, it is important to carefully consider appropriate validation procedures for such measures. However, the various sources mentioned above present conflicting directions on this issue, specifically in relation to content-based validity evidence. On one hand, evidence based on test content is one of five potential sources of validity evidence described by the Joint Standards (1999/2014), which is similarly endorsed by the SIOP Principles (2003). This form of evidence has further been suggested by some to be particularly relevant to personality tests (e.g., Murphy et al., 2009; O'Neill et al., 2009), especially under challenging validation conditions, such as small sample sizes, test security concerns, or lack of a reliable criterion measure (Landy, 1986; Tan, 2009; Thornton, 2009). On the other hand, the Uniform Guidelines (1978) assert that ". . . a content strategy is not appropriate for demonstrating the validity of selection procedures which purport to measure traits or constructs, such as intelligence, aptitude, personality, commonsense, judgment, leadership, and spatial ability [emphasis added]" (Section 14.C.1). Other sources similarly convey reticence toward content validity for measures of traits or constructs (e.g., Goldstein et al., 1993; Lawshe, 1985; Wollack, 1976). Thus, there appears to be conflicting guidance on the use of content validity evidence to support personality measures. In light of this discrepancy, the current investigation offers a critical examination of content validity evidence and personality testing for employee selection. Such an investigation is valuable for several reasons. First, an important consequence of the inconsistency noted above is that content-based evidence may be overlooked as a valuable approach to validation when personality testing is of interest. Evidence for this can be seen in the fact that other approaches such as criterion-related validation are sometimes viewed as the only option for personality measures (Biddle, 2011). Similarly, prominent writings on personality testing in the workplace (e.g., Morgeson et al., 2007b; O'Neill et al., 2013; Ones et al., 2007; Rothstein & Goffin, 2006; Tett & Christiansen, 2007) have tended to ignore the applicability of content validation to personality measures. Furthermore, considering the deference given to the various standards and guidelines in contemporary employee selection practice (Schmit & Ryan, 2013), those concerned about strict adherence to such standards/guidelines are likely to be reticent or uncertain about gathering content-based evidence for personality measures—in no small part due to conflicting or ambiguous recommendations. The above circumstances tend to relegate content-based evidence to being seen as less desirable or otherwise viewed as an afterthought. In turn, this represents a missed opportunity for valuable insight into the use of personality measures.
Second, the neglect or underutilization of content-based evidence is, in many ways, antithetical to the broader goal of developing a theory-based and scientifically grounded understanding of tests and measures used for employee selection (Binning & Barrett, 1989). For example, as elaborated below, there are various situations in which content-based evidence may be preferable to criterion-based evidence, not the least of which includes an insufficient sample size for a criterion-based investigation (McDaniel et al., 2011). Similarly, an exclusive focus on empirical prediction ignores the importance of underlying theory, which is critical for advancing employee selection research. Of relevance, the examination of content validity evidence forces one to carefully consider the correspondence between selection measures and underlying construct domains, as informed by theoretical considerations. Evidence for the value of content validity can also be found in trait activation theory (Tett & Burnett, 2003; Tett et al., 2013), which highlights the importance of a clear conceptual linkage between the content of personality traits/constructs and the job domain in question. Thus, content validity evidence should be of primary importance for personality test validation. Third, it is useful to acknowledge that the prohibition against content validity evidence in relation to personality measures noted in the Uniform Guidelines (1978) appears to be at odds with contemporary thinking on validation (Joint Standards, 1999/2014). The focal passage quoted above from the Uniform Guidelines has been described as being ". . . as destructive to the interface of psychological theory and practice as any that might have been conceived" (Landy, 1986, p. 1189). Although there have been well-argued critiques of the Uniform Guidelines (e.g., McDaniel et al., 2011), in addition to thoughtful elaboration of issues surrounding content validity (e.g., Binning & LeBreton, 2009), a direct attempt at resolving the noted contradiction remains conspicuously absent from the literature. This contradiction, in conjunction with the absence of a satisfactory explanation, is problematic given the importance of gathering sound validity evidence pertaining to psychological test use. As such, a critical examination of this issue is warranted. Finally, the findings of the current investigation are likely to have broad applicability. Namely, although focused on personality testing, the discussion below is relevant to measures of other commonly assessed attributes classified under the Uniform Guidelines (1978) as "traits or constructs" (Section 14.C.1). Similarly, while we address the Uniform Guidelines—which some argue are outdated (e.g., Jeanneret & Zedeck, 2010) and further limited by their applicability to employee selection in the United States—we believe the value of this discussion extends far beyond these guidelines. It is important to carefully consider appropriate validation strategies in all circumstances where psychological tests are used. Hence, the discussion presented herein is likely to be of relevance for content-based validation efforts in other areas beyond employee selection in the United States (e.g., educational testing, clinical practice, international employee selection efforts). Following a brief overview of validity and content-based validation, our investigation is organized around three fundamental questions.
Question 1 asks whether current standards and guidelines support the use of content validity evidence for validation of personality test inferences in an employee selection context. Based on the concerns raised above, a preliminary answer to this question is that it is unclear. Question 2 then asks about the underlying bases of the inconsistency. Building on the identified causes of disagreement, Question 3 asks how one might actually gather evidence based on test content for personality measures. Ultimately, our goal in this effort is to reduce ambiguity and promote clarity regarding content-based validation of personality measures.

Overview of Validity and Evidence Based on Test Content

Broadly speaking, validity in measurement refers to how well an assessment device measures what it is supposed to (Schmitt, 2006). The focus of measurement is typically described as a construct (Joint Standards, 1999/2014), which represents a latent attribute on which individuals can vary (e.g., cognitive ability, diligence, interpersonal skill, knowledge, the capacity to complete a given task). Importantly, a person's level or relative standing with regard to the construct of interest is inferred from the test scores (SIOP Principles, 2003). As such, the notion of validity addresses the simple yet fundamental issue of whether test scores actually reflect the attribute or construct that the test is intended to measure. However, this succinct characterization of validity also belies the true complexity of this topic (Furr & Bacharach, 2014). Two particular complexities bear discussion in light of our current aims. First, contemporary thinking holds that validity is not the property of a test per se, but rather of the inferences made from test scores (Binning & Barrett, 1989; Furr & Bacharach, 2014; Joint Standards, 1999/2014; Landy, 1986; SIOP Principles, 2003). The value of this approach can be seen when the same test is used for two different purposes—for example, when an interpersonal skills test developed for the selection of sales personnel is used for hiring both sales representatives and accountants. Notably, the test itself does not change, but the inferences made from the test scores regarding the job performance potential of the applicants may be more or less valid given the focal job in question. In accord with this perspective, the Joint Standards (1999/2014) describe validity as "the degree to which evidence and theory support the interpretations of test scores for proposed uses of the test" (p. 11). Inherent in this view is the idea that validity is difficult to fully assess without a clear explication of the intended interpretation of scores and corresponding purpose of testing. Thus, substantiating relevant inferences in terms of the intended purpose of the test is of primary concern in the contemporary view of validity. Second, validity has come to be understood as a unitary concept, as compared with the dated notion of distinct types of validity (Binning & Barrett, 1989; Furr & Bacharach, 2014; Joint Standards, 1999/2014; Landy, 1986; SIOP Principles, 2003). The older trinitarian view (Guion, 1980) posits three different types of validity, including criterion-related, content, and construct validity, each relevant for different test applications (Lawshe, 1985).
By contrast, the more recent unitarian perspective (Landy, 1986) emphasizes that all measurement attempts are ultimately about assessing a target construct, and validation entails the collection of evidence to support the argument that test scores actually reflect the construct (and that the construct is relevant to the intended use of the test). Consistent with this latter perspective, the Joint Standards (1999/2014) espouse a unitary view of validity and identify five sources of validity evidence, including evidence based on test content, response processes, internal structure, relations to other variables, and consequences of testing. In summary, the contemporary view of validity suggests that measurement efforts ultimately implicate constructs, and different sources of evidence can be marshaled to substantiate the validity of inferences based on test scores. Drawing on the above discussion, evidence based on test content represents one of several potential sources of evidence for validity judgments. The collection of content-based evidence has become well established as an important and viable validation strategy, as can be seen in the common discussion and endorsement of content validity in the academic literature (e.g., Aguinis et al., 2001; Binning & Barrett, 1989; Furr & Bacharach, 2014; Haynes et al., 1995; Landy, 1986) as well as in legal, professional, and technical standards or guidelines (e.g., Joint Standards, 1999/2014; SIOP Principles, 2003; Uniform Guidelines, 1978). The specific manner in which evidence based on test content can substantiate the validity of test score inferences is via an informed and judicious examination of the match between the content of an assessment tool (e.g., test instructions, item wording, response format) and the target construct in light of the assessment purpose (Haynes et al., 1995). For the sake of simplicity and ease of exposition, throughout this article, we use various terms interchangeably to represent the concept of evidence based on test content, such as content validity evidence, content validation strategy, content-based strategy, or simply content validity. However, each reference to this concept is intended to reflect contemporary thinking regarding validity as described above—specifically, content validity evidence is not a separate "type" of validity but rather a category of evidence that can be used to substantiate the validity of inferences regarding test scores.

Do Current Standards Support Content Validity for Personality?

Having introduced the concepts of validity and evidence based on test content, we now turn to our primary purpose of discussing whether a content validation strategy should be used as a means of validating personality test inferences for employee selection purposes. In doing so, a preliminary question becomes whether current standards and guidelines support this practice. The following four sources/areas are considered: (a) the Uniform Guidelines (1978), (b) the SIOP Principles (2003), (c) the Joint Standards (1999/2014), and (d) a general review of relevant academic literature. A summary of information derived from these sources is shown in Table 1.

The Uniform Guidelines (1978)

The Uniform Guidelines (1978) are federally endorsed standards pertaining to employee selection procedures, which were jointly developed by the Equal Employment Opportunity Commission, the Civil Service Commission, the Department of Labor, and the Department of Justice in the United States.
Regarding content validation, the guidelines state that,

Evidence of the validity of a test or other selection procedure by a content validity study should consist of data showing that the content of the selection procedure is representative of important aspects of performance on the job for which the candidates are to be evaluated. (Section 5.B)

The guidelines go on to describe specific technical standards and requirements for content validity studies. For example, a content validity study should include a review of information about the job under consideration (Section 14.A; Section 14.C.2). Furthermore, when the selection procedure focuses on work tasks or behaviors, it must be shown that the selection procedure includes a representative sample of on-the-job behaviors or work products (Section 14.C.1; Section 14.C.4). Conversely, under certain circumstances, the guidelines also permit content validation where the selection procedure focuses on worker requirements or attributes, including knowledge, skills, or abilities (KSAs). In such cases, beyond showing that the selection procedure reflects a representative sample of the implicated KSA, it must additionally be documented that the KSA is needed to perform important work tasks (Section 14.C.1; Section 14.C.4), and the KSA must be operationally defined in terms of observable work behaviors (Section 14.C.4).

The above notwithstanding, the Uniform Guidelines (1978) explicitly prohibit content validity for tests focusing on traits or constructs, including personality (Section 14.C.1). The logic underlying this restriction appears to be based on the seemingly reasonable notion that content-based validation becomes increasingly difficult as the focus of the selection test is farther removed from actual work behaviors (Section 14.C.4; Landy, 1986; Lawshe, 1985). This logic was confirmed in a subsequent "Questions and Answers" document, which states that,

The Guidelines emphasize the importance of a close approximation between the content of the selection procedure and the observable behaviors or products of the job, so as to minimize the inferential leap between performance on the selection procedure and job performance [emphasis added]. (See http://www.uniformguidelines.com/questionandanswers.html)

Table 1. Review of Various Sources Regarding Content Validity and Personality Testing.

Uniform Guidelines (1978)
• Description of content validity: "Evidence of the validity of a test or other selection procedure by a content validity study should consist of data showing that the content of the selection procedure is representative of important aspects of performance on the job for which the candidates are to be evaluated" (Section 5.B).
• Position on personality measures: Explicit prohibition of the use of content validity for tests that focus on traits or constructs, such as personality: ". . . a content strategy is not appropriate for demonstrating the validity of selection procedures which purport to measure traits or constructs, such as intelligence, aptitude, personality, commonsense, judgment, leadership, and spatial ability" (Section 14.C.1).

SIOP Principles (2003)
• Description of content validity: "Evidence for validity based on content typically consists of a demonstration of a strong linkage between the content of the selection procedure and important work behaviors, activities, worker requirements, or outcomes on the job" (p. 21).
• Position on personality measures: Approval of a content validity approach for personality measures can be inferred from (1) the absence of an explicit prohibition against the use of content validity evidence for tests that focus on traits or constructs and (2) the stated scope of applicability for content-based evidence, which includes tests that focus on knowledge, skills, abilities, and other personal characteristics.

Joint Standards (1999/2014)
• Description of content validity: "Important validity evidence can be obtained from an analysis of the relationship between the content of a test and the construct it is intended to measure" (p. 14).
• Position on personality measures: Approval of a content validity approach for personality measures can be inferred from (1) the absence of an explicit prohibition against the use of content validity evidence for tests that focus on traits or constructs, (2) the explicit description of content validity as pertaining to "the relationship between the content of a test and the construct it is intended to measure" (p. 14), and (3) the broad definition of the term construct (see p. 217), which makes it clear that personality variables fall under the definition of a construct.

General review of academic literature
• Description of content validity: Most, if not all, descriptions of content validity found in the literature embody the core notion of documenting the linkage between the content of a test and a particular domain that represents the target of measurement and/or purpose of testing (Haynes et al., 1995).
• Position on personality measures: The sources that specifically discuss this issue collectively indicate mixed opinions; while some authors have expressed reticence toward the use of content-based evidence for measures of personality (e.g., Goldstein et al., 1993; Lawshe, 1985; Wollack, 1976), others consider this restriction to be problematic (e.g., Landy, 1986; McDaniel et al., 2011) or view content validity as particularly relevant to personality testing (e.g., Murphy et al., 2009; O'Neill et al., 2009).

Note. SIOP = Society for Industrial and Organizational Psychology.

Interestingly, in an apparent application of this logic, the guidelines permit content validation for selection procedures focusing on KSAs (as noted in the preceding paragraph). In such cases, the inferential leap necessary to link KSAs to job performance is ostensibly greater than if the selection procedures were to focus directly on work behaviors, which explains why the guidelines include additional requirements related to the content validation of tests focusing on these worker attributes (see Sections 14.C.1 and 14.C.4). Presumably, these additional requirements serve to bridge the larger inferential leap made when the test does not directly focus on work behaviors. Thus, the Uniform Guidelines do not limit the use of content validity to actual samples of work behavior, but additional evidence is needed to help bridge the larger inferential leap made when selection tests target worker attributes (i.e., KSAs); yet this same reasoning is not extended to what the guidelines characterize as traits or constructs.

The SIOP Principles (2003)

The SIOP Principles (2003) embody the formal pronouncements of the Society for Industrial and Organizational Psychology pertaining to appropriate validation and use of employee selection procedures. For content validation, the principles state that, "Evidence for validity based on content typically consists of a demonstration of a strong linkage between the content of the selection procedure and important work behaviors, activities, worker requirements, or outcomes on the job" (p. 21).
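The "strong linkage" evidence described in these definitions is usually documented through subject matter expert (SME) judgments collected during job analysis. One well-known way to summarize such judgments is Lawshe's content validity ratio (CVR), computed for each item from the number of SMEs who rate its content as essential to the job. The brief sketch below is purely illustrative and is not drawn from the article or from any of the standards discussed here; the panel size, items, and rating counts are hypothetical.

    # Illustrative sketch only (not from the article): summarizing subject matter
    # expert (SME) "essential" ratings with Lawshe's content validity ratio (CVR).
    # The panel size, items, and counts below are hypothetical.

    def content_validity_ratio(n_essential: int, n_panelists: int) -> float:
        """Lawshe's CVR = (n_e - N/2) / (N/2); values range from -1 to +1."""
        half = n_panelists / 2
        return (n_essential - half) / half

    # Hypothetical panel of 10 SMEs judging whether each item reflects content
    # essential to performing the job.
    panel_size = 10
    essential_counts = {
        "follows required safety procedures": 9,
        "prepares accurate shift reports": 7,
        "remains calm with upset customers": 5,
    }

    for item, n_essential in essential_counts.items():
        cvr = content_validity_ratio(n_essential, panel_size)
        print(f"{item}: CVR = {cvr:+.2f}")

    # Higher CVR values indicate stronger SME agreement that the item's content
    # is job-relevant; low or negative values flag a weak content linkage.

In practice, items are typically retained only when the CVR exceeds a critical value that depends on panel size, which is one concrete way an expert panel's content judgments can be documented as validity evidence.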
Like the Uniform Guidelines (1978), the SIOP Principles stress the importance of capturing a representative sample of the target of measurement and further establishing a close correspondence between the selection procedure and the work domain. The principles also acknowledge that content validity evidence can be either "logical or empirical" (p. 6), highlighting the role of job analysis and expert judgment in generating content-based evidence. However, unlike the Uniform Guidelines, the SIOP Principles do not make a substantive distinction between work tasks/behaviors and worker requirements/attributes in relation to content-based evidence but rather collectively consider selection procedures that focus on "work behaviors, activities, and/or worker KSAOs" (p. 21). Importantly, the addition of "O" to the KSA acronym represents "other personal characteristics," which are generally understood to include "interests, preferences, temperament, and personality characteristics [emphasis added]" (Brannick et al., 2007, p. 62). Accordingly, although not explicitly stated, the use of content validity evidence as a means of validating personality test inferences for employee selection purposes appears to be consistent with the SIOP Principles.

The Joint Standards (1999/2014)

The Joint Standards (1999/2014) are a set of guidelines for test development and validation in the areas of psychological and educational testing, which were developed by a joint committee including representatives from the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education. According to the standards, content validity is examined by specifying the content domain to be measured and then conducting "logical or empirical analyses of the adequacy with which the test content represents the content domain and of the relevance of the content domain to the proposed interpretation of test scores" (p. 14). In other words, content validity is described as pertaining to "the relationship between the content of a test and the construct it is intended to measure" (p. 14), where "construct" is defined as "The concept or characteristic that a test is designed to measure" (p. 217). Because personality traits are easily understood as constructs, the Joint Standards suggest that personality test inferences may be subject to content-based validation.

Academic Literature

It is also informative to examine the academic literature regarding validation and personality testing. In doing so, several general observations can be made. First, most if not all definitions of content validity share the core notion of documenting the linkage between the content of a test and a particular domain that represents the target of measurement and/or purpose of testing (e.g., Aguinis et al., 2001; Goldstein et al., 1993; Haynes et al., 1995; Sireci, 1998). Second, as noted previously, prominent writings on personality testing in the workplace (e.g., Morgeson et al., 2007b; O'Neill et al., 2013; Ones et al., 2007; Rothstein & Goffin, 2006; Tett & Christiansen, 2007) have tended to ignore the applicability of content validation to personality measures. Third, the sources that do specifically address this issue present mixed opinions. While some have expressed reticence about content-based evidence for measures of personality (e.g., Goldstein et al., 1993; Lawshe, 1985; Wollack, 1976), others consider this restriction to be problematic (e.g., Landy, 1986; McDaniel et al., 2011) or view content validity as particularly relevant to personality testing (e.g., Murphy et al., 2009; O'Neill et al., 2009). Thus, as with the technical standards and guidelines discussed above, those turning to the academic literature for guidance might similarly come away uncertain regarding the use of content validity evidence to support personality measures in an employee selection context.

What Are the Bases of Inconsistency?

This section attempts to identify the conceptual issues that form the bases for disagreement and misunderstanding regarding the use of content validity evidence for personality measures. Making these underlying matters explicit will help to identify some common ground and the potential for a way forward. Based on the review of documents and literature above, the primary areas to be addressed include (a) vestiges of the trinitarian view of validity, (b) the focus of the content match, and (c) a clear understanding of the inferences to be substantiated.

Vestiges of the Trinitarian View of Validity

Although it is now well-established …