discussion (300 words total) - Reading
Finish it in 12 hours. You need to read carefully or I will dispute.
Discussion question:
1. Obviously, in both papers (Requirements development in scenario-based design,
Scenario-Based Usability Engineering, Chapter 3), there seems to be a strong
preference for scenario-based design, and rightfully so as it seems to be a better
approach in almost all situations. However, when would a requirement-based approach
beat out a scenario-based approach, or what big ideas from the readings make the
scenario-based approach better? (bonus points for naming something not mentioned
already!)
2. Read Understanding the Effect of Accuracy on Trust in Machine Learning Models
Discuss: Have you ever used any ML systems in your daily life/work? Do you think those ML
systems are trustworthy?
Copyright 1999 by Mary Beth Rosson and John M. Carroll
DRAFT: PLEASE DO NOT CITE OR CIRCULATE WITHOUT PERMISSION
Scenario-Based Usability Engineering
Mary Beth Rosson and John M. Carroll
Department of Computer Science
Virginia Tech
Fall 1999
Chapter 3
Analyzing Requirements
Making work visible. The end goal of requirements analysis can be elusive when work is not
understood in the same way by all participants. Blomberg, Suchman, and Trigg describe this
problem in their exploration of image-processing services for a law firm. Initial studies of
attorneys produced a rich analysis of their document processing needs—for any legal proceeding,
documents often numbering in the thousands are identified as “responsive” (relevant to the case) by
junior attorneys, in order to be submitted for review by the opposing side. Each page of these
documents is given a unique number for subsequent retrieval. An online retrieval index is created
by litigation support workers; the index encodes document attributes such as date, sender,
recipient, and type. The attorneys assumed that their job (making the subjective relevance
decisions) would be facilitated by image processing that encodes a document’s objective attributes
(e.g., date, sender). However, studies of actual document processing revealed activities that were
not objective at all, but rather relied on the informed judgment of the support staff. Something as
simple as a document date was often ambiguous, because it might display the date it was written,
signed, and/or delivered; deciding which date to encode required understanding the document’s content and role
in a case. Even determining what constituted a document required judgment, as papers came with
attachments and no indication of beginning or end. Taking the perspective of the support staff
revealed knowledge-based activities that were invisible to the attorneys, but that had critical limiting
implications for the role of image-processing technologies (see Blomberg, 1995).
What is Requirements Analysis?
The purpose of requirements analysis is to expose the needs of the current situation with
respect to a proposed system or technology. The analysis begins with a mission statement or
orienting goals, and produces a rich description of current activities that will motivate and guide
subsequent development. In the legal office case described above, the orienting mission was
possible applications of image processing technology; the rich description included a view of case
processing from both the lawyers’ and the support staffs’ perspectives. Usability engineers
contribute to this process by analyzing how features of workers’ tasks and their work
situation contribute to problems or successes.1 This analysis of the difficulties or
opportunities forms a central piece of the requirements for the system under development: at the
minimum, a project team expects to enhance existing work practices. Other requirements may arise
from issues unrelated to use, for example hardware cost, development schedule, or marketing
strategies. However these pragmatic issues are beyond the scope of this textbook. Our focus is on
analyzing the requirements of an existing work setting and of the workers who populate it.
Understanding Work
What is work? If you were to query a banker about her work, you would probably get a
list of things she does on a typical day, perhaps a description of relevant information or tools, and
maybe a summary of other individuals she answers to or makes requests of. At the least,
describing work means describing the activities, artifacts (data, documents, tools), and social
context (organization, roles, dependencies) of a workplace. No single observation or interview
technique will be sufficient to develop a complete analysis; different methods will be useful for
different purposes.
Tradeoff 3.1: Analyzing tasks into hierarchies of sub-tasks and decision rules brings order
to a problem domain, BUT tasks are meaningful only in light of organizational goals and
activities.
A popular approach to analyzing the complex activities that comprise work is to enumerate
and organize tasks and subtasks within a hierarchy (Johnson, 1995). A banker might indicate that
the task of “reviewing my accounts” consists of the subtasks “looking over the account list”,
“noting accounts with recent activity”, and “opening and reviewing active accounts”. Each of these
sub-tasks in turn can be decomposed more finely, perhaps to the level of individual actions such as
picking up or filing a particular document. Some of the tasks will include decision-making, such
as when the banker decides whether or not to open up a specific account based on its level of
activity.
1 In this discussion we use “work” to refer broadly to the goal-directed activities that take place in the problem domain. In some cases, this may involve leisure or educational activities, but in general the same methods can be applied to any situation with established practices.
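To make the idea of hierarchical decomposition concrete, the banker’s task can be written down as a nested structure. The sketch below is our own illustration in Python, not a formal notation from Johnson (1995); the task names follow the example above.

# A minimal sketch of the banker's task hierarchy as a nested
# structure; the representation and names are illustrative only.
task = {
    "task": "review my accounts",
    "subtasks": [
        {"task": "look over the account list"},
        {"task": "note accounts with recent activity"},
        {
            "task": "open and review active accounts",
            "decision": "open an account only if its recent activity warrants review",
            "subtasks": [
                {"task": "pick up a particular document"},
                {"task": "file a particular document"},
            ],
        },
    ],
}

def walk(node, depth=0):
    """Print the hierarchy, flagging decision points at each level."""
    print("  " * depth + node["task"])
    if "decision" in node:
        print("  " * (depth + 1) + "[decision] " + node["decision"])
    for sub in node.get("subtasks", []):
        walk(sub, depth + 1)

walk(task)

Walking the tree in this way is what lets an analyst inspect a task’s structure for completeness, complexity, or inconsistencies, as discussed next.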
A strength of task analysis is its step-by-step transformation of a complex space of
activities into an organized set of choices and actions. This allows a requirements analyst to
examine the task’s structure for completeness, complexity, inconsistencies, and so on. However
the goal of systematic decomposition can also be problematic, if analysts become consumed by
representing task elements, step sequences, and decision rules. Individual tasks must be
understood within the larger context of work; over-emphasizing the steps of a task can cause
analysts to miss the forest for the trees. To truly understand the task of reviewing accounts a
usability engineer must learn who is responsible for ensuring that accounts are up to date, how
account access is authorized and managed, and so on.
The context of work includes the physical, organizational, social, and cultural relationships
that make up the work environment. Actions in a workplace do not take place in a vacuum;
individual tasks are motivated by goals, which in turn are part of larger activities motivated by the
organizations and cultures in which the work takes place (see Activities of a Health Care Center,
below). A banker may report that she is reviewing accounts, but from the perspective of the
banking organization she is “providing customer service” or perhaps “increasing return on
investment”. Many individuals — secretaries, data-entry personnel, database programmers,
executives — work with the banker to achieve these high-level objectives. They collaborate
through interactions with shared tools and information; this collaboration is shaped not only by the
tools that they use, but also by the participants’ shared understanding of the bank’s business
practice — its goals, policies, and procedures.
Tradeoff 3.2: Task information and procedures are externalized in artifacts, BUT the impact
of these artifacts on work is apparent only in studying their use.
A valuable source of information about work practices is the artifacts used to support task
goals (Carroll & Campbell, 1989). An artifact is simply a designed object — in an office setting, it
might be a paper form, a pencil, an in-basket, or a piece of computer software. It is simple and fun
to collect artifacts and analyze their characteristics (Norman, 1990). Consider the shape of a
pencil: it conveys a great deal about the size and grasping features of the humans who use it;
pencil designers will succeed to a great extent by giving their new designs the physical
characteristics of pencils that have been used for years. But artifacts are just part of the picture.
Even an object as simple as a pencil must be analyzed as part of a real world activity, an activity
that may introduce concerns such as erasability (elementary school use), sharpness (architecture
firm drawings), name-brands (pre-teen status brokering), cost (office supplies accounting), and so
on.
Usability engineers have adapted ethnographic techniques to analyze the diverse factors
influencing work. Ethnography refers to methods developed within anthropology for gaining
insights into the life experiences of individuals whose everyday reality is vastly different from the
analyst’s (Blomberg, 1990). Ethnographers typically become intensely involved in their study of a
group’s culture and activities, often to the point of becoming members themselves. As used by
HCI and system design communities, ethnography involves observations and interviews of work
groups in their natural setting, as well as collection and analysis of work artifacts (see Team Work
in Air Traffic Control, below). These studies are often carried out in an iterative fashion, where
the interpretation of one set of data raises questions or possibilities that may be pursued more
directly in follow-up observations and interviews.
Figure 3.1: Activity Theory Analysis of a Health Care Center (after Kuutti and Arvonen, 1992)
[Figure 3.1 content: Tools supporting activity: patient record, medicines, etc.; Subject involved in activity: one physician in a health care unit; Community sponsoring activity: all personnel of the health care unit; Object of activity: the complex, multi-dimensional problem of a patient; Activity outcome: patient problem resolved; mediated by rules of practice and a division of labor.]
Activities of a Health Care Center: Activity Theory (AT) offers a view of individual
work that grounds it in the goals and practices of the community within which the work takes
place. Engeström (1987) describes how an individual (the subject) works on a problem (the
object) to achieve a result (the outcome), but that the work on the problem is mediated by the tools
available (see Figure 3.1). An individual’s work is also mediated by the rules of practice shared
within her community; the object of her work is mediated by that same community’s division of
labor.
Kuutti and Arvonen (1992; see also Engeström 1990; 1991; 1993) applied this framework
to their studies of a health care organization in Espoo, Finland. This organization wished to evolve
from a rather bureaucratic organization with strong separations between its various units (e.g.,
social work, clinics, hospital) to a more service-oriented organization. A key assumption in doing
this was that the different units shared a common general object of work—the “life processes” of
the town’s citizens. This high-level goal was acknowledged to be a complex problem requiring the
integrated services of complementary health care units.
The diagram in Figure 3.1 summarizes an AT analysis developed for one physician in a
clinic. The analysis records the shared object (the health conditions of a patient). At the same time
it shows this physician’s membership in a subcommunity, specifically the personnel at her clinic.
This clinic is both geographically and functionally separated from other health care units, such as
the hospital or the social work office. The tools that the physician uses in her work, the rules that
govern her actions, and her understanding of her goals are mediated by her clinic. As a result, she
has no way of analyzing or finding out about other dimensions of this patient’s problems, for
example the home life problems being followed by a social worker, or emotional problems under
treatment by psychiatric personnel. In AT such obstacles are identified as contradictions which
must be resolved before the activity can be successful.
In this case, a new view of community was developed for the activity. For each patient,
email or telephone was used to instantiate a new community, comprised of the relevant individuals
from different health units. Of course the creation of a more differentiated community required
negotiation concerning the division of labor (e.g. who will contact whom and for what purpose),
and rules of action (e.g., what should be done and in what order). Finally, new tools (composite
records, a “master plan”) were constructed that better supported the redefined activity.
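The elements of Figure 3.1 can also be transcribed into a small data structure so that analyses of different cases can be compared side by side. The sketch below is our own; the class and field names are illustrative, not part of the AT notation, and the values restate the figure.

from dataclasses import dataclass
from typing import List

# A minimal sketch transcribing Figure 3.1; class and field names
# are our own illustration, not activity theory notation.
@dataclass
class Activity:
    subject: str             # the individual carrying out the work
    object: str              # the problem being worked on
    outcome: str             # the intended result
    tools: List[str]         # artifacts mediating work on the object
    community: str           # the group sponsoring the activity
    rules_of_practice: str   # mediates the subject-community relation
    division_of_labor: str   # mediates the community-object relation

clinic_care = Activity(
    subject="one physician in a health care unit",
    object="the complex, multi-dimensional problem of a patient",
    outcome="patient problem resolved",
    tools=["patient record", "medicines"],
    community="all personnel of the health care unit",
    rules_of_practice="procedures shared within the clinic",
    division_of_labor="specialization across separated health care units",
)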
Figure 3.2 will appear here, a copy of the figure provided by Hughes et al. in their
ethnographic report. Need to get copyright permission.
Team Work in Air Traffic Control: An ethnographic study of British air traffic
control rooms by Hughes, Randall and Shapiro (CSCW’92) highlighted the central role played by
the paper strips used to chart the progress of individual flights. In this study the field workers
immersed themselves in the work of air traffic controllers for several months. During this time
they observed the activity in the control rooms and talked to the staff; they also discussed with the
staff the observations they were collecting and their interpretation of these data.
The general goal of the ethnography was to analyze the social organization of the work in
the air traffic control rooms. In this the researchers showed how the flight progress strips
supported “individuation”, such that each controller knew what their job was in any given
situation, but also how their tasks were interdependent with the tasks of others. The resulting
division of labor was accomplished in a smooth fashion because the controllers had shared
knowledge of what the strips indicated, and were able to take on and hand off tasks as needed, and
to recognize and address problems that arose.
Each strip displays an airplane’s ID and aircraft type; its current level, heading, and
airspeed; its planned flight path, navigation points on route, estimated arrival at these points; and
departure and destination airports (see Figure 3.2). However a strip is more than an information
display. The strips are work sites, used to initiate and perform control tasks. Strips are printed
from the online database, but then annotated as flight events transpire. This creates a public
history; any controller can use a strip to reconstruct a “trajectory” of what the team has done with a
flight. The strips are used in conjunction with the overview offered by radar to spot exceptions or
problems to standard ordering and arrangement of traffic. An individual strip gets “messy” to the
extent it has deviated from the norm, so a set of strips serves as a sort of proxy for the orderliness
of the skies.
The team interacts through the strips. Once a strip is printed and its initial data verified, it is
placed in a holder color-coded for its direction. It may then be marked up by different controllers,
each using a different ink color; problems or deviations are signaled by moving a strip out of
alignment, so that visual scanning detects problem flights. This has important social consequences
for the active controller responsible for a flight. She knows that other team members are aware of
the flight’s situation and can be consulted; who if anyone has noted specific issues with the flight;
if a particularly difficult problem arises it can be passed on to the team leader without a lot of
explanation; and so on.
The ethnographic analysis documented the complex tasks that revolved around the flight
control strips. At the same time it made clear the constraints of these manually-created and
maintained records. However a particularly compelling element of the situation was the
controllers’ trust in the information on the strips. This was due not to the strips’ physical
characteristics, but rather to the social process they enable—the strips are public, and staying on
top of each others’ problem flights, discussing them informally while working or during breaks, is
taken for granted. Any computerized replacement of the strips must support not just management
of flight information, but also the social fabric of the work that engenders confidence in the
information displayed.
User Involvement
Who are a system’s target users? Clearly this is a critical question for a user-centered
development process. It first comes up during requirements analysis, when the team is seeking to
identify a target population(s), so as to focus in on the activities that will suggest problems and
concerns. Managers or corporation executives are a good source of high-level needs statements
(e.g., reduce data-processing errors, integrate billing and accounting). Such individuals also have
a well-organized view of their subordinates’ responsibilities, and of the conditions under which
various tasks are completed. Because of the hierarchical nature of most organizations, such
individuals are usually easy to identify and comprise a relatively small set. Unfortunately, if a
requirements team accepts these requirements too readily, they may miss the more detailed and
situation-specific needs of the individuals who will use a new system in their daily work.
Tradeoff 3.3: Management understands the high-level requirements for a system, BUT is
often unaware of workers’ detailed needs and preferences.
Every system development situation includes multiple stakeholders (Checkland, 1981).
Individuals in management positions may have authorized a system’s purchase or development;
workers with a range of job responsibilities will actually use the system; others may benefit only
indirectly from the tasks a system supports. Each set of stakeholders has its own set of
motivations and problems that the new system might address (e.g., productivity, satisfaction, ease
of learning). What’s more, none of them can adequately communicate the perspectives of the
others — despite the best of intentions, many details of a subordinate’s work activities and
concerns are invisible to those in supervisory roles. Clearly what is needed in requirements
analysis is a broad-based approach that incorporates diverse stakeholder groups into the
observation and interviewing activities.
Tradeoff 3.4: Workers can describe their tasks, BUT work is full of exceptions, and the
knowledge for managing exceptions is often tacit and difficult to externalize.
But do users really understand their own work? We made the point above that a narrow
focus on the steps of a task might cause analysts to miss important workplace context factors. An
analogous point holds with respect to interviews or discussions with users. Humans are
remarkably good (and reliable) at “rationalizing” their behavior (Ericsson & Simon, 1992).
Reports of work practices are no exception — when asked workers will usually first describe a
most-likely version of a task. If an established “procedures manual” or other policy document
exists, the activities described by experienced workers will mirror the official procedures and
policies. However this officially-blessed knowledge is only part of the picture. An experienced
worker will also have considerable “unofficial” knowledge acquired through years of encountering
and dealing with the specific needs of different situations, with exceptions, with particular
individuals who are part of the process, and so on. This expertise is often tacit, in that the
knowledgeable individuals often don’t even realize what they “know” until confronted with their
own behavior or interviewed with situation-specific probes (see Tacit Knowledge in Telephone
Trouble-Shooting, below). From the perspective of requirements analysis, however, tacit
knowledge about work can be critical, as it often contains the “fixes” or “enhancements” that have
developed informally to address the problems or opportunities of day-to-day work.
One effective technique for probing workers’ conscious and unconscious knowledge is
contextual inquiry (Beyer & Holtzblatt, 1994). This analysis method is similar to ethnography, in
that it involves the observation of individuals in the context of their normal work environment.
However it includes the prerogative to interrupt an observed activity at points that seem informative
(e.g., when a problematic situation arises) and to interview the affected individual(s) on the spot
concerning the events that have been observed, to better understand causal factors and options for
continuing the activity. For example, a usability engineer who saw a secretary stop working on a
memo to make a phone call to another secretary, might ask her afterwards to explain what had just
happened between her and her co-worker.
Tacit Knowledge in Telephone Trouble-Shooting: It is common for workers to
see their conversations and interactions with each other as a social aspect of work that is enjoyable
but unrelated to work goals. Sachs (199x) observed this in her case study of telephony workers in
a phone company. The study analyzed the work processes related to detecting, submitting, and
resolving problems on telephone lines; the focus of the study was the Trouble Ticketing System
(TTS), a large database used to record telephone line problems, assign problems (tickets) to
engineers for correction, and keep records of problems detected and resolved.
Sachs argues that TTS takes an organizational view of work, treating work tasks as
modular and well-defined: one worker finds a problem, submits it to the database, TTS assigns it
to the engineer at the relevant site, that engineer picks up the ticket, fixes the problem, and moves
on. The original worker is freed from the problem analysis task once the original ticket has been
submitted, and the second worker can move on once the problem has been addressed. TTS replaced a manual system
in which workers contacted each other directly over the phone, often working together to resolve a
problem. TTS was designed to make work more efficient by eliminating unnecessary phone
conversations.
In her interviews with telephony veterans, Sachs discovered that the phone conversations
were far from unnecessary. The initiation, conduct, and consequences of these conversations
reflected a wealth of tacit knowledge on the part of the worker—selecting the right person to call
(one known to have relevant expertise for this apparent problem), the “filling in” on what the first
worker had or had not determined or tried to this point, sharing of hypotheses and testing methods,
iterating together through tests and results, and carrying the results of this informal analysis into
other possibly related problem areas. In fact, TTS had made work less efficient in many cases,
because in order to do a competent job, engineers developed “workarounds” wherein they used
phone conversations as they had in the past, then used TTS to document the process afterwards.
Of interest was that the telephony workers were not at first aware of how much knowledge
of trouble-shooting they were applying to their jobs. They described the tasks as they understood
them from company policy and procedures. Only after considerable data collection and discussion
did they recognize that their jobs included the skills to navigate and draw upon a rich organizational
network of colleagues. In further work Sachs helped the phone company to develop a fix for the
observed workarounds in the form of a new organizational role: a “turf coordinator”, a senior
engineer responsible for identifying and coordinating the temporary network of workers needed to
collaborate on trouble-shooting a problem. As a result of Sachs’s analysis, work that had been tacit
and informal was elevated to an explicit business responsibility.
Requirements Analysis with Scenarios
As introduced in Chapter 2, requirements refers to the first phase of SBUE. As we also
have emphasized, requirements cannot be analyzed all at once in waterfall fashion. However some
analysis must happen early on to get the ball rolling. User interaction scenarios play an important
role in these early analysis activities. When analysts are observing workers in the world, they are
collecting observed scenarios, episodes of actual interaction among workers that may or may not
involve technology. The analysis goal is to produce a summary that captures the critical aspects of
the observed activities. A central piece of this summary analysis is a set of requirements scenarios.
The development of requirements scenarios begins with determining who are the
stakeholders in a work situation — what their roles and motivations are, what characteristics they
possess that might influence reactions to new technology. A description of these stakeholders’
work practice is then created, through a combination of workplace observation and generation of
hypothetical situations. These sources of data are summarized and combined to generate the
requirements scenarios. A final step is to call out the most critical features of the scenarios, along
with hypotheses about the positive or negative consequences that these features seem to be having
on the work setting.
Introducing the Virtual Science Fair Example Case
The methods of SBUE will be introduced with reference to a single open-ended example
problem, the design of a virtual science fair (VSF). The high-level concept is to use computer-
mediated communication technology (e.g., email, online chat, discussion forums,
videoconferencing) and online archives (e.g., databases, digital libraries) to supplement the
traditional physical science fairs. Such fairs typically involve student creation of science projects
over a period of months. The projects are then exhibited and judged at the science fair event. We
begin with a very loose concept of what a virtual version of such a fair might be — not a
replacement of current fairs, but rather a supplement that expands the boundaries of what might
constitute participation, project construction, project exhibits, judging, and so on.
Stakeholder Analysis
Checkland (1981) offers a mnemonic for guiding development of an early shared vision of
a system’s goals — CATWOE analysis. CATWOE elements include Clients (those people who
will benefit or suffer from the system), Actors (those who interact with the system), a
Transformation (the basic purpose of the system), a Weltanschauung (the world view promoted by
the system), Owners (the individuals commissioning or authorizing the system), and the
Environment (physical constraints on the system). SBUE adapts Checkland’s technique as an aid
in identifying and organizing the concerns of various stakeholders during requirements
analysis. The SBUE adaptation of Checkland’s technique includes the development of thumbnail
scenarios for each element identified. The table includes just one example for each VSF element
called out in the analysis; for a complex situation multiple thumbnails might be needed. Each
scenario sketch is a usage-oriented elaboration of the element itself; the sketch points to a future
situation in which a possible benefit, interaction, environmental constraint, etc., is realized. Thus
the client thumbnails emphasize hoped-for benefits of the VSF; the actor thumbnails suggest a few
interaction variations anticipated for different stakeholders. The thumbnail scenarios generated in
this analysis are not yet design scenarios; they simply allow the analyst to begin to explore the
space of user groups, motivations, and pragmatic constraints.
The CATWOE thumbnail scenarios begin the iterative process of identifying and analyzing
the background, motivations, and preferences that different user groups will bring to the use of the
target system. This initial picture will be elaborated throughout the development process, through
analysis of both existing and envisioned usage situations.
CATWOE Element | VSF Element | Thumbnail Scenarios
Clients | Students; Community members | A high school student learns about road-bed coatings from a retired civil engineer. A busy housewife helps a middle school student organize her bibliographic information.
Actors | Students; Teachers; Community members | A …
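For analysts who want to keep this table machine-readable as it grows, the same content can be captured as a simple data structure. The sketch below is ours: key names and layout are illustrative, and the remaining CATWOE elements are omitted because the table above is truncated in this draft.

# A minimal sketch of the VSF CATWOE table as a data structure.
# Key names and layout are ours; the thumbnails restate the table.
catwoe_vsf = {
    "Clients": {
        "vsf_elements": ["students", "community members"],
        "thumbnail_scenarios": [
            "A high school student learns about road-bed coatings "
            "from a retired civil engineer.",
            "A busy housewife helps a middle school student organize "
            "her bibliographic information.",
        ],
    },
    "Actors": {
        "vsf_elements": ["students", "teachers", "community members"],
        "thumbnail_scenarios": [],  # truncated in the source draft
    },
    # "Transformation", "Weltanschauung", "Owners", and "Environment"
    # entries would follow the same shape.
}

for element, entry in catwoe_vsf.items():
    print(element + ": " + "; ".join(entry["vsf_elements"]))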
Understanding the Effect of Accuracy on Trust in
Machine Learning Models
Ming Yin
Purdue University
[email protected]
Jennifer Wortman Vaughan
Microsoft Research
[email protected]
Hanna Wallach
Microsoft Research
[email protected]
ABSTRACT
We address a relatively under-explored aspect of human–
computer interaction: people’s abilities to understand the
relationship between a machine learning model’s stated per-
formance on held-out data and its expected performance post
deployment. We conduct large-scale, randomized human-
subject experiments to examine whether laypeople’s trust in
a model, measured in terms of both the frequency with which
they revise their predictions to match those of the model and
their self-reported levels of trust in the model, varies depend-
ing on the model’s stated accuracy on held-out data and on its
observed accuracy in practice. We find that people’s trust in a
model is affected by both its stated accuracy and its observed
accuracy, and that the effect of stated accuracy can change
depending on the observed accuracy. Our work relates to re-
cent research on interpretable machine learning, but moves
beyond the typical focus on model internals, exploring a
different component of the machine learning pipeline.
CCS CONCEPTS
• Human-centered computing → Empirical studies in
HCI; • Computing methodologies → Machine learn-
ing.
KEYWORDS
Machine learning, trust, human-subject experiments
ACM Reference Format:
Ming Yin, Jennifer Wortman Vaughan, and Hanna Wallach. 2019.
Understanding the Effect of Accuracy on Trust in Machine Learn-
ing Models. In CHI Conference on Human Factors in Computing
Systems Proceedings (CHI 2019), May 4–9, 2019, Glasgow, Scotland
UK. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3290605.3300509
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
CHI 2019, May 4–9, 2019, Glasgow, Scotland UK
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-5970-2/19/05...$15.00
https://doi.org/10.1145/3290605.3300509
1 INTRODUCTION
Machine learning (ML) is becoming increasingly ubiquitous
as a tool to aid human decision-making in diverse domains
ranging from medicine to public policy and law. For exam-
ple, researchers have trained deep neural networks to help
dermatologists identify skin cancer [8], while political strate-
gists regularly use ML-based forecasts to determine their
next move [21]. Police departments have used ML systems
to predict the location of human trafficking hotspots [28],
while child welfare workers have used predictive modeling
to strategically target services to the children most at risk [3].
This widespread applicability of ML has led to a move-
ment to “democratize machine learning” [12] by developing
off-the-shelf models and toolkits that make it possible for
anyone to incorporate ML into their own system or decision-
making pipeline, without the need for any formal training.
While this movement opens up endless possibilities for ML
to have real-world impact, it also creates new challenges.
Decision-makers may not be used to reasoning about the
explicit forms of uncertainty that are baked into ML pre-
dictions [27], or, because they do not need to understand
the inner workings of an ML model in order to use it, they
may misunderstand or mistrust its predictions [6, 16, 25].
Prompted by these challenges, as well as growing concerns
that ML systems may inadvertently reinforce or amplify so-
cietal biases [1, 2], researchers have turned their attention to
the ways that humans interact with ML, typically focusing
on people’s abilities and willingness to use, understand, and
trust ML systems. This body of work often falls under the
broad umbrella of interpretable machine learning [6, 16, 25].
To date, most work on interpretability has focused explic-
itly on ML models, asking questions about people’s abilities
to understand model internals or the ways that particular
models map inputs to outputs [20, 24], as well as questions
about the relationship between these abilities and people’s
willingness to trust a model. However, the model is just one
component of the ML pipeline, which spans data collection,
model selection, training algorithms and procedures, model
evaluation, and ultimately, deployment. It is therefore im-
portant to study people’s interactions with each of these
components—not just those that relate to model internals.
One particularly under-explored aspect of the evaluation
and deployment components of the pipeline is the inter-
pretability of performance metrics, such as accuracy, preci-
sion, or recall. The democratization of ML means that it is
increasingly common for a decision-maker to be presented
with a “black-box” model along with some measure of its
performance—most often accuracy—on held-out data. How-
ever, a model’s stated performance may not accurately reflect
its performance post deployment because the data on which
the model was trained and evaluated may look very differ-
ent from real-world use cases [15]. In deciding how much to
trust the model, the decision-maker has little to go on besides
this stated performance, her own limited observations of the
model’s predictions in practice, and her domain knowledge.
This scenario raises a number of questions. To what extent
do laypeople—who are increasingly often the end users of
systems built using ML models—understand the relationship
between a model’s stated performance on held-out data and
its expected performance post deployment? How does their
understanding influence their willingness to trust the model?
For example, do people trust a model more if they are told
that its accuracy on held-out data is 90% as compared with
70%? If so, will the model’s stated accuracy continue to influ-
ence their trust in the model even after they are given the op-
portunity to observe and interact with the model in practice?
In this paper, we describe the results of a sequence of
large-scale, randomized, pre-registered human-subject exper-
iments1 designed to investigate whether an ML model’s accu-
racy affects laypeople’s willingness to trust the model. Specif-
ically, we focus on the following three main questions:
• Does a model’s stated accuracy on held-out data affect
people’s trust in the model?
• If so, does it continue to do so after people have observed
the model’s accuracy in practice?
• How does a model’s observed accuracy in practice affect
people’s trust in the model?
In each of our experiments, subjects recruited on Amazon
Mechanical Turk were asked to make predictions about the
outcomes of speed dating events with the help of an ML
model. Subjects were first shown information about a speed
dating participant and his or her date, and then asked to
predict whether or not the participant would want to see his
or her date again. Finally, they were shown the model’s pre-
diction and given the option of revising their own prediction.
In our first experiment, we focus on the first two questions
above, investigating whether a model’s stated accuracy on
held-out data affects laypeople’s trust in the model and, if so,
whether it continues to do so after they have observed the
model’s accuracy in practice. Subjects were randomized into
one of ten treatments, which differed along two dimensions:
1 All experiments were approved by the Microsoft Research IRB.
stated accuracy on held-out data and amount at stake. Some
subjects were given no information about the model’s accu-
racy on held-out data, while others were told that its accuracy
was 60%, 70%, 90%, or 95%. Halfway through the experiment,
each subject was given feedback on both their own accuracy
and the model’s accuracy on the first half of the prediction
tasks, which was 80% regardless of the treatment. Subjects in
all treatments saw exactly the same speed dating events and
exactly the same model predictions. This experimental de-
sign allows us to isolate the effect of stated accuracy on people’s
trust, both before and after they observe the model’s accuracy
in practice. As a robustness check, some subjects received a
monetary bonus for each correct prediction, while others did
not, allowing us to test whether the effect of stated accuracy
on trust varies when people have more “skin in the game.”
We find that stated accuracy does have a significant effect
on people’s trust in a model, measured in terms of both the
frequency with which subjects adjust their predictions to
match those of the model and their self-reported levels of
trust in the model. We also find that the effect size is smaller
after people observe the model’s accuracy in practice. We do
not find that the amount at stake has a significant effect.
In our second experiment, we test whether these results
are robust to different levels of observed accuracy by running
two additional variations of our first experiment: one in
which the observed accuracy of the model was low and one
in which the observed accuracy of the model was high. We
find that a model’s stated accuracy still has a significant effect
on people’s trust even after observing a high accuracy (100%)
in practice. However, if a model’s observed accuracy is low
(55%), then after observing this accuracy, the stated accuracy
has at most a very small effect on people’s trust in the model.
In our third experiment, we investigate the final question
above—i.e., how does a model’s observed accuracy in prac-
tice affect people’s trust in the model? The experimental
design used in our first two experiments does not enable us
to directly compare people’s trust between treatments with
different levels of observed accuracy because the prediction
tasks (i.e., speed dating events) and the model predictions
differed between these treatments. Our third experiment was
therefore carefully designed to enable us to make such com-
parisons. We find that after observing a model’s accuracy in
practice, people’s trust in the model is significantly affected
by its observed accuracy regardless of its stated accuracy.
Finally, via an exploratory analysis, we dig more deeply
into the question of how people update their trust after re-
ceiving feedback on their own accuracy and the model’s
accuracy in practice. We analyze differences in individual
subjects’ trust in the model before and after receiving such
feedback. Our experimental data support the conjecture that
people compare their own accuracy to the model’s observed
accuracy, increasing their trust in the model if the model’s
observed accuracy is higher than their own accuracy—except
in the case where the model’s observed accuracy is substan-
tially lower than its stated accuracy on held-out data.
Taken together, our results show that laypeople’s trust
in an ML model is affected by both the model’s stated accu-
racy on held-out data and its observed accuracy in practice.
These results highlight the need for designers of ML systems
to clearly and responsibly communicate their expectations
about model performance, as this information shapes the
extent to which people trust a model, both before and after
they are able to observe and interact with it in practice. Our
results also reveal the importance of properly communicat-
ing the uncertainty that is baked into every ML prediction.
Of course, proper caution should be used when generalizing
our results to other settings. For example, although we do
not find that the amount at stake has a significant effect, it
is possible that there would be an effect when stakes are suf-
ficiently high (e.g., doctors making life-or-death decisions).
Related Work
Our research contributes to a growing body of experimental
work on trust in algorithmic systems. As a few examples,
Dzindolet et al. [7] and Dietvorst et al. [4] found that people
stop trusting an algorithm after witnessing it make a mistake,
even when the algorithm outperforms human predictions—
a phenomenon known as algorithm aversion. Dietvorst et
al. [5] found that people are more willing to rely on an algo-
rithm’s predictions when they are given the ability to make
minor adjustments to the predictions rather than accepting
them as is. Yeomans et al. [30] found that people distrust
automated recommender systems compared with human rec-
ommendations in the context of predicting which jokes peo-
ple will find funny—a highly subjective domain—even when
the recommender system outperforms human predictions. In
contrast, Logg et al. [17] found that people trust predictions
more when they believe that the predictions come from an
algorithm as opposed to a human expert when predicting mu-
sic popularity, romantic matches, and other outcomes. This
effect is diluted when people are given the choice between us-
ing an algorithm’s prediction and using their own prediction
(as opposed to a prediction from another human expert).
The relationship between interpretability and trust has
been discussed in several recent papers [16, 22, 25]. Most
related to our work, and an inspiration for our experimental
design, Poursabzi-Sangdeh et al. [24] ran a sequence of ran-
domized human-subject experiments and found no evidence
that either the number of features used in an ML model
or the model’s level of transparency (clear or black box)
has a significant impact on people’s willingness to trust the
model’s predictions, although these factors do affect people’s
abilities to detect when the model has made a mistake.
Kennedy et al. [14] touched on the relationship between
stated accuracy and trust in the context of criminal recidi-
vism prediction. They ran a conjoint experiment in which
they presented subjects with randomly generated pairs of
models and asked each subject which model they preferred.
The models varied in terms of their stated accuracy, the size
of the (fictitious) training data set, the number of features,
and several other properties. The authors estimated the ef-
fect of each property by fitting a hierarchical linear model
and found that people generally focus most on the size of the
training data set, the source of the algorithm, and the stated
accuracy, while less often taking into account the model’s
level of transparency or the relevance of the training data.
Finally, a few studies from the human–computer interac-
tion community have examined the relationship between sys-
tem performance and users’ trust in automated systems [31,
32], ubiquitous computing systems [13], recommender sys-
tems [23], and robots [26]. For example, in a simulated ex-
perimental environment in which users interacted with an
automated quality monitoring system to identify faulty items
in a fictional factory production line, Yu et al. [31, 32] ex-
plored how users’ trust in the system varies with its accuracy.
Unlike in our work, system accuracy was not explicitly com-
municated to users. Instead, users “perceived” the accuracy
by receiving feedback after interacting with the system. Yu et
al. found that users are able to correctly perceive the accuracy
and stabilize their trust to a level correlated with the accu-
racy [31], though system failures have a stronger impact on
trust than system successes [32]. In addition, Kay et al. [13]
developed a survey tool through which they revealed that, for
classifiers used in four hypothetical applications (e.g., elec-
tricity monitoring and location tracking), users tend to put
more weight on the classifiers’ recall rather than their pre-
cision when deciding whether the classifiers’ performance
is acceptable, with the weight varying across applications.
2 EXPERIMENT 1: DOES A MODEL’S STATED
ACCURACY AFFECT LAYPEOPLE’S TRUST?
Our first experiment was designed to answer our first two
main questions—i.e., does a model’s stated accuracy on held-
out data affect laypeople’s trust in the model, and if so, does
it continue to do so after they have observed the model’s ac-
curacy in practice? In our experiment, each subject observed
the model’s accuracy in practice via a feedback screen that
was presented halfway through the experiment with infor-
mation about the subject’s own accuracy and the model’s
accuracy thus far, as described below. Before running the ex-
periment, we posited and pre-registered two hypotheses de-
rived from our questions, which we state informally here:2
• [H1] The stated accuracy of a model has a significant effect on people’s trust in the model before seeing the feedback screen.
• [H2] The stated accuracy of a model has a significant effect on people’s trust in the model after seeing the feedback screen.
2 The pre-registration document is at https://aspredicted.org/uq3hi.pdf.
As a robustness check to guard against the potential criti-
cism that any null results might be due to a lack of perfor-
mance incentives, we randomly selected some subjects to
receive a monetary bonus for each correct prediction. We also
posited and pre-registered two additional hypotheses:
• [H3] The amount at stake has a significant effect on peo-
ple’s trust in a model before seeing the feedback screen.
• [H4] The amount at stake has a significant effect on
people’s trust in a model after seeing the feedback screen.
Prediction Tasks
We asked subjects to make predictions about the outcomes of
forty speed dating events. The data came from real speed dat-
ing participants and their dates via the experimental study
of Fisman et al. [9]. Each speed dating participant indicated
whether or not he or she wanted to see his or her date again,
thereby giving us ground truth from which to compute accu-
racy. We chose this application for two reasons: First, predict-
ing romantic interest does not require specialized domain
expertise. Second, this setting is plausibly one in which ML
might be used given that many dating websites already rely
on ML models to predict potential romantic partners [18, 29].
For each prediction task (i.e., speed dating event), each
subject was first shown a screen of information about the
speed dating participant and his or her date, including:
• The participant’s basic information: the gender, age, field
of study, race, etc. of the participant.
• The date’s basic information: the gender, age, and race of
the participant’s date.
• The participant’s preferences: the participant’s reported
distribution of 100 points among six attributes (attrac-
tiveness, sincerity, intelligence, fun, ambition, and shared
interests), indicating how much he or she values each
attribute in a romantic partner.
• The participant’s impression of the date: the participant’s
rating of his or her date on the same six attributes us-
ing a scale of one to ten, as well as scores (also using a
scale of one to ten) indicating how happy the participant
expected to be with his or her date and how much the
participant liked his or her date.
The subject was then asked to follow a three-step procedure:
First, they were asked to carefully review the information
about the participant and his or her date and predict whether
or not the participant would want to see his or her date
Figure 1: Screenshot of the prediction task interface.
again. Next, they were shown the model’s (binary) prediction.
Finally, they were given the option of revising their own
prediction. A screenshot of the interface is shown in Figure 1.
Experimental Treatments
We randomized subjects into one of ten treatments arranged
in a 5×2 design. The treatments differed along two dimen-
sions: stated accuracy on held-out data and amount at stake.
Subjects were randomly assigned to one of five accuracy
levels: none (the baseline), 60%, 70%, 90%, or 95%. Subjects
assigned to an accuracy level of none were initially given
no information about the model’s accuracy on held-out data.
Subjects assigned to one of the other accuracy levels saw
the following sentence in the instructions: “We previously
evaluated this model on a large data set of speed dating
participants and its accuracy was x%, i.e., the model’s predic-
tions were correct on x% of the speed dating participants in
this data set.” Throughout the experiment, we also reminded
these subjects of the model’s stated accuracy on held-out data
each time they were shown one of the model’s predictions.
We note that our sentence about accuracy was not a decep-
tion. We developed four ML models (a rule-based classifier,
a support vector machine, a three-hidden-layer neural net-
work, and a random forest) and evaluated them on a held-out
data set of 500 speed dating participants, obtaining accuracies
of 60%, 70%, 90%, and 95%. To keep the treatments as similar
as possible, the models made exactly the same predictions for
the forty speed dating events that were shown to subjects.
Subjects were randomly assigned to either low or high
stakes. Subjects assigned to low stakes were paid a flat rate
of $1.50 for completing the experiment. Subjects assigned to
high stakes also received a monetary bonus of $0.10 for each
correct (final) prediction3 in addition to the flat rate of $1.50.
Experimental Design
We posted our experiment as a human intelligence task (HIT)
on Amazon Mechanical Turk. The experiment was only open
to workers in the U.S., and each worker could participate
only once. In total, 1,994 subjects completed the experiment.
Upon accepting the HIT, each subject was randomized
into one of the ten treatments described above. Each HIT
consisted of exactly the same forty prediction tasks, grouped
into two sets A and B of twenty tasks each. As described
above, subjects in all ten treatments saw exactly the same
model prediction for each task. The experiment was divided
into two phases. To minimize differences between the phases,
subjects were randomly assigned to see either the tasks in
set A during Phase 1 and the tasks in set B during Phase 2, or
vice versa; the order of the tasks was randomized within each
phase. We chose the tasks in sets A and B so that the observed
accuracy on the first twenty tasks would be 80% regardless of
the ordering of sets A and B. This experimental design mini-
mizes differences between treatments and allows us to draw
causal conclusions about the effect of stated accuracy on
people’s trust without worrying about confounding factors.
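As an illustration of this counterbalancing, consider the sketch below. The task entries are hypothetical placeholders rather than the real speed dating tasks; the point is that each set is constructed so the model is correct on 16 of its 20 tasks, which makes the Phase 1 observed accuracy 80% under either ordering.

import random

# A sketch of the counterbalanced design described above. Task
# contents are hypothetical placeholders; each set is built so the
# model's fixed predictions are correct on 16 of 20 tasks.
set_A = [{"task": f"A{i}", "model_correct": i < 16} for i in range(20)]
set_B = [{"task": f"B{i}", "model_correct": i < 16} for i in range(20)]

def assign_subject(rng):
    """Randomize the phase order, then shuffle within each phase."""
    phase1, phase2 = rng.sample([list(set_A), list(set_B)], k=2)
    rng.shuffle(phase1)
    rng.shuffle(phase2)
    observed = sum(t["model_correct"] for t in phase1) / len(phase1)
    assert observed == 0.80  # holds regardless of order or shuffling
    return phase1, phase2

phase1, phase2 = assign_subject(random.Random(0))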
Each subject was asked to make initial and final predic-
tions for each task, following the three-step procedure de-
scribed above. The subjects were given no feedback on their
own prediction or the model’s prediction for any individual
task; however, after Phase 1, each subject was shown a feed-
back screen with information about their own accuracy and
the model’s accuracy (80% by design) on the tasks in Phase
1. A screenshot of the feedback screen is shown in Figure 2.
At the end of the HIT, each subject completed an exit sur-
vey in which they were asked to report their level of trust in
the model during each phase using a scale of one (“I didn’t
trust it at all”) to ten (“I fully trust it”). Specifically, we asked
subjects the following question: “How much did you trust
our machine learning algorithm’s predictions on the first
[last] twenty speed dating participants (that is, before [after]
3 The highest possible bonus was 40 × $0.10 = $4—i.e., substantially more
than the flat rate of $1.50, thereby making the bonus salient [11].
Figure 2: Screenshot of the feedback screen shown between
Phase 1 and Phase 2 (i.e., after the first twenty tasks).
you saw any feedback on your performance and the algo-
rithm’s performance)?” We also collected basic demographic
information (such as age and gender) about each subject.
To quantify a subject’s trust in a model, we defined two
metrics, calculated separately for each phase, that capture
how often the subject “followed” the model’s predictions:
• Agreement fraction: the number of tasks for which the
subject’s final prediction agreed with the model’s predic-
tion, divided by the total number of tasks.
• Switch fraction: the number of tasks for which the sub-
ject’s initial prediction disagreed with the model’s pre-
diction and the subject’s final prediction agreed with the
model’s prediction, divided by the total number of tasks
for which the subject’s initial prediction disagreed with
the model’s prediction.
We used these two metrics when formally stating all of our
pre-registered hypotheses, while additionally pre-registering
our intent to analyze subjects’ self-reported trust levels.
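As a concrete illustration, the two metrics can be computed as follows. This is a minimal sketch with our own function names, operating on per-task lists of binary predictions for a single phase.

# A minimal sketch of the two trust metrics defined above. Inputs
# are per-task lists of a subject's initial and final predictions
# and the model's predictions for one phase; names are ours.
def agreement_fraction(final, model):
    """Fraction of tasks where the final prediction matches the model."""
    return sum(f == m for f, m in zip(final, model)) / len(model)

def switch_fraction(initial, final, model):
    """Among tasks where the initial prediction disagreed with the
    model's, the fraction where the final prediction agreed with it."""
    disagreed = [(f, m) for i, f, m in zip(initial, final, model) if i != m]
    if not disagreed:
        return float("nan")  # undefined if there were no disagreements
    return sum(f == m for f, m in disagreed) / len(disagreed)

# Example: the subject initially disagreed with the model on two
# tasks and switched on one of them, so the switch fraction is 0.5.
print(agreement_fraction([1, 0, 0, 1], [1, 0, 1, 1]))             # 0.75
print(switch_fraction([1, 0, 0, 0], [1, 0, 0, 1], [1, 0, 1, 1]))  # 0.5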
Analysis of Trust in Phase 1 (H1 and H3)
We start by analyzing data from Phase 1 to see if subjects’
trust in a model is affected by the model’s stated accuracy
and the amount at stake before they see the feedback screen.
Figures 3a and 3b show subjects’ average agreement fraction
and average switch fraction, respectively, in Phase 1, by treat-
ment. Visually, stated accuracy appears to have a substantial
effect on how often subjects follow the model’s predictions.
Subjects’ final predictions agree with the model’s predictions
more often when the model has a high stated accuracy. How-
ever, the effect of the amount at stake is less apparent. To
formally compare the treatments, we conduct a two-way
ANOVA on subjects’ agreement fractions and, respectively,
switch fractions in Phase 1. The results suggest a statistically
significant main effect of stated accuracy on how often sub-
jects follow the model’s predictions (effect size η² = 0.036,
p = 4.72 × 10⁻¹⁵ for agreement fraction, and η² = 0.061,
p = 5.62 × 10⁻²⁶ for switch fraction), while the main effect of
the amount at stake is insignificant (p = 0.30 and p = 0.11
for agreement fraction and switch fraction, respectively).
We do not detect a significant interaction between the two factors (p = 0.77 and p = 0.62 for agreement fraction and switch fraction, respectively). In other words, hypothesis H1 is supported by our experimental data, while H3 is not.

[Figure 3: Comparing how often subjects in different experimental treatments follow an ML model’s predictions (average agreement fraction and average switch fraction) during each phase of our first experiment. Panels: (a) Phase 1 agreement fraction; (b) Phase 1 switch fraction; (c) Phase 2 agreement fraction; (d) Phase 2 switch fraction. Each panel plots the metric against stated accuracy (None, 60%, 70%, 90%, 95%) for the low-stakes and high-stakes treatments. Error bars represent standard errors.]
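As an illustration of this kind of analysis, the sketch below runs a two-way ANOVA (stated accuracy × amount at stake) on per-subject agreement fractions using statsmodels, with a simple eta squared computed from the sums of squares. The data are synthetic stand-ins, and the column names are assumptions for the sketch, not the paper’s data:

    # Illustrative two-way ANOVA (stated accuracy x amount at stake) on
    # per-subject Phase 1 agreement fractions, using statsmodels. Synthetic
    # stand-in data, NOT the paper's data; column names are assumptions.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    rng = np.random.default_rng(0)
    n = 500  # one row per subject
    df = pd.DataFrame({
        "accuracy": rng.choice(["none", "60%", "70%", "90%", "95%"], size=n),
        "stakes": rng.choice(["low", "high"], size=n),
    })
    # Fake effect: higher agreement under high stated accuracy, no stakes effect.
    df["agreement"] = (0.75
                       + 0.05 * df["accuracy"].isin(["90%", "95%"])
                       + rng.normal(0.0, 0.1, size=n))

    model = ols("agreement ~ C(accuracy) * C(stakes)", data=df).fit()
    table = sm.stats.anova_lm(model, typ=2)
    # Simple eta squared: each effect's sum of squares over the total.
    table["eta_sq"] = table["sum_sq"] / table["sum_sq"].sum()
    print(table)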
An analysis of subjects’ self-reported levels of trust reveals
a similar pattern. We detect a statistically significant main
effect of stated accuracy on subjects’ self-reported levels of
trust during Phase 1 (η² = 0.049, p = 1.61 × 10⁻²⁰), while the
main effect of the amount at stake is insignificant (p = 0.92).
We also conduct a post-hoc Tukey’s HSD test to identify
pairs of treatments in which subjects exhibit distinct dif-
ferences in how often they follow the model’s predictions.
We find that treatments can be clustered into two groups—
treatments with an accuracy level of none, 60%, or 70%, and
treatments with an accuracy level of 90% or 95%—such that
almost all statistically significant results are found for across-
group treatment pairs.4 These results confirm our visual
intuition from Figures 3a and 3b: when subjects have not yet
observed the model’s accuracy in practice, they tend to follow
the predictions of models …
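For the post-hoc comparison, a Tukey HSD test over the stated-accuracy treatments might look like the following sketch, continuing with the synthetic `df` from the ANOVA example above (again an illustration, not the authors’ code):

    # Post-hoc Tukey HSD across the stated-accuracy treatments, continuing
    # with the synthetic `df` from the ANOVA sketch above (illustrative only).
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    result = pairwise_tukeyhsd(endog=df["agreement"],  # per-subject metric
                               groups=df["accuracy"],  # treatment label
                               alpha=0.05)
    print(result.summary())  # pairwise mean differences with adjusted p-values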