Business and Management
Topic:
Data Mining
Type of work:
Term Paper
Level:
College
Number of pages:
4 pages = 250/-
Grade:
High Quality (Normal Charge)
Formatting style:
APA
Language Style:
English (U.S.)
Sources:
4
Website Region:
United States
Instructions:
You are asked to write an essay to discuss the topic of data mining. Please address each of the following:
1. Based on the assigned readings (3 research papers) and textbook, what is data mining? Are the definitions consistent or contradictory?
2. Describe the various applications of data mining based on the following perspectives:
   a. Biomedical Data Mining
   b. Educational Technology Classroom Research
   c. Public Internet Data Mining
   For each perspective, address the benefits, issues, and challenges.
Guidelines:
1. APA style.
2. References: make sure you reference all research papers in your essay. You are also expected to reference the course textbook. When you do, indicate page numbers.
3. All referenced material must be paraphrased, not quoted. If you need more information on how to paraphrase, refer to the "Avoiding Plagiarism" section of the course website.
4. Your essay will have at least 1,000 words excluding references and title page.
5. Your essay will be submitted to Turnitin.com. Since you are not allowed to quote, only paraphrase, your similarity report should be blue or green. A yellow, orange, or red outcome, depending on its severity, may result in a failing grade.
6. You will submit one Word document. No other document types will be allowed.
Rubric:
1. 40% Content: Did you answer the questions?
2. 30% References: How well did you incorporate references in your essay?
3. 30% Writing: Is your document free from spelling and grammatical errors?
Here is the link to the text book https://www.sendspace.com/file/6qzqsi
Recent Advances and Emerging Applications in Text
and Data Mining for Biomedical Discovery
Graciela H. Gonzalez, Tasnia Tahsin, Britton C. Goodale, Anna C. Greene and
Casey S. Greene
Corresponding author. Casey S. Greene, Institute for Translational Medicine and Therapeutics, 10-131 Smilow Center for Translational Research, 3400
Civic Center Boulevard, Building 421, Philadelphia, PA 19104-5158, USA. Tel.: 215-573-2991; Fax: 215-573-9135; E-mail: [email protected]
Abstract
Precision medicine will revolutionize the way we treat and prevent disease. A major barrier to the implementation of
precision medicine that clinicians and translational scientists face is understanding the underlying mechanisms of disease.
We are starting to address this challenge through automatic approaches for information extraction, representation and
analysis. Recent advances in text and data mining have been applied to a broad spectrum of key biomedical questions in
genomics, pharmacogenomics and other fields. We present an overview of the fundamental methods for text and data
mining, as well as recent advances and emerging applications toward precision medicine.
Key words: text mining; data mining; biomedical discovery; gene prioritization; pharmacogenomics; toxicology
Introduction
Technologies that resulted in the successful completion of the
Human Genome Project and those that have followed it afford an
unprecedented breadth of data collection avenues (whole-genome
expression data, chip-based comparative genomic hybridization
and proteomics of signal transduction pathways, among many
others) and have resulted in exceptional opportunities to advance
the understanding of the genetic basis of human disease.
However, high-throughput results are usually only the first step in
a long discovery process, with subsequent and much more time-
consuming experiments that, in the best of cases, culminate in the
publication of results in journals and conference proceedings.
Rather than stopping at the publication stage, the challenge for
precision medicine is then to translate all of these research results
into better treatments and improved health. To achieve this goal,
a range of analytic methods and computational approaches have
evolved from other domains and have been applied to an ever-
growing set of specific problem areas. It would be impossible to
enumerate the numerous biological questions targeted by
computational approaches. We will focus here on an overview of
text and data mining methods and their applications to discovery
in a broad range of biomedical areas, including biological pathway
extraction and reasoning, gene prioritization, precision medicine,
pharmacogenomics and toxicology. The advances are plenty and the specific areas of application diverse, but the fundamental motivation is to aid scientists in analyzing available data to suggest a road to discovery and to precise predictions that lead to better health.
Graciela H. Gonzalez is an Associate Professor in the Department of Biomedical Informatics at Arizona State University, Scottsdale, Arizona, United States.
Tasnia Tahsin is a PhD student in the Department of Biomedical Informatics at Arizona State University, Scottsdale, Arizona, United States.
Britton C. Goodale is a postdoctoral fellow in the Department of Microbiology and Immunology at the Geisel School of Medicine at Dartmouth College,
Hanover, New Hampshire, United States.
Anna C. Greene is the Assistant Curriculum Director for the Graduate Program in Quantitative Biomedical Sciences at Dartmouth College, Hanover, New
Hampshire, United States.
Casey S. Greene is an Assistant Professor in the Department of Systems Pharmacology and Translational Therapeutics in the Perelman School of Medicine
at the University of Pennsylvania, Philadelphia, Pennsylvania, United States.
Submitted: 17 February 2015; Received (in revised form): 26 August 2015
© The Author 2015. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
Briefings in Bioinformatics, 17(1), 2016, 33–42. doi: 10.1093/bib/bbv087. Advance Access Publication Date: 29 September 2015.
Background
Data mining
Data mining is the act of computationally extracting new infor-
mation from large amounts of data [1], and the biological sci-
ences are generating enormous quantities of data, ushering in
the era of ‘big data’. Stephens et al. state that sequencing data alone constitutes ~35 petabases/year and will grow to ~1 zettabase/year by 2025 [2]. This creates a large opportunity for the
development and deployment of novel mining algorithms, and
two recent reviews on data and text mining in the era of big
data are found in Che et al. [3] and Herland et al. [4]. A wide
variety of methods for extracting value from different types and
models of data fall under the umbrella of ‘data mining’.
Classification algorithms (decision trees, naïve Bayesian
classification and other classifiers), frequent pattern algorithms
(association rule mining, sequential pattern mining and others),
clustering algorithms (including methods to cluster continuous
and categorical data) and graph and network algorithms have
all evolved to present a diverse landscape for research and an
arsenal to deploy against the toughest data challenges. Most
researchers consider some other areas, including text mining,
as being under the data mining umbrella. For example,
Piatetsky-Shapiro states: ‘Data Mining in my opinion includes:
text mining, image mining, web mining, predictive analytics,
and much of the techniques we use for dealing with massive
data sets, now known as Big Data’ [5]. The methods applied to
text mining, however, are specialized to such a degree that it is
common to view it as a separate area of specialty. Data mining
courses do not usually include any text mining material, but ra-
ther there are separate courses dedicated to it, and the same
applies to textbooks.
A complete coverage of data mining techniques is beyond
the scope of this article though we have included some import-
ant resources that cover this topic. Kernel Methods in
Computational Biology by Schölkopf, Tsuda and Vert [6] covers
methods specific to Computational Biology. Introduction to Data
Mining [7] and Data Mining: Concepts and Techniques, 3rd edn [8] are
two popular textbooks in data mining and give an excellent
overview of the field. A more concise presentation can be found
in the paper by Xindong Wu et al., Top 10 algorithms in data mining [9]. The top 10 algorithms, identified in December 2006, are C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes and CART; together they cover clustering, classification and association analysis, which are among the most important topics in data mining research:
• According to Jain et al. in ‘Data clustering: a review’, ‘Clustering is
the unsupervised classification of patterns (observations, data
items, or feature vectors) into groups (clusters)’ [10].
• Classification is akin to clustering because it segments data into
groups called classes, but unlike clustering, classification
analyses require knowledge and specification of how classes are
defined.
• Statistical learning theory seeks ‘to provide a framework for studying the problem of inference, that is, of gaining knowledge, making predictions, making decisions or constructing models from a set of data’, state Bousquet et al. [11]. A textbook on statistical learning expands on these notions [12].
• Association analysis facilitates the unmasking of hidden
relationships in large data sets. The discovered associations are
then expressed as rules or sets of items that frequently occur to-
gether. Challenges to association analysis methods include that
discovering such patterns can be computationally expensive
given a large input data set and that there could potentially be
many spurious associations ‘discovered’ that simply occur by
chance. A well-known introduction to the topic is found in [13],
and in particular, a seminal paper on mining association rules
from clinical databases is found in Stilou et al. [14].
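(A minimal worked sketch of frequent-itemset mining appears immediately after this list.)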
• Link analysis analyzes hyperlinks and the graph structure of the
Web for the ranking of web search results. PageRank is perhaps
the best-known algorithm for link analysis [15].
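To make the frequent-pattern ideas above concrete, the following minimal sketch enumerates frequent itemsets in the Apriori style. The transactions, item names and support threshold are invented for illustration, and the brute-force candidate enumeration deliberately omits the downward-closure pruning that makes real Apriori implementations scale.

```python
# Minimal sketch of frequent-itemset mining in the Apriori style.
# Transactions and the support threshold are illustrative only.
from itertools import combinations

transactions = [
    {"geneA", "geneB", "drugX"},
    {"geneA", "geneB"},
    {"geneA", "drugX"},
    {"geneB", "drugX"},
    {"geneA", "geneB", "drugX"},
]
min_support = 0.4  # fraction of transactions an itemset must appear in

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

items = sorted({i for t in transactions for i in t})
frequent = {}
for size in (1, 2, 3):
    for combo in combinations(items, size):
        s = support(set(combo))
        if s >= min_support:
            frequent[combo] = s
# Real Apriori would prune candidates whose subsets are infrequent;
# this brute-force loop skips that optimization for clarity.

for itemset, s in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(itemset, round(s, 2))
```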
In a notable transition showing the power of new algorithms
and data, data mining approaches are now being used to learn,
not just the primary features but also context-specific features.
For example, initial data mining approaches that constructed
gene–gene networks built a single network [16]. In contrast,
recent approaches learn multiple context-specific networks,
allowing the construction of process-specific [17] and
tissue-specific networks [18–20]. An individual is made up of a
personalized combination of such context-specific networks, so
we anticipate that continued advances in the context specificity
of data mining approaches will play an important role in the
broad implementation of precision medicine.
Text mining
Text mining is a subfield of data mining that seeks to extract
valuable new information from unstructured (or semi-
structured) sources [21]. Text mining extracts information from
within those documents and aggregates the extracted pieces
over the entire collection of source documents to uncover or de-
rive new information. This is the preferred view of the field that
allows one to distinguish text mining from natural language pro-
cessing (NLP) [22, 23]. Thus, given as input a set of documents,
text mining methods seek to discover novel patterns, relation-
ships and trends contained within the documents. Aiding the
overall goal of discovering new information are NLP programs
that go from the relatively simple text processing tasks at the
lexical or grammatical levels (such as a tokenizing or a part-
of-speech tagger), to relatively complex information extraction
algorithms [like named entity recognition (NER) to find concepts
such as genes or diseases, normalization to map them to their
unique identifiers or relationship extraction and sentiment ana-
lysis systems, among others]. The greater the complexity of the
task, the more likely it is to integrate methods from data mining
(such as classification or statistical learning).
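As a toy illustration of the lower rungs of this ladder, the sketch below performs simple tokenization and dictionary-based NER with character offsets. The mini-lexicon and sentence are invented for illustration and are not drawn from any system cited here.

```python
# Minimal sketch of two NLP steps feeding a text mining pipeline:
# tokenization, then dictionary-based named entity recognition (NER).
import re

# Hypothetical mini-lexicon mapping surface forms to entity types.
LEXICON = {"BRCA1": "gene", "interleukin 6": "protein", "tocilizumab": "drug"}

def tokenize(text):
    """Split text into word tokens (a deliberately simple lexical step)."""
    return re.findall(r"\w+|\S", text)

def find_entities(text):
    """Tag lexicon entries with their character offsets and type."""
    entities = []
    for term, etype in LEXICON.items():
        for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            entities.append((m.start(), m.end(), term, etype))
    return sorted(entities)  # sorted by start offset

sentence = "Doctors suggested tocilizumab to counter high interleukin 6 levels."
print(tokenize(sentence))
print(find_entities(sentence))
```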
Although there is no current textbook that can be considered
the definitive guide on text mining as defined above, there are a
couple of classic textbooks that cover fundamental NLP
techniques and at least the first covers some of the analytics
required to discover information: Speech and Language Processing
by Jurafsky and Martin [24] and Foundations of Statistical Natural
Language Processing by Manning and Schuetze [25]. The biomed-
ical domain is one of the most interesting application areas for
text mining, given both the potential impact of the information
that can be discovered and the specific characteristics and
volume of information available. The textbook Text mining for
biology and medicine [26] offers an overview of the fundamental
approaches to biomedical NLP, emphasizing different sub-areas
in each chapter, although overall it does not totally adhere to the
definition of text mining as a means for discovery given by
Hearst [23]. A good non-textbook review of the different subareas
is the article ‘Frontiers of biomedical text mining: current
progress’ [27]. For those just starting in the area, the article
‘Getting Started in Text Mining’ [28] is a good starting point. A
more in-depth treatment of automated techniques applied to
the biomedical literature and its contribution to innovative
biomedical research can be found in ‘Text-mining solutions for
biomedical research: enabling integrative biology’ [29].
Text mining sub-areas, briefly summarized, include:
• Information Retrieval (IR) deals with the problem of finding relevant
documents in response to a specific information need (query).
An overview of tools for information retrieval from the biomed-
ical literature can be found in [30].
• NER is at the core of the automatic extraction of information
from text and deals with the problem of finding references to
entities (mentions) such as genes, drugs and diseases present in
natural language text and tagging them with their location and
type. NER is also referred to as ‘entity tagging’ or ‘concept extrac-
tion’. This is a basic building block for almost all other extraction
tasks. NER in the biomedical domain is generally considered to
be more difficult than other domains, such as geography or news
reports. This is owing to inconsistency in how known entities,
such as symptoms or drugs, are named (e.g. nonstandard abbre-
viations and new ways of referring to them). An open-source
NER engine, BANNER [31], with models to recognize genes and
diseases mentioned in biomedical text, is currently available for
gene and disease NER, and LINNAEUS is available for species
[32]. Rebholz-Schuhmann et al. [33] present an overview of the
NER solutions for the second CALBC task, including protein,
disease, chemical (drug) and species entities. Campos et al. [34]
discuss a recent survey of tools for biomedical NER. A system assigning text to a wide range of semantic classes using linguistic rules is presented in [35], illustrating a task slightly different from standard NER because classes may potentially overlap. Verspoor et al.
[36] use the CRAFT corpus to improve the evaluation of gene NER
(and some lower-level tasks like part-of-speech and sentence
segmentation). Recent work in [37] presents an NER system for
extracting gene and protein sequence variants from the
biomedical literature. For locating chemical compounds,
Krallinger et al. [38] summarize the task that was part of
BioCreative IV and give a short overview of some of the
techniques used.
• Named Entity Identification allows the linkage of objects of
interest, such as genes, to information that is not detailed in a
publication (such as their Entrez Gene identifier) [39]. Two open-
source systems using largely dictionary-based approaches to
normalize gene names appear in [39–41]. For normalizing disease
names, [42] introduces DNorm, a new normalization framework
using machine learning, with strong results.
• Association extraction is one of the higher-level tasks still
considered purely an information extraction application. It uses
the output from the prior subtasks to produce a list of (binary or
higher) associations among the different entities of interest.
Catalysts for advances in this area have been the Biocreative and
BioNLP shared tasks, with excellent teams from around the
world putting their systems to the test against carefully
annotated data sets. A survey of submissions to Biocreative III
[43] and BioNLP [44, 45] shows a good overview of approaches re-
sponsive to the respective shared tasks. Putting together associ-
ations into networks of molecular interactions that can explain
complex biological processes is the next logical step, and one
that still is considered the ‘holy grail’ of automatic biomolecular
extraction. Ananiadou et al. [46] and Li et al. [47] discuss
comprehensive surveys of methods for the extraction of network
information from the scientific literature and the evaluation of
extraction methods against reference corpora. Semantic-based
approaches such as [48] will make their mark in the coming
years.
• Event extraction is similar to association extraction but instead
of separately extracting various relations between different
entities in text, this task focuses on identifying specific events
and the various players involved in them (arguments). For instance,
the arguments of a transport event will include the molecule
being transported, the cell to which it is being transported and
the cell from which it is being transported. Event extraction was
a key component of the BioNLP Shared Tasks in both 2011 [45]
and 2013 [49], challenging the biomedical community to expand
and cultivate their approaches in this area and leading to stead-
ily improving results.
• Pathway extraction is a budding branch of biomedical text
mining closely following the footsteps of event extraction. It
involves the automated construction of biological pathways
through the extraction and ordering of pathway-related events
from text. To date, like [50] and [51], the majority of researchers in this domain have focused their efforts on supporting pathway curation through event extraction rather than entirely automating the process. Tari et al., however, were able to achieve promising results for the automated synthesis of pharmacokinetic pathways by applying an automated reasoning-based approach for event ordering [52]. The first shared task on Pathway
Curation was organized by BioNLP in 2013 [49] to establish
the current state-of-the-art performance level for extract-
ing pathway-relevant events such as phosphorylation and
transport.
In the end, a set of the different subtask solutions is used
in a pipeline that allows information to be integrated and
analyzed toward knowledge discovery. However, this multiplies
the effects of errors down the pipeline, leaving systems highly
vulnerable.
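A back-of-the-envelope calculation shows why. If each stage of a hypothetical three-stage pipeline were, say, 90, 92 and 85 percent accurate (figures invented purely for illustration), the end-to-end yield decays multiplicatively:

```python
# Illustration of error propagation in a multi-stage pipeline:
# accuracy decays multiplicatively with pipeline depth.
# The per-stage accuracies below are hypothetical.
stage_accuracy = {"NER": 0.90, "normalization": 0.92, "relation extraction": 0.85}

overall = 1.0
for stage, acc in stage_accuracy.items():
    overall *= acc
    print(f"after {stage:20s}: {overall:.2f}")
# After the final stage, the pipeline retains only ~0.70 of the signal.
```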
An overarching challenge for biomedical text mining is to
incorporate the many knowledge resources that are available to
us into the NLP pipeline. In the biomedical domain, unlike the
general text mining domain, we have access to large numbers
of extensive, well-curated ontologies and knowledge bases.
Biomedical ontologies provide an explicit characterization of a
given domain of interest. The quality of data mining efforts
would likely increase if existing ontologies (e.g. UMLS [53] and
BioPortal [54]) were used as sources of terms in building
lexicons, for figuring out what concept subsumes another, and
as a way of normalizing alternative names to one identifier. For
example, using ontologies as described enabled the use of
unstructured clinical notes for generating practice-based
evidence on the safety of a highly effective, generic drug for
peripheral vascular disease [55].
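A minimal sketch of this normalization idea follows, with invented concept identifiers and synonym lists standing in for a real resource such as UMLS or BioPortal:

```python
# Sketch of ontology-backed normalization: alternative surface names
# are mapped to one canonical concept identifier. The identifiers and
# synonyms below are invented; in practice they would come from a
# curated resource such as UMLS or BioPortal.
ONTOLOGY = {
    "CUI:0001": {"preferred": "myocardial infarction",
                 "synonyms": {"heart attack", "mi", "myocardial infarct"}},
    "CUI:0002": {"preferred": "hypertension",
                 "synonyms": {"high blood pressure", "htn"}},
}

# Invert the ontology into a lookup table of lowercase surface forms.
LOOKUP = {}
for cui, entry in ONTOLOGY.items():
    for name in {entry["preferred"], *entry["synonyms"]}:
        LOOKUP[name.lower()] = cui

def normalize(mention):
    """Map a textual mention to its concept identifier, if known."""
    return LOOKUP.get(mention.lower().strip())

print(normalize("Heart attack"))  # -> CUI:0001
print(normalize("HTN"))           # -> CUI:0002
```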
Today, the data being generated is massive, complex and
increasingly diverse owing to recent technological innovations.
However, the impact of this data revolution on our lives is
hampered by the limited amount of data that has been
analyzed. This necessitates data mining tools and methods that
can match the scale of the data and support timely decision-
making through integration of multiple heterogeneous data
sources.
Finally, another area in which the field has fallen short is
that of making text mining applications that are easily
adaptable by end users. Many researchers have developed
systems that can be adapted by other text mining specialists,
but applications that can be tuned by bench scientists are
mostly lacking.
Application areas
Pathway extraction and reasoning
Analyzing the intricate network of biological pathways is an
essential precursor to understanding the molecular mechan-
isms of complex diseases affecting humans. Without acquiring
a deeper insight into the underlying mechanisms behind such
diseases, we cannot advance in our efforts to design effective
solutions for preventing and treating them. However, given the
vast amount of data currently available on biological pathways
in biomedical publications and databases and the highly inter-
connected nature of these pathways, any attempt to manually
reason over them will invariably prove to be largely ineffective
and inefficient. As a result, there is a growing need for computa-
tional approaches to address this demanding task through
automated pathway analysis. Pathway analysis can be either
quantitative or qualitative and is a key focus of the growing field
of Systems Biology. Quantitative pathway analysis uses dy-
namic mathematical models for simulating pathways and can
be especially useful in drug discovery and the development of
patient-specific dosage guidelines [56]. Some examples of tech-
niques used in this form of analysis include ordinary differen-
tial equations [57], Petri Nets [58], and π-calculus [59].
Qualitative pathway analysis uses static, structural representa-
tions of pathways to answer qualitative questions about them;
for instance it may be used to explain why a certain phenom-
enon occurs in the pathway based on existing pathway know-
ledge. Artificial intelligence paradigms, such as symbolic (i.e.
explicit representations) or connectionist (i.e. massively paral-
lelized) approaches, can greatly inform this type of pathway
analysis [60]. Although some of the techniques principally ad-
dressing quantitative pathway analysis, such as Petri Nets and
π-calculus, may also be used to perform qualitative pathway
analysis, they typically tend to provide limited functionality
[61]. Therefore, richer languages such as Maude [62], BioCham
[63] and action languages [52, 64, 65] are more popular in this
domain. In recent years, hybrid approaches have been applied
for qualitative pathway reasoning. For instance, [66] presents a
qualitative pathway reasoning system that uses Petri net se-
mantics as the pathway specification language and action lan-
guages as the query language. Pathway reasoning, as a
technique, relies on either humans defining the pathway
information needed or the development of new algorithms to
extract, represent and reason over biological pathways, which is
an area of growing interest.
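To illustrate the quantitative side, the sketch below simulates a toy two-step conversion pathway (A to B to C) with mass-action kinetics using ordinary differential equations. The rate constants are arbitrary illustrative values; the model is not taken from any system cited above.

```python
# Sketch of quantitative pathway analysis via an ODE model:
# a two-step conversion A -> B -> C with mass-action kinetics.
import numpy as np
from scipy.integrate import solve_ivp

k1, k2 = 0.8, 0.3  # hypothetical rate constants (1/time)

def pathway(t, y):
    a, b, c = y
    return [-k1 * a, k1 * a - k2 * b, k2 * b]

sol = solve_ivp(pathway, t_span=(0, 20), y0=[1.0, 0.0, 0.0],
                t_eval=np.linspace(0, 20, 5))
for t, (a, b, c) in zip(sol.t, sol.y.T):
    print(f"t={t:5.1f}  A={a:.3f}  B={b:.3f}  C={c:.3f}")
```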
Gene prioritization and gene function prediction
Complex diseases present diverse symptoms because they are
caused by multiple genes and environmental factors that differ
for each individual and can diverge at different stages of the
disease process. This complexity is reflective of epistatic effects
where causative genes have an impact on the expression of
many other genes. Because variant expression levels vary
across the genome, it is difficult to determine true causative
genes or distinguish key sets affected by the disease from high-
throughput experiments. For example, the Affymetrix U133 Plus 2.0 microarray chip from the Repository of Molecular Brain Neoplasia Data (REMBRANDT) shows more than 7500 two-fold differentially expressed
genes in brain cancer tissue when compared with normal brain
tissue [67]. The validation of a single causative gene is a long
and expensive process [68], often taking up to a year and even
longer, which necessitates using gene prioritization to pare
down the list of potential gene targets to a manageable size.
Gene prioritization methods that suggest the most significant
prospects for further validation are critically needed, and
method development in this area would greatly facilitate
discovery.
Many gene prioritization algorithms have been developed to
address this problem, such as GeneWanderer [69], GeneSeeker
[70], GeneProspector [71], SUSPECTS [72], G2D [73] and
Endeavour [74], among others [75, 76]. A comparative review of
these methods can be found in Tranchevent et al. [77]. The gen-
eral premise of these methods is to rank genes based on the
similarity between a set of candidate genes compared with
genes already known to be associated with the disease (usually
called the training set). Similarity is established based on
different parameters (depending on the specific method) and
may include purely biological measures (such as cytogenetic
location, expression patterns, patterns of pathogenic mutations
or DNA sequence similarity), biological measures plus
annotation of the genes using different protein databases (for
example, UniProt [78] and InterPro [79]), or other vocabularies
and ontologies (such as the Gene Ontology [80, 81], eVOC [82],
MeSH [83] and term vectors from the literature). In these
methods, the closer a gene in the candidate list coincides with
the profile of the training genes, the higher it is ranked.
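A minimal sketch of this ranking premise appears below: candidates are scored by cosine similarity to the centroid of the training set, so the candidate closest to the known-gene profile ranks highest. Gene names and feature values are invented, and none of the cited tools is implied to work exactly this way.

```python
# Sketch of similarity-based gene prioritization: rank candidates by
# cosine similarity to the centroid of known disease genes.
import numpy as np

# Hypothetical feature vectors (e.g. expression/annotation-derived).
training = {"GENE1": [0.9, 0.1, 0.8], "GENE2": [0.8, 0.2, 0.9]}
candidates = {"CANDA": [0.85, 0.15, 0.80], "CANDB": [0.10, 0.90, 0.20]}

centroid = np.mean(list(training.values()), axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

ranked = sorted(candidates.items(),
                key=lambda kv: cosine(np.array(kv[1]), centroid),
                reverse=True)
for gene, feats in ranked:
    print(gene, round(cosine(np.array(feats), centroid), 3))
```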
Gene prioritization includes the related area of gene function prediction. The Critical Assessment of protein Function Annotation (CAFA) experiment was the first large community-wide
evaluation of 54 methods that were compared on a core set of
annotations using evaluation metrics to ascertain the top meth-
ods [84]. Earlier computational methods for prioritization were
compared through a large-scale biological assay of yeast
mitochondrial phenotypes and found to be effective [85, 86]. A
related but distinct gene prioritization problem is the identifica-
tion of genes with tissue-specific expression patterns [87].
Existing webservers such as GeneMANIA [88, 89] and IMP [90]
allow biologists to perform gene prioritization by network
connectivity, and servers such as PILGRM allow for prioritiza-
tion directly by gene expression [91]. Predicted functions, in
addition to curated functions, have also shown promise for in-
terpreting the results of genome-wide association studies,
which aim to pair genetic variants with associated genes and
pathways [92].
Precision medicine and drug repositioning
Precision medicine is determining prevention and treatment
strategies based on an individual’s predisposition in an effort to
provide more targeted and therefore effective treatments [93].
This area is poised for intense growth based on the ease of
obtaining patient data and the development of computational
methods with which to analyze this personalized data. While
precision medicine is a nascent field, there have been many ad-
vances in the personalized treatment of cancer. Some hospitals
are already using genetic data to direct treatment options for
cancer patients (e.g. BRCA1 and BRCA2 [94], BRAF [95] testing), though drugs targeted to specific mutations lag behind; this is an area where computational drug repositioning will potentially have a strong impact [96].
On the clinical side of translational research, the demand for
timely and accurate knowledge has the urgency of life itself.
Emily Whitehead was the first child with acute lymphoblastic
leukemia to be treated and cured with an experimental T cell
therapy called CAR T cell therapy at the Children’s Hospital of
Philadelphia [97]. The therapy enables the patient’s T cells to
recognize and attack malignant B cells, but this treatment can
also trigger an intense immune reaction, which Emily experi-
enced. She suffered from a high level of the interleukin 6 pro-
tein, and her doctors suggested trying tocilizumab (Actemra), a
rheumatoid arthritis drug, to combat the extraneous protein
production [97, 98]. This drug returned Emily’s vital signs back
to normal. In this case, rather than relying on the serendipity of
a team member knowing about the right drug, specialized text
mining could have been used to mine the literature for the
relevant drugs. In such a scenario, either the literature would be
mined in advance, with the extracted relationships between drugs and genes or proteins stored in a database, or it could be searched in real time. As an example of this, Essack et al. created
a sickle cell disease knowledgebase by mining 419 612 PubMed
abstracts related to red blood cells, anemia or this disease [99].
Some databases (such as PharmGKB) store such relationships,
but are not the result of automatic extraction. Manual curation
is still the current standard for such databases, with the value
of text mining applications yet to be fully realized. Currently,
despite notable advances in entity mention extraction and
normalization, the use of text mining is mostly limited to aiding
curators to speed up the process.
Data and text mining methods are useful for biomedical
predictions and can be successfully extended to biomedical
discoveries as well. Sirota et al. used publicly available gene
expression data for both drugs and diseases to ascertain if Food
and Drug Administration-approved drugs could be repositioned
for use in new diseases [100]. They discovered and experimen-
tally validated the use of cimetidine, generally used for
heartburn and peptic ulcers, as a treatment …
Computers & Education 113 (2017) 226–242
Data mining in educational technology classroom research:
Can it make a contribution?
Charoula Angeli a, *, Sarah K. Howard b, Jun Ma b, Jie Yang b,
Paul A. Kirschner c, d
a University of Cyprus, Cyprus
b University of Wollongong, Australia
c Open University of the Netherlands, The Netherlands
d University of Oulu, Finland
Article info
Article history:
Received 18 June 2016
Received in revised form 15 March 2017
Accepted 29 May 2017
Available online 30 May 2017
Keywords:
Educational data mining
Educational technology research
Association rules mining
Fuzzy representations
* Corresponding author. 11-13 Dramas street, P.O. Box 20537, Department of Education, University of Cyprus, CY-1678, Nicosia, Cyprus.
E-mail address: [email protected] (C. Angeli).
http://dx.doi.org/10.1016/j.compedu.2017.05.021
0360-1315/© 2017 Elsevier Ltd. All rights reserved.
Abstract
The paper addresses and explains some of the key questions about the use of data mining
in educational technology classroom research. Two examples of the use of data mining techniques, namely association rules mining and fuzzy representations, are presented, one from a study conducted in Europe and the other from a study in Australia. Both of these studies examine student
learning, behaviors, and experiences within computer-supported classroom activities. In
the first study, the technique of association rules mining was used to understand better
how learners with different cognitive types interacted with a simulation to solve a prob-
lem. Association rules mining was found to be a useful method for obtaining reliable data
about learners' use of the simulation and their performance with it. The study illustrates
how data mining can be used to advance educational software evaluation practices in the
field of educational technology. In the second study, the technique of fuzzy representations
was employed to inductively explore questionnaire data. The study provides a good
example of how educational technologists can use data mining for guiding and monitoring
school-based technology integration efforts. Based on the outcomes, the implications of
the study are discussed in terms of the need to develop educational data mining tools that
can display results, information, explanations, comments, and recommendations in
meaningful ways to non-expert users in data mining. Lastly, issues related to data privacy
are addressed.
1. Introduction
Data mining has long been used in marketing, advertising, health, engineering, and information systems. At its core, data
mining is an inductive, analytic, and exploratory approach, which is concerned with knowledge discovery through identi-
fication of patterns within large sets of data. In the last 10 years, the field of Educational Data Mining (EDM) has emerged as a
distinct area of research concerned with using data mining techniques to answer educational questions, such as, “What are
the difficulties students encounter during a learning activity?”, “What sequences of computer interactions lead to successful
problem-solving performance?”, and “What sequences of actions characterize high performers and low performers in
problem-solving activity?” EDM can also provide new insights into “wicked” educational problems, such as, “What are the
differences in the ways students experience learning,” and “How can learning designs account for variations in students'
learning experiences?”
In particular, EDM is concerned with developing methods for analyzing data from an educational system in order to detect
patterns in large datasets that would otherwise be very difficult or even impossible to analyze due to the vast volume of data
within which they exist (Romero & Ventura, 2013). Consequently, results from data mining can be used for deciding about
how to improve the teaching and learning process as well as how to design or redesign a learning environment (Ingram, 1999;
Romero & Ventura, 2007). Data mining techniques have been mostly used within the context of web-based or e-learning
education in order to: (a) suggest activities, resources, learning paths, and tasks for improving learners' performance and
adapting learning experience (Tang & McCalla, 2005); (b) provide feedback to teachers and instructional designers in regards
to learners' difficulties with the content and structure of a course, so that revisions can be made to facilitate students' learning
(Merceron & Yacef, 2010; Zaiane & Luo, 2001); (c) predict learners' performance (Ahmed & Elaraby, 2014); and (d) inform
administrators about the effectiveness of instructional programs, so that better planning and allocation of human and ma-
terial resources can be achieved (Romero & Ventura, 2007).
Based on a number of reviews and meta-analyses published (Baker & Yacef, 2009; Mohamad & Tasir, 2013; Romero &
Ventura, 2007, 2010), the most popular data mining techniques include: (a) clustering (Amershi & Conati, 2009; Beal, Qu,
& Lee, 2006; He, 2013; Perera, Kay, Koprinska, Yacef, & Zaiane, 2009); (b) regression (Buja & Lee, 2001); (c) association
rules mining (Lin, Alvarez, & Ruiz, 2002); and (d) sequential pattern mining (Perera et al., 2009). In clustering, the goal is to
split the data into clusters, such that, there is homogeneity within clusters and heterogeneity between clusters (Baker &
Siemens, 2014). In educational research, clustering procedures have been used to find patterns of effective problem-
solving strategies in exploratory computer-based learning environments (Amershi & Conati, 2009; Angeli & Valanides,
2004; Beal et al., 2006; He, 2013). In regression, the goal is to develop a model that can infer or predict something about a
dataset. In a regression analysis, a variable is identified as the predicted variable and a set of other variables as the predictors
(similar to dependent and independent variables in traditional statistical analyses) (Baker & Siemens, 2014). In association
rules mining, the goal is to extract rules of the form if-then, such that if some set of variable values is found, another variable
will generally have a specific value (Baker & Siemens, 2014). In sequential pattern mining, the aim is to find temporal as-
sociations between events to determine what path of student behaviors leads to a successful group project (Perera et al.,
2009).
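As a toy illustration of sequential pattern mining, the sketch below counts which two-step action sequences (bigrams) are most frequent among successful students. The action logs and outcome labels are invented and do not come from any study cited here.

```python
# Sketch of a simple sequential-pattern count over student action logs:
# which two-step sequences occur most often among successful students?
from collections import Counter

# Hypothetical per-student action sequences and outcomes.
logs = {
    "s1": (["open_sim", "run", "inspect", "run", "submit"], "success"),
    "s2": (["open_sim", "inspect", "run", "submit"], "success"),
    "s3": (["run", "submit", "run", "submit"], "fail"),
}

bigrams = Counter()
for actions, outcome in logs.values():
    if outcome == "success":
        bigrams.update(zip(actions, actions[1:]))  # consecutive pairs

print(bigrams.most_common(3))
```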
Currently, most work on data mining has at its base a computer science perspective rather than an educational
perspective. Within the educational domain, data mining techniques have been mostly used in e-learning or web-based
research, because of the ease of accessing student log data and performing automatic analyses of data. There is, however,
also a need to investigate the uses of EDM in real classrooms in order to understand better students' interactions with
technology as well as the complexities entailed in investigating how students with diverse needs and cognitive characteristics
perform with technology in these settings. The issue then becomes whether EDM can make a contribution to educational
technology classroom research in terms of providing tools and techniques that educational technology researchers can easily
grasp and apply to their own research in order to answer questions that cannot be easily answered by traditional statistical
techniques.
In view of that, in this paper, the authors, within the context of two different studies, describe their efforts in using data
mining procedures in educational technology classroom research, and, identify difficulties in applying data mining tech-
niques and tools in this research context. The first study was carried out in a European country and sought to investigate how
field-dependent and field-independent learners solved a problem using a stand-alone simulation tool. For the purposes of the
first study, the authors used a sequence, association, and link analysis for capturing and analyzing learners' interactions with
the simulation. The analysis provided a detailed and analytic description of the differences in field-dependent and field-
independent learners' problem-solving processes, providing at the same time clear understanding of field-dependent
learners' difficulties to take full advantage of the affordances of the simulation in order to maximize learning benefits. The
study contributes to educational technology research by presenting evidence about the effectiveness of EDM as an approach
for extracting useful process-related knowledge and actual student learning data that can be used for improving the learning
design of educational software and systems (Abdous, He, & Yen, 2012; Romero & Ventura, 2013). In turn, EDM can replace
traditional approaches to software evaluation, which mostly depend on surveys of students' perceptions of the system
(Bayram & Nous, 2004), by providing detailed data about what software features are or are not so successful with learners
that instructional designers can use in order to decide how to go about improving their learning designs. Consequently, data
mining techniques can become extremely useful in terms of providing ideas for implementing personalized learning to meet
students' individual needs (Chen, 2008; Lin, Yeh, Hung, & Chang, 2013). Some preliminary work in this area has been reported
by Hsu (2008) who applied association rules algorithms in the development of a personalized English Learning Recom-
mendation System, as well as by Chen and Duh (2008) who used a fuzzy technique to determine the difficulty parameters of
courseware and decide thereafter the content of courseware for personalized recommendation services.
The second study addresses the use of educational technology in Australian secondary schools. The research considers
variations in student experiences in an integrated learning environment and how this may relate to learning. The aim of the
study was to understand better the range of students' experiences with technology and accordingly to inform teachers' in-
tegrated learning designs. Due to the complexity of the learning environment and the large number of key factors affecting
students' experiences in the classroom, association rules mining and fuzzy representations were used to explore relations
among students' questionnaire responses and national assessment outcomes. The results showed significantly different
patterns of key technology integration factors related to literacy and numeracy outcomes. The findings provide guidance for
learning design in relation to how teachers may provide different experiences in technology-integrated learning to support all
learners. The study contributes to educational technology research by providing evidence of EDM as a useful approach for (a)
understanding school-based technology-related change initiatives, (b) determining where to focus classroom resources and
informing choices of technology tools, and (c) developing a deeper understanding of student technology-related experiences
(Abdous et al., 2012).
In the general discussion section of the paper, the authors discuss the contribution of data mining in educational tech-
nology classroom research, within the context of the two studies, while at the same time they also consider obstacles related
to the intrinsic difficulty associated with learning how to use data mining tools and apply EDM techniques to educational
data. Research directions aiming at making data mining tools and techniques more accessible to educational researchers are
discussed. Lastly, data-privacy issues are also addressed.
2. Study 1
2.1. Theoretical framework and research questions
In the first study, the authors used a data mining technique called sequence, association, and link analysis to understand
and best describe how the cognitive style of field dependence-independence (FD-I) affected undergraduate students' ability
to solve a problem using a glass-box simulation (Clariana & Strobel, 2008; Landriscina, 2013). According to Landriscina (2013),
simulations are distinguished into black-box or model-opaque simulations, and, glass-box or model-transparent simulations.
In black-box or model-opaque simulations, learners explore a system's behavior, but the underlying conceptual and
computational model of the simulation remains hidden. Thus, learners can only observe the results of the causal relationships
between the variables (Landriscina, 2013). Glass-box or model-transparent simulations, on the other hand, make the
structure of the model underlying the simulation visible to the learners in the form of a diagram with nodes and connecting
links between them (Landriscina, 2013).
FD-I is a cognitive style directly related to how humans perceive, organize, and process information (Morgan, 1997; Price,
2004; Witkin, Moore, Goodenough, & Cox, 1977). It is distinguished from learning styles, in that learning styles are subjective
accounts of individuals' instructional preferences across specific domains and tasks (Messick, 1987). FD-I was defined by
Witkin et al. (1977) as "the extent to which a person perceives part of a field as discrete from the surrounding field as a whole, rather than embedded in the field; or the extent to which the organization of the prevailing field determines perception of its components; or, to put it in everyday terminology, the extent to which the person perceives analytically" (pp. 6–7). Witkin et al.
(1977) conceptualized FD-I as a construct with two discrete modes of perception, such that, at the one extreme end
perception is dominated by the prevailing field and is designated as field dependent (FD), and at the other extreme end,
perception is more or less separate from the surrounding field and is designated as field independent (FI).
Contemporary research studies have examined the effects of learning with glass-box (model-transparent) simulations on
FI and FD learners' performance, and, found that FI learners outperformed FD learners during problem solving with this type
of simulation (Angeli, Valanides, & Kirschner, 2009; Burnett, 2010; Dragon, 2009). However, these investigations have pri-
marily focused on identifying quantitative differences in performance between FD and FI learners without providing detailed
information about FD and FI learners' interactions with the simulation, as well as related difficulties that learners encountered
during the problem-solving process with the simulation. While quantitative investigations are in general useful, they do not
provide enough insight about how to help those learners, such as for example FD learners, who usually encounter problems
during problem solving and need to be supported by the teacher so they can also have successful learning experiences with
technology.
Therefore, given the limitations of the existing body of research on FD and FI learners' problem solving with simulations,
the present study applied sequence, association, and link analyses to assess and compare FD and FI learners' interactions with
a glass-box simulation in order to solve a problem about immigration policy. The research purpose of the study was to identify
sequences of interactions with the simulation that were associated with successful performance and whether they differed
between FD and FI learners. Analytically, the research questions were stated as follows:
1. What sequences of interactions with the simulation lead to successful problem-solving performance?
2. How do the sequences of interactions with the simulation differ between FD and FI learners?
3. What are the learning difficulties that FD learners encounter during the problem-solving process with the simulation?
Evidently, traditional statistical techniques cannot provide the means for answering these questions, and, thus, the issue
becomes whether data mining, and in particular the sequence, association, and link analysis that was employed here, can
answer these questions in informative and useful ways for educational technology researchers.
2.2. Method
2.2.1. Participants
One hundred and fifteen freshmen from a teacher education department were recruited to participate in the study.
Students were initially screened based on their scores on the Hidden Figures Test (HFT; French, Ekstrom, & Price, 1963). The
HFT was used for identifying students' FD-I. The highest possible score on the HFT is 32 and the lowest zero. In accordance
with other research studies (Angeli & Valanides, 2004; Chen & Macredie, 2002; Daniels & Moore, 2000; Khine, 1996), the cut-
off points for this study were set to two levels of FD-I, namely FD and FI. Students who scored 18 or lower on the HFT were
classified as FD learners, while students who scored 19 or higher were classified as FI. Of the 115 students, 45 were
found to be FI learners, and the remaining 70 FD. Of the 115 participants, 94 (82%) were females, and 21 (18%) males. The
average age of the participants was 17.86 years (SD = 0.45). All students had basic computing skills, but no prior experience
with problem solving with simulations.
2.2.2. The simulation task
All research participants were asked to interact with a glass-box simulation that was specifically developed for the pur-
poses of this study, in order to solve a problem about immigration policy. The researchers explained to the participants that
nowadays a lot of people move from one country to another in search of a better life for their children and themselves.
Students were given a scenario about people from country A who wanted to move to country B due to a high unemployment
rate in country A. The students had to interact with the simulation in order to test hypotheses, and, decide about whether and
under what conditions country B could accept immigrants from country A.
The underlying model of the glass-box simulation is depicted in Fig. 1. The model shows how an increase in the number of
births in country A will cause an increase in the population of country A. This, in turn, and provided that not enough
employment opportunities are created in the interim to cover the new demands for employment in country A, will eventually
lead to an increase in the unemployment rate of country A. In contrast, an increase in the number of deaths in country A will
eventually cause a decrease in the unemployment rate of country A. In the case of an increase in the unemployment rate of
country A, people from country A will eventually seek employment in another country - country B. A movement of people
from country A to country B will eventually cause an increase in the unemployment rate of country B, if country B does not
create in the meantime enough employment opportunities to cover the increased demand for employment. The model shows
how an increase in the number of businesses in country B will cause a decrease in country's B unemployment rate, while a
movement of businesses from country B to A will cause a decrease in country's A unemployment rate, but in the long run a
possible increase in country's B unemployment rate. In total, the tool simulated the phenomenon of immigration using five
independent variables, namely number of births in country A, number of births in country B, number of deaths in country A,
number of deaths in country B, and movement of businesses from country B to country A. The students had to change the
values of the independent variables one at a time to observe the effects on the dependent variables in order to decide, and,
propose in writing if and under what conditions country B could possibly accept immigrants from country A.

Fig. 1. The underlying model about immigration policy of the glass-box simulation.
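A rough discrete-time rendering of the dynamics just described might look like the following sketch. All coefficients and starting values are invented, since the paper does not publish the simulation's equations; only the causal directions follow the text.

```python
# Hypothetical discrete-time sketch of the immigration model:
# births raise population and unemployment in A, some of A's jobless
# migrate to B, and new jobs lower unemployment in each country.
def step(state, births_a=1000, deaths_a=400, new_jobs_a=300, new_jobs_b=500):
    pop_a, unemployed_a, unemployed_b = state
    pop_a += births_a - deaths_a
    unemployed_a += births_a - deaths_a - new_jobs_a  # labor demand gap
    migrants = max(0, int(0.1 * unemployed_a))        # assume 10% of jobless move
    unemployed_a -= migrants
    pop_a -= migrants
    unemployed_b = max(0, unemployed_b + migrants - new_jobs_b)
    return pop_a, unemployed_a, unemployed_b

state = (100_000, 5_000, 2_000)  # invented starting values
for year in range(5):
    state = step(state)
    print(f"year {year + 1}: pop_A={state[0]}, unemp_A={state[1]}, unemp_B={state[2]}")
```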
When the learners run the model, the simulation opens a meter for each dependent and independent variable. As shown
in Fig. 2, each meter displays the initial value of each variable and the range of values it can take. At each run time, the learner
can change the value of one independent variable at a time and observe how the meters of the affected dependent variables
change.
2.2.3. Research instruments
2.2.3.1. Hidden figures test. The Hidden Figures Test (HFT) was administered to determine research participants' field type
(French et al., 1963). The test consists of two parts, and each part contains 16 questions. The time allotted for answering each
part is 12 min. The scores on the HFT range from zero to 32. Basically, each question on the HFT presents five simple geometric
figures and a more complex one. Students are instructed to discover which one of the five simpler figures is embedded in the
more complex one. According to Rittschof (2010), the HFT is the most reliable and widely used test for measuring FD-I. It is
also highly correlated with the Group Embedded Figures Test (r = 0.67–0.88), another popular test for determining FD-I
(Witkin, Oltman, Raskin, & Karp, 1971).
2.2.3.2. Assessment rubric. A rubric that was inductively constructed was used to assess the quality of learners' written an-
swers to the immigration problem. The scoring rubric assessed three levels of quality ranging from 1 (poor quality) to 3 (high
quality). The specific criteria for each level are shown in Table 1. Two independent raters evaluated students' answers to the
immigration problem, and Cohen's kappa was used to measure interrater reliability. A satisfactory interrater reliability of
κ = 0.87 was computed, while noted discrepancies between the two raters were resolved after discussion.
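For reference, this is how such an interrater check can be computed. The rater scores below are invented, whereas the study reports κ = 0.87 on its real data.

```python
# Sketch of the interrater-reliability computation: Cohen's kappa
# over two raters' rubric scores (1-3). Scores are invented.
from sklearn.metrics import cohen_kappa_score

rater1 = [3, 2, 1, 2, 3, 1, 2, 2, 3, 1]
rater2 = [3, 2, 1, 2, 3, 1, 2, 3, 3, 1]

print(round(cohen_kappa_score(rater1, rater2), 2))
```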
Fig. 2. Simulation run.
Table 1
Rubric for assessing the quality of learners' answers.
3 - High
a. The learner's answer is based on a correct interpretation of the simulated outcomes.
b. The learner's answer takes into consideration pros and cons of different possible answers.
c. The learner's answer takes into consideration possible long-term effects.
2 - Medium
a. The learner's answer is based on a correct interpretation of the simulated outcomes.
b. The learner's answer takes into consideration pros and cons of different possible answers.
c. The learner's answer does not take into consideration possible long-term effects.
1 - Poor
a. The learner's answer is not based on a correct interpretation of the simulated outcomes.
b. The learner's answer does not take into consideration pros and cons of different possible answers.
c. The learner's answer does not take into consideration possible long-term effects.
2.2.4. Research procedures
Research data were collected in three different sessions. During the first 25-min research session, the researchers
administered the HFT in order to determine learners' field type. In a follow-up 60-min session, the researchers demonstrated
a glass-box simulation, different than the one that was used for collecting research data for this study, and, showed how to use
it in order to solve a problem. The students interacted with the simulation individually in order to explore various problem-
solving scenarios and learn how to control variables. The researchers explicitly explained the differences between dependent
and independent variables, and, demonstrated how changes in the independent variables affected the dependent variables.
During the last 60-min session, the researchers collected the data that were used for the analyses of this study. During the
session, the participants interacted with the glass-box simulation, observed, organized, and interpreted the simulated out-
comes of the system for the purpose of solving the problem about immigration policy.
2.2.5. Data structure and analysis
Students' interactions with the simulation were captured into video files with River Past Screen Recorder, a screen
capturing software. Each video file had an average duration of 50 min and a size of about 4GB. A scheme was used for coding
learners' interactions in a log file, which took the form of a table with three columns including Student_ID, Time, and Action.
Student_ID referred to students' research ID number, Time denoted the start/end time of an event, and Action described what
the interaction entailed in terms of a sequence of computer actions. The total number of entries in this table/log file, which
constituted the data for the data mining analysis, was 4570. Regarding the Action field in the data table, the simulation
afforded five computer actions that the students could employ in order to explore the relationships between all dependent
and independent variables, as depicted in Fig. 1, in order to decide if and under what conditions country B could accept
immigrants from country A. The first action was about displaying all variables and the relationships amongst them, as
represented in the model shown in Fig. 1. The second was about using the test tools in order to run the simulation. The third
was about opening the meter of each variable to change the values of the independent variables while observing at the same
time the effects on the dependent variables. The fourth was about using the play button for running the simulation, and,
lastly, the stop button for stopping the simulation. Thus, the following computer interactions were coded: B for viewing all
variables and the relationships between them; T for accessing the test tools needed for a simulation test; M for opening the
meter of each variable; P for running/playing the simulation; and S for terminating/stopping the simulation. Additionally, the
codes IV1, IV2, IV3, IV4, and IV5 were used for denoting the five independent variables.
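To make the data structure concrete, the following Python sketch (with hypothetical entries) shows one way such a three-column log could be represented and flattened into per-student action sequences for mining; the study itself stored the log as a table, not as code.

# A minimal sketch of the log table described above (entries are hypothetical).
# Codes: B = view variables/relationships, T = test tools, M = open a variable
# meter, P = play/run the simulation, S = stop it; IV1-IV5 = independent variables.
from dataclasses import dataclass

@dataclass
class LogEntry:
    student_id: str   # research ID number
    time: str         # start/end time of the event
    action: str       # coded sequence of computer actions

log = [
    LogEntry("S01", "00:00:05-00:00:31", "B"),
    LogEntry("S01", "00:00:31-00:01:02", "M IV1"),
    LogEntry("S01", "00:01:02-00:02:10", "T P S"),
]

# Flatten each student's entries into one action sequence for mining.
sequences = {}
for entry in log:
    sequences.setdefault(entry.student_id, []).extend(entry.action.split())
print(sequences)  # {'S01': ['B', 'M', 'IV1', 'T', 'P', 'S']}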
A sequence, association, and link analysis (Nisbet, Elder, & Miner, 2009) was used in order to identify unique differences
between the FD and FI learners. Specifically, the sequence, association, and link analysis was used for extracting association
rules in order to determine which simulation actions were closely associated together. The technique was also used for
extracting an immediate subsequent action given a previous one, and for mining patterns of interaction between individuals
of different field types and computer actions. In association rules mining, relationships and patterns are expressed in the form
of an association rule:
If A then (likely) C
Each rule includes an antecedent (A) and a consequent (C). This can be understood as “IF A then C.” Rules may contain single
or multiple antecedents and consequents, such as “IF A and B, then C.” The importance of a rule is determined through critical
measurements: support, confidence, and lift (Tan, Kumar, & Srivastava, 2004). The extent to which the antecedent(s) and
consequent(s) occur simultaneously in the dataset is indicated through support. The extent to which the consequent(s) oc-
cur(s) given the antecedent(s) is indicated through confidence. The correlation between the antecedent(s) and consequent(s)
is indicated through lift. For the two sequence, association, and link analyses that were performed, the minimum support was
set to 0.55 and the confidence level to 0.95.
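As an illustration of these three measures, the following Python sketch computes support, confidence, and lift for a toy rule over invented transactions; the study itself used Statistica rather than hand-written code.

# A toy computation of the three rule measures defined above (the transactions
# are invented action sets, not the study's data).
transactions = [
    {"B", "T", "P"},
    {"B", "M"},
    {"T", "P", "S"},
    {"B", "T", "P", "S"},
]

def freq(itemset):
    # Fraction of transactions containing every item in `itemset`.
    return sum(itemset <= t for t in transactions) / len(transactions)

def rule_measures(antecedent, consequent):
    support = freq(antecedent | consequent)   # A and C occur together
    confidence = support / freq(antecedent)   # how often C occurs given A
    lift = confidence / freq(consequent)      # correlation between A and C
    return support, confidence, lift

s, c, l = rule_measures({"T"}, {"P"})
print(f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")  # 0.75, 1.00, 1.33

A rule with lift above 1, as here, indicates that the consequent occurs more often with the antecedent than its base rate would predict, which is why lift is read as a correlation measure.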
The authors employed Statistica Data Miner for conducting the sequence, association, and link analyses. While we
experimented with a number of other data mining tools, we ended up using Statistica, because compared to other tools we
found it easier to use in preparing the data for mining, as well as easier to integrate with the R programming environment.
Statistica Sequence, Association, and Link Analysis is an implementation of several advanced techniques designed for mining
rules from datasets that are generally described as "market baskets". The "market-basket" metaphor assumes that customers buy products either in a single transaction or in a sequence of transactions; a transaction may also relate a subsequent purchase of a product or products to a previous one. For example, a purchase of flashlights usually coincides with a purchase of batteries in the same basket. In education, the "market-basket" metaphor can be applied to situations where individuals engage in different actions during learning with others or with a computer system. The analysis reveals items in a dataset that occur together, extracting patterns and associations between individuals and actions.
2.3. Results and discussion
The mean quality of FD learners' answers to the immigration problem was found to be 1.43 (SD = 0.63), while the mean quality of FI learners' answers was 2.10 (SD = 0.75). The time that FD and FI learners spent with the simulation was also measured, and no significant differences were found between the two groups …
ORIGINAL PAPER
Public Internet Data Mining Methods in Instructional Design,
Educational Technology, and Online Learning Research
Royce Kimmons1 & George Veletsianos2
Published online: 7 June 2018
© Association for Educational Communications & Technology 2018
Abstract
We describe the benefits and challenges of engaging in public data mining methods and situate our discussion in the context of
studies that we have conducted. Practical, methodological, and scholarly benefits include the ability to access large amounts of
data, randomize data, conduct both quantitative and qualitative analyses, connect educational issues with broader issues of
concern, identify subgroups/subpopulations of interest, and avoid many biases. Technical, methodological, professional, and
ethical issues that arise by engaging in public data mining methods include the need for multifaceted expertise and rigor, focused
research questions and determining meaning, and performative and contextual considerations of public data. As the scientific
complexity facing research in instructional design, educational technology, and online learning is expanding, it is necessary to
better prepare students and scholars in our field to engage with emerging research methodologies.
Keywords: Public internet data mining · Innovative methods
Data mining of the public internet has been an emerging
research method for the past two decades as it has been ap-
plied to a variety of fields to help solve persistent problems
like developing webpage recommender systems (Niwa et al.
2006), combating infectious diseases (Brownstein et al.
2008), identifying cybersecurity threats (Maloof 2006), im-
proving network traffic (Wang et al. 2002), and predicting
political orientations (Colleoni et al. 2014), just to name a
few. Previous work has pointed out some of the technical
opportunities and challenges of such methods (Andersen
and Feamster 2006), but public internet data mining has not
yet been widely applied to addressing issues facing the field
of instructional design and technology (IDT), and we do not
fully understand the benefits and challenges of its application
to our field. Furthermore, though some data mining methods
are eagerly being applied in the realms of learning analytics
and data dashboard visualization (Baker and Inventado
2014), we have not as a field begun exploring the potentials
and ramifications of using massive amounts of disorganized,
publicly-available data to address persistent IDT challenges
or determining how we must train new professionals to make
use of the wealth of data available to them via the public
internet. Data mining of the public Internet affords IDT researchers the ability to answer important questions that they have heretofore been either unable to answer or unable to explore using non-invasive methods on a large scale. To illustrate, in Table 1 we provide a list of potential questions of interest to IDT that researchers may be able to address using data mining methods.
Over the last two years, we (the two authors of this paper)
have conducted more than 10 studies using public data mining
methods in IDT. These studies included extracting and ana-
lyzing publicly-available data from Websites (e.g., K-12
websites), social media (e.g., Twitter), and discussion fora
(e.g., YouTube comments). They generated massive datasets
and allowed us to conduct research pertaining to technology
use, social media prevalence, equity, and civility in online
discussions. In this paper, we will describe the benefits and
challenges we encountered while engaging in public data min-
ing and situate our discussion in the context of studies that we
Royce Kimmons and George Veletsianos contributed equally to this work.
* Royce Kimmons
[email protected]
George Veletsianos
[email protected]
1 Brigham Young University, 150J MCKB, BYU, Provo, UT 84602,
USA
2 Royal Roads University, 2005 Sooke Rd, Victoria, BC V9B 5Y2,
Canada
TechTrends (2018) 62:492–500
https://doi.org/10.1007/s11528-018-0307-4
have conducted in order to present authentic examples of the
ways that public data mining can be used in our field.
As we have put processes in place to collect and analyze
public social media data, we have reached out to colleagues
and secured funding for graduate students at other universities
to conduct collaborative work with us. To date, we have col-
laborated with 17 scholars on these projects representing 10
universities in the U.S. and Canada, and our collaborators have
included undergraduate, master’s, and doctoral students as well
as tenure-track faculty. These efforts have allowed us to take on
the role of mentors in public data mining methods to our col-
leagues, to expand the horizons of our own research, and to
train young researchers in these emerging methods. By doing
so, we have identified a curricular need facing our field that we
will also discuss here. While the practice of IDT traditionally
involves multidisciplinary collaboration (e.g., instructional de-
signers, subject matter experts, assessment experts, and faculty
may collaborate to create an educational intervention), the sci-
entific complexity facing IDT research and practice is increas-
ingly expanding. For instance, the infusion of technology in all
aspects of education has provided access to a deluge of digital
data that was previously unfathomable (Selwyn 2015), and in-
structional designers may nowadays collaborate with even
more actors, such as data scientists and learning analytics re-
searchers. Thus, it is necessary for researchers in our field to
explore and understand emerging research methodologies. This
paper will conclude by arguing that doctoral preparation pro-
grams in our field should include interdisciplinary methodolog-
ical training for IDT researchers as a core component.
Some Benefits of Public Internet Data Mining
As interest in data mining takes hold in many industries, from
healthcare to e-commerce, education researchers have started
exploring the ways that both large and public datasets can con-
tribute to making sense of issues facing educational practice and
the science of learning. While substantial literature exists on the
use of learning analytics in education (e.g., in Massive Open
Online Course [MOOC] contexts), much less is written about
the use of public online data. The benefits or opportunities that
mining of public Internet data engenders are numerous. These
opportunities are practical and methodological, as well as schol-
arly. We organize these in the following themes:
- providing large amounts of data and allowing easy randomization;
- empowering both quantitative and qualitative analyses;
- connecting educational issues with larger public issues;
- enabling identification of subgroups/subpopulations for further research;
- and avoiding many biases.
Providing Large Amounts of Data
The data generated by contemporary Internet platforms, and
made available to researchers through various means, are un-
precedented. For instance, the data associated with posting
one single tweet includes information about the person post-
ing the tweet (e.g., username, name, biographic information,
location, account creation date, and various statistics associat-
ed with the account holder such as total tweets posted and total
followers), data associated with the actual tweet (e.g., the text
of the tweet, the hashtags included in the text of the tweet, the
time it was posted, the location associated with the device it
was posted from, the application used to post the tweet, and
various metrics associated with it such as number of times this
particular tweet was retweeted), and similar data for any other
accounts interacting with that particular tweet. In other words,
a single tweet is associated with copious data points that IDT
researchers have rarely seen. This data deluge present in
Twitter is typical of online platforms. A similar situation exists
with a variety of platforms that are used for teaching, training,
and learning purposes (e.g., blogs, YouTube, Reddit, public
websites, etc.). To illustrate the magnitude of the data available, in a recent paper we sought to investigate time patterns in social media use (Veletsianos et al. under review) and were able to identify a sample of academics on Twitter (n = 3996) and retrieve more than 9 million tweets they posted along with associated metadata, yielding more than 100 million raw data points.

Table 1 A selection of typical IDT research questions that may be answered via data mining methods

Research question | Public internet data source
What sorts of IDT skills do employers require? | Job ad postings
What challenges do teachers face in integrating technology in K-12 classrooms? | Public discussion forums
What kinds of peer support do online learners provide to one another? | Discussion forums found in public online courses
In what ways are particular web-based technologies used in K-12 courses? | Blog networks, wiki networks, etc.
How do instructional designers describe the field to others? | Personal portfolios, discussion forums
What motivates individuals to contribute to informal learning communities? | Discussion forums
What sentiments does the public express toward particular educational and technological innovations (e.g., MOOCs, artificial intelligence, online education, adaptive learning, etc.)? | Discussion forums, newspaper comments
What is the relationship between demographic variables (e.g., gender) and achievement in STEM courses? | Secondary data made available in public repositories
Good data enable researchers to answer the research questions
they pose. While abundant data are not synonymous with
good data, large amounts of data provide a number of oppor-
tunities for IDT researchers. Large-scale data allow re-
searchers to examine whether the results generated by
smaller-scale studies (e.g., case studies) hold up to scrutiny,
investigate questions that can only be answered by larger
datasets (e.g., investigations of populations vis-a-vis samples),
and enable investigations of samples drawn at random from
large populations.
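As a concrete illustration of how such per-tweet records can be harvested, the sketch below uses the tweepy library (version 3.x API) with placeholder credentials and a hypothetical account handle; it is not the authors' actual collection pipeline.

# A hedged sketch of per-account tweet collection with tweepy 3.x (placeholder
# credentials; the handle is hypothetical, not from the study's sample).
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

rows = []
for status in tweepy.Cursor(api.user_timeline,
                            screen_name="example_academic",
                            tweet_mode="extended").items(200):
    rows.append({
        "text": status.full_text,                 # tweet text
        "created_at": status.created_at,          # supports time-pattern analyses
        "retweets": status.retweet_count,         # engagement metadata
        "followers": status.user.followers_count, # account-level metadata
    })
print(f"collected {len(rows)} tweet records")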
Empowering Both Quantitative and Qualitative
Analyses
Though data mining is often associated with analyses involv-
ing quantitative data, mining the public internet enables re-
searchers to collect and analyze both quantitative and qualita-
tive data. This method, therefore, accommodates a diverse
range of research questions, data analysis methods, and ap-
proaches. In other words, as part of the IDT researcher’s meth-
odological toolkit, data mining methods may enable the col-
lection and analyses of different kinds of data in relation to the
research questions being asked. Such versatility is important
because it enables IDT researchers to use data mining methods
across research paradigms, enabling the use of qualitative data
to generate detailed and rich descriptions of phenomena, as
well as the use of quantitative data to draw generalizable con-
clusions. For example, in investigating ways to scaffold stu-
dent learning when interacting with a chatbot, data mining
methods may enable IDT researchers to (a) code student
prompts in order to develop a taxonomy of help-seeking ques-
tions, and (b) compute the frequency with which students ask
different types of questions.
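A minimal sketch of that two-step chatbot example, with invented prompts and category labels, might look as follows; the taxonomy shown is illustrative, not one proposed by the authors.

# Hand-coded help-seeking categories tallied into frequencies
# (prompts and taxonomy are invented).
from collections import Counter

coded_prompts = [
    ("How do I cite a source?", "procedural"),
    ("Why is my answer wrong?", "clarification"),
    ("Can you give me a hint?", "hint-seeking"),
    ("Can you show another example?", "hint-seeking"),
]

for category, n in Counter(code for _, code in coded_prompts).most_common():
    print(f"{category}: {n}")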
To illustrate, we were interested in examining the ways
higher education institutions used social media for educational
purposes with students and the broader public (Kimmons et al.
2017b; Veletsianos et al. 2017). In order to explore this topic,
we gathered quantitative data (e.g., number of tweets posted)
and qualitative data (e.g., individual tweets and images) asso-
ciated with the Twitter accounts of Canadian and US univer-
sities. We computed new variables using these data (e.g., num-
ber of replies, replies as a proportion of all tweets, number of
tweets that included audiovisual elements) and also conducted
descriptive, inferential, and qualitative analyses on them.
Using this dataset, quantitative analyses enabled us to identify
that higher education institutions in both countries mostly
used Twitter to broadcast information rather than engage in
dialogue. Qualitative analysis of a sample of tweets enabled us
to discover that those broadcasted messages portrayed an
overwhelmingly positive picture of institutional life. In other
words, quantitative analyses enabled us to discover the fre-
quency and type of Twitter use, while qualitative analyses
allowed us to describe what such participation looked like.
Data mining enabled us to develop a multi-layered under-
standing of institutional social media use, highlighting a find-
ing that is core to IDT, namely that technologies are rarely
neutral in their use (e.g., Twitter prompts users to broadcast
messages) and that they can be appropriated to serve different
needs (e.g., Twitter seemed to be used for promotion rather
than educative purposes).
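To illustrate the kind of derived variables mentioned above (e.g., replies as a proportion of all tweets), here is a small pandas sketch over invented rows; the column names and values are hypothetical, not the study's dataset.

# A sketch of computing derived per-institution variables with pandas.
import pandas as pd

tweets = pd.DataFrame({
    "institution": ["U1", "U1", "U1", "U2", "U2"],
    "is_reply":    [False, True, False, False, False],
    "has_media":   [True, False, True, True, False],
})

per_institution = tweets.groupby("institution").agg(
    total_tweets=("is_reply", "size"),
    replies=("is_reply", "sum"),
    media_tweets=("has_media", "sum"),
)
# Replies as a proportion of all tweets, one of the variables described above.
per_institution["reply_proportion"] = per_institution["replies"] / per_institution["total_tweets"]
print(per_institution)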
Connecting Educational Issues with Larger Public
Issues
One of the pressing challenges facing our field is in pursuing
an understanding of sociocultural and public issues pertaining
to education, teaching, learning, scholarship, and technology
(Veletsianos and Kimmons 2012). Such issues may involve
access, equity, civility, socioeconomic divides, and
sociotechnical issues (e.g., the impact of social media algo-
rithms on opportunities for informal learning). While some of
the field’s research examines issues of broader concern, by
and large the focus is on pedagogical applications of technol-
ogy, with little attention being paid to the social, cultural, and
political aspects and implications of instructional design and
educational technology use. We need to pay close attention to
these issues because of their societal significance and impli-
cations for practice. What is the public concerned about with
regards to teaching and learning? In what ways can IDT re-
imagine teaching and learning on a massive scale? In what
ways are racism and sexism evident in our designs and edu-
cational offerings, and what does the field need to do in order
to alleviate these problems? We believe that these types of
questions (amongst many others) should be central to the field
for they aim toward developing a more just and fair society.
Public Internet data mining methods may provide
opportunities for researchers to examine societal issues of
broad concern, and enable the field to take a more active
role in societal conversations of interest. For instance, in the
same way that Rowe (2015) examined (in)civility in online
political discussions occurring on the Washington Post
Facebook account, IDT researchers might use data mining
methods to investigate (in)civility on public platforms hosting
educational interactions such as CrashCourse and Physics
Girl on YouTube and develop ways to address this problem.
To illustrate how IDT research can be connected to issues
of broader concern via data mining, consider the research we
reported in Authors (2018). In that study, we sought to connect
the educational uses of YouTube to gender issues. While typ-
ical IDT research might examine the pedagogical
implications, opportunities, promises, drawbacks, and
affordances of video-sharing technologies, we were interested in the sentiment that individuals faced when they were asked to go online to share their research or to post their course assignments. We were also interested in examining whether different
people faced different sentiment. By examining the sentiment
expressed in response to TEDx and TED-Ed talks posted on
YouTube we found that videos of male presenters showed
greater neutrality, while videos of female presenters saw sig-
nificantly greater polarity in replies. Such findings have sig-
nificant implications for our field, because they question the
oft-repeated optimistic narratives of contemporary technolo-
gies as necessarily positive for all people.
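As one way to operationalize such a polarity comparison, the sketch below scores invented replies with the open-source VADER analyzer; the groups, texts, and measures are illustrative and not the study's actual method or data.

# Illustrative polarity comparison with the open-source VADER analyzer
# (pip install vaderSentiment); replies and groups are invented.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
replies_by_group = {
    "group_a": ["Great talk!", "Very clear explanation.", "Nice overview."],
    "group_b": ["Brilliant!", "This is terrible.", "Amazing speaker!"],
}

for group, replies in replies_by_group.items():
    scores = [analyzer.polarity_scores(r)["compound"] for r in replies]  # -1..+1
    mean = sum(scores) / len(scores)
    spread = max(scores) - min(scores)  # wider spread = greater polarity
    print(f"{group}: mean={mean:+.2f} spread={spread:.2f}")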
Enabling Identification of Subpopulations for Further
Research
Due to the massive amounts of data available online, public
Internet data mining methods enable researchers to identify
particular subpopulations for further inquiry. Granular ap-
proaches to identifying participants are important, because they
enable researchers to focus on typical, unique, or otherwise
significant subpopulations of interest. For instance, consider-
ing Twitter as a platform of interest, data mining methods
enable researchers to identify and study IDT issues pertaining
to professors who tweet frequently (e.g., Kimmons and
Veletsianos 2016), educators who engage with a particular top-
ic or affinity space (e.g., Paskevicius et al. 2018; Veletsianos
2017b), community members who comment on educational
content (e.g., Veletsianos et al. in press), doctoral students
who have a large number of followers, teachers who reside in
a particular geographic area, faculty members who mention
their teaching evaluations, undergraduate engineering students
who tweet about positive/negative learning experiences, or
IDT faculty who attend both IDT and Learning Sciences con-
ferences. Further, the identification of specific subpopulations
enables comparisons between groups. For instance, one could
examine whether there are differences between science stu-
dents’ perceptions of positive learning experiences and human-
ities students’ perceptions of said experiences.
In one of our research studies, we sought to understand
how the content MOOC participants post on social media
varies with the role they espouse (Veletsianos 2017a). After
identifying a MOOC provider that included hashtags with
every course offering, we examined what messages were
posted to the course hashtags and how those varied by user
role. Following traditional content analysis methods and cat-
egorization according to roles, we identified variations in the
messages posted by different groups of users. For instance, we
found that institutions and the MOOC provider posted more
promotional messages than faculty and learners, while
MOOC-dedicated accounts and instructors posted more in-
structional messages. Such results highlight the need for
looking deeper into participant subpopulations to identify
and examine the differential practices that subpopulations
may employ, especially in the context of open-ended and flex-
ible learning environments.
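A role-by-category cross-tabulation is one simple way to surface such variations between subpopulations; the sketch below uses pandas with invented rows, not the study's coded messages.

# A sketch of a role-by-category cross-tabulation (rows invented).
import pandas as pd

messages = pd.DataFrame(
    [
        ("provider", "promotional"),
        ("institution", "promotional"),
        ("instructor", "instructional"),
        ("mooc_account", "instructional"),
        ("learner", "conversational"),
    ],
    columns=["role", "category"],
)
print(pd.crosstab(messages["role"], messages["category"]))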
Avoiding Many Biases
It is widely recognized and acknowledged that conscious and
unconscious biases have significant impacts in research out-
comes. To mention a few, such biases might include
Hawthorne effects (e.g., a teacher engages in behaviors per-
ceived to be desired by a researcher observing their instruc-
tion), self-reporting biases (e.g., a student provides biased self-
assessed measures of the time they spent studying for an ex-
am), and self-selection biases (e.g., faculty in support of open
access publishing in IDT self-select to participate in a study
examining open access publishing in the field). Such biases
adversely affect our understanding of issues related to IDT,
and, even though researchers are trained to recognize and
account for them, we are not always able to control for them.
Public Internet data mining approaches avoid many such
biases. For instance, researchers are able to unobtrusively ob-
serve behavior in situ, mitigating the potential for Hawthorne
effects, and self-reporting and self-selection biases. As an ex-
ample, our investigation of the types of messages posted by IDT
departments on social media sites (Romero-Hall et al. 2018)
relied on identifying and categorizing the actual messages al-
ready posted by IDT departments online. Thus, IDT department
behavior was not impacted by virtue of the study being con-
ducted, and self-reporting and self-selection biases were
avoided because all available actual messages were collected
and analyzed rather than depending on analyzing IDT depart-
ments’ perceptions about those messages. It is important to note,
however, that it is impossible to account for all potential biases.
For instance, in the aforementioned study, results are based on
the sample of IDT departments identified, and the methods used
to identify the specific departments to include in the study may
have led to some departments being included/excluded.
Some Challenges of Public Internet Data
Mining
Despite these benefits, public internet data mining as a re-
search method presents a variety of noteworthy challenges.
These challenges revolve around technical, methodological,
professional, and ethical issues that arise from using massive
amounts of public observation data from people and organi-
zations. We have organized these challenges into the four
following themes:
- multifaceted expertise and rigor requirements;
- focused questions and determining meaning;
- performative and contextual considerations of public data;
- and emergent ethical dilemmas.
Multifaceted Expertise and Rigor Requirements
The first challenge and largest barrier to entry for most edu-
cation researchers who might have an interest in public inter-
net data mining is that collecting, cleaning, organizing, and
analyzing these data at any scale relies upon various technical
skills that are interdisciplinary (at best) or not taught at all in
most education research programs. This is in part due to the
relative newness and ever-evolving nature of the internet (e.g.,
the emergence of APIs) but is also due to the siloed and spe-
cializing nature of the academy, which requires education re-
searchers to utilize increasingly specialized methods of inqui-
ry in order for their work to be considered valid. For instance,
researchers who have already devoted years to becoming ex-
pert at phenomenological inquiry or structural equation
modeling might understandably be slow to venture into a
new realm of inquiry that might require them to learn equally
specialized technical methods such as website scripting, API
querying, tokenization, and so forth. In the reverse situation,
however, web developers, data scientists, and internet market-
ing professionals might have a variety of skills necessary to do
public internet data mining, but they will equally lack the
content area expertise necessary to ask meaningful questions
of the data and will make various assumptions about educa-
tional phenomena, institutions, and stakeholders that are con-
troversial, unwarranted, or just wrong. Thus, especially in the
case of small-budget projects (such as theses and disserta-
tions), it becomes very difficult for a single researcher or even
a small group of researchers to have all of the expertise nec-
essary to do this kind of work in a way that will be viewed as rigorous by education, web development, and data science communities alike.
To illustrate some of the expertise required, we will briefly
explain some of the data collection steps that we undertook in a
recent study of U.S. university Twitter accounts (see Kimmons
et al. 2017a, 2017b for a complete explanation of all steps
undertaken). After identifying two pre-existing lists of univer-
sity websites, we used keyword identifiers and manual coding
to merge the lists into a relational database to match Carnegie
classifications with university website addresses. We then
wrote a series of scripts that systematically opened and parsed
the contents of all the university website homepages, searching
for embedded Twitter feeds, links, or keyword references to an
institutional Twitter account (e.g., "Follow us @OurUniversity"). The script stored all referenced accounts in the relational database with a unique university identifier. Another script we wrote queried the Twitter REST API, retrieving the Twitter user objects for all university accounts and
storing them in the relational database. Next, we read through
all account information (e.g., screen name, location, descrip-
tion) and manually coded accounts as either the primary insti-
tutional account or other (e.g., athletics department, registrar).
This resulted in a maximum of one primary institutional
Twitter account for each university (n = 2411), and we exclud-
ed other accounts from further analysis. We then wrote another
set of scripts to again query the Twitter REST API for all
available Twitter activity for each account and stored returned
tweet objects in the relational database (n = 5.7 million tweets).
Following these data collection steps, we developed scripts to
clean the data, developed scripts to identify multimedia in
tweets, used an open-source sentiment analyzer, operationalized
items of theoretical interest, identified representative samples,
and conducted descriptive, inferential, and content analyses.
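To give a flavor of just one of these steps, the sketch below scans a hypothetical university homepage for candidate Twitter handles with requests and a regular expression; the authors' actual scripts, database schema, and coding procedures are not reproduced here.

# A hedged sketch of one collection step: scanning a homepage for candidate
# institutional Twitter handles (the URL is hypothetical; matches still need
# the manual coding step described above).
import re
import requests

html = requests.get("https://www.example-university.edu", timeout=30).text

handles = set(re.findall(r"twitter\.com/([A-Za-z0-9_]{1,15})", html))
handles |= set(re.findall(r"@([A-Za-z0-9_]{1,15})", html))  # may also catch emails; review manually

print(sorted(handles))  # candidates for manual coding as primary vs. other accounts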
As this highly abridged narrative of some of the steps taken
suggests, this one study required many technical steps to com-
plete that required web scripting, quantitative analysis, quali-
tative coding, SQL querying, API querying, JSON parsing,
keyword searching, database management, image analysis,
sentiment analysis, and so forth. Furthermore, each study that is undertaken in this way may have many unique elements that prevent the development of a one-size-fits-all approach to data collection and analysis. These challenges may be alle-
viated most readily by building functional teams of re-
searchers (e.g., a web programmer, a quantitative methodolo-
gist, and a qualitative methodologist), but they also introduce
challenges of getting the work published, because just as it is
highly infeasible for one researcher to have all of the expertise
necessary to conduct a study like this, it is equally infeasible
that a single reviewer or editor can meaningfully evaluate a
completed study’s significance and rigor.
This last point is important for any researcher who is ex-
pected to publish their work in certain types of venues, be-
cause all journals have a niche audience and rely upon re-
viewers that have a unique set of beliefs, attitudes, and skills.
When submitting studies like the one described above to the
journals we are most interested in publishing in, we have
found that reviewers and editors typically come at the study
either from an education perspective (and thereby want to see
rich, meaningful results in terms of students’ and educators’
lives) or from a computer science or methodological perspec-
tive (and thereby want to see conformity to expected norms of
data collection and classification as well as methodological
insights). This can require the researcher to essentially serve two masters, wherein one wants more qualitative examples and less technical jargon while the other wants the opposite; the tension is exacerbated by word-limit requirements that essentially force the researcher to choose one over the other. We have
found that this issue must be navigated on a study-by-study
basis wherein the researchers must iteratively work with the
editor and reviewers to determine which elements of the study
should be emphasized and which elements can be effectively
summarized, placed in an online supplement, or ignored.
Focused Questions and Determining Meaning
Second, when working with a pre-existing, massive dataset
like the internet, as researchers it is sometimes difficult to
navigate the relationship between our research questions and
the data. The traditional social science research approach, for
instance, is for the research question to come first and for it to
guide the collection and analysis of our data. However, with a
pre-existing dataset this approach often feels inappropriate,
because the researchers are simultaneously constrained and
empowered by the parameters of the data, which may not
allow them to answer questions that they are interested in
but may also empower them to answer new questions that they
did not know were possible to answer. It has been our experi-
ence that often when embarking on these studies our initial
questions become reshaped or somewhat refined as we im-
merse ourselves in the data and contemplate their possibilities,
but at the same time this often leads to scope creep, wherein
we quickly try to tackle too much because we feel that the data
are so rich, and theoretical drift, wherein we move away from
our theoretically-grounded emphasis to focus on disconnect-
ed, emergent issues that we thought were novel and interest-
ing. Both scope creep and theoretical drift are problematic for
a variety of reasons not least of which is that they lead to
studies that overreach or that can delve into areas far outside
the researcher’s realm of expertise, and discerning audiences
are quick to point this out.
This situation has led us to enter these types of studies with
focused research questions at the outset and to be much more
careful in safeguarding against drastic changes late into the
research process. Though we feel that there should always
be some flexibility to refocus research questions in light of
emergent data issues, those embarking on studies like these
should never approach a massive dataset with a "we'll see
what the data can tell us" attitude, because the data are often
so rich that they can become more of a distraction than a tool
of inquiry.
A related issue is how we think about significance and
meaning and how our qualitative or quantitative traditions
might prepare us to approach massive pre-existing data in
inappropriate ways. For instance, in a traditional education
research study that employs a quasi-experimental design, a
researcher might study as …