5 pages with References in APA format homework assistance - Information Systems
Answer the following questions in 5 pages:
How do data and classifying data impact data mining?
What is association in data mining?
Select a specific association rule (from the text) and thoroughly explain the key concepts.
Discuss cluster analysis concepts.
Explain what an anomaly is and how to avoid it.
Discuss methods to avoid false discoveries.
This assignment should take into consideration all the course concepts in the book. Be very thorough in your response. The paper should be at least three pages in length and contain at least two peer-reviewed sources.
PANG-NING TAN
Michigan State University
MICHAEL STEINBACH
University of Minnesota
VIPIN KUMAR
University of Minnesota
and Army High Performance
Computing Research Center
Boston San Francisco New York
London Toronto Sydney Tokyo Singapore Madrid
Mexico City Munich Paris Cape Town Hong Kong Montreal
If you purchased this book within the United States or Canada you should be aware that it has been
wrongfully imported without the approval of the Publisher or the Author.
Acquisitions Editor Matt Goldstein
Project Editor Katherine Harutunian
Production Supervisor Marilyn Lloyd
Production Services Paul C. Anagnostopoulos of Windfall Software
Marketing Manager Michelle Brown
Copyeditor Kathy Smith
Proofreader Jennifer McClain
Technical Illustration George Nichols
Cover Design Supervisor Joyce Cosentino Wells
Cover Design Night & Day Design
Cover Image © 2005 Rob Casey/Brand X Pictures
Prepress and Manufacturing Caroline Fell
Printer Hamilton Printing
Access the latest information about Addison-Wesley titles from our World Wide Web site:
http://www.aw-bc.com/computing
Many of the designations used by manufacturers and sellers to distinguish their products
are claimed as trademarks. Where those designations appear in this book, and Addison-
Wesley was aware of a trademark claim, the designations have been printed in initial caps
or all caps.
The programs and applications presented in this book have been included for their
instructional value. They have been tested with care, but are not guaranteed for any
particular purpose. The publisher does not offer any warranties or representations, nor does
it accept any liabilities with respect to the programs or applications.
Copyright © 2006 by Pearson Education, Inc.
For information on obtaining permission for use of material in this work, please submit a
written request to Pearson Education, Inc., Rights and Contract Department, 75 Arlington
Street, Suite 300, Boston, MA 02116 or fax your request to (617) 848-7047.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval
system, or transmitted, in any form or by any means, electronic, mechanical, photocopying,
recording, or any other media embodiments now known or hereafter to become known,
without the prior written permission of the publisher. Printed in the United States of
America.
ISBN 0-321-42052-7
2 3 4 5 6 7 8 9 10-HAM-08 07 06
To our families
Preface
Advances in data generation and collection are producing data sets of mas-
sive size in commerce and a variety of scientific disciplines. Data warehouses
store details of the sales and operations of businesses, Earth-orbiting satellites
beam high-resolution images and sensor data back to Earth, and genomics ex-
periments generate sequence, structural, and functional data for an increasing
number of organisms. The ease with which data can now be gathered and
stored has created a new attitude toward data analysis: Gather whatever data
you can whenever and wherever possible. It has become an article of faith
that the gathered data will have value, either for the purpose that initially
motivated its collection or for purposes not yet envisioned.
The field of data mining grew out of the limitations of current data analysis
techniques in handling the challenges posed by these new types of data
sets. Data mining does not replace other areas of data analysis, but rather
takes them as the foundation for much of its work. While some areas of data
mining, such as association analysis, are unique to the field, other areas, such
as clustering, classification, and anomaly detection, build upon a long history
of work on these topics in other fields. Indeed, the willingness of data mining
researchers to draw upon existing techniques has contributed to the strength
and breadth of the field, as well as to its rapid growth.
Another strength of the field has been its emphasis on collaboration with
researchers in other areas. The challenges of analyzing new types of data
cannot be met by simply applying data analysis techniques in isolation from
those who understand the data and the domain in which it resides. Often, skill
in building multidisciplinary teams has been as responsible for the success of
data mining projects as the creation of new and innovative algorithms. Just
as, historically, many developments in statistics were driven by the needs of
agriculture, industry, medicine, and business, many of the developments in
data mining are being driven by the needs of those same fields.
This book began as a set of notes and lecture slides for a data mining
course that has been offered at the University of Minnesota since Spring 1998
to upper-division undergraduate and graduate students. Presentation slides
and exercises developed in these offerings grew with time and served as a basis
for the book. A survey of clustering techniques in data mining, originally
written in preparation for research in the area, served as a starting point
for one of the chapters in the book. Over time, the clustering chapter was
joined by chapters on data, classification, association analysis, and anomaly
detection. The book in its current form has been class tested at the home
institutions of the authors (the University of Minnesota and Michigan State
University) as well as several other universities.
A number of data mining books appeared in the meantime, but were not
completely satisfactory for our students, primarily graduate and undergraduate
students in computer science, but including students from industry and
a wide variety of other disciplines. Their mathematical and computer backgrounds
varied considerably, but they shared a common goal: to learn about
data mining as directly as possible in order to quickly apply it to problems
in their own domains. Thus, texts with extensive mathematical or statistical
prerequisites were unappealing to many of them, as were texts that required a
substantial database background. The book that evolved in response to these
students' needs focuses as directly as possible on the key concepts of data mining
by illustrating them with examples, simple descriptions of key algorithms,
and exercises.
Overview Specifically, this book provides a comprehensive introduction to
data mining and is designed to be accessible and useful to students, instructors,
researchers, and professionals. Areas covered include data preprocessing, vi-
sualization, predictive modeling, association analysis, clustering, and anomaly
detection. The goal is to present fundamental concepts and algorithms for
each topic, thus providing the reader with the necessary background for the
application of data mining to real problems. In addition, this book also pro-
vides a starting point for those readers who are interested in pursuing research
in data mining or related fields.
The book covers five main topics: data, classification, association analysis,
clustering, and anomaly detection. Except for anomaly detection, each of these
areas is covered in a pair of chapters. For classification, association analysis,
and clustering, the introductory chapter covers basic concepts, representative
algorithms, and evaluation techniques, while the more advanced chapter dis-
cusses advanced concepts and algorithms. The objective is to provide the
reader with a sound understanding of the foundations of data mining, while
still covering many important advanced topics. Because of this approach, the
book is useful both as a learning tool and as a reference.
To help the readers better understand the concepts that have been presented,
we provide an extensive set of examples, figures, and exercises. Bibliographic
notes are included at the end of each chapter for readers who are
interested in more advanced topics, historically important papers, and recent
trends. The book also contains a comprehensive subject and author index.
To the Instructor As a textbook, this book is suitable for a wide range
of students at the advanced undergraduate or graduate level. Since students
come to this subject with diverse backgrounds that may not include extensive
knowledge of statistics or databases, our book requires minimal prerequisites:
no database knowledge is needed and we assume only a modest background
in statistics or mathematics. To this end, the book was designed to be as
self-contained as possible. Necessary material from statistics, linear algebra,
and machine learning is either integrated into the body of the text, or for some
advanced topics, covered in the appendices.
Since the chapters covering major data mining topics are self-contained,
the order in which topics can be covered is quite flexible. The core material
is covered in Chapters 2, 4, 6, 8, and 10. Although the introductory data
chapter (2) should be covered first, the basic classification, association analy-
sis, and clustering chapters (4, 6, and 8, respectively) can be covered in any
order. Because of the relationship of anomaly detection (10) to classification
(4) and clustering (8), these chapters should precede Chapter 10. Various
topics can be selected from the advanced classification, association analysis,
and clustering chapters (5, 7, and 9, respectively) to fit the schedule and in-
terests of the instructor and students. We also advise that the lectures be
augmented by projects or practical exercises in data mining. Although they
are time consuming, such hands-on assignments greatly enhance the value of
the course.
Support Materials The supplements for the book are available at Addison-Wesley's
Website www.aw.com/cssupport. Support materials available to all
readers of this book include
• PowerPoint lecture slides
• Suggestions for student projects
• Data mining resources such as data mining algorithms and data sets
• On-line tutorials that give step-by-step examples for selected data mining
  techniques described in the book using actual data sets and data analysis
  software
Additional support materials, including solutions to exercises, are available
only to instructors adopting this textbook for classroom use. Please contact
your school's Addison-Wesley representative for information on obtaining access
to this material. Comments and suggestions, as well as reports of errors,
can be sent to the authors through [email protected]
Acknowledgments Many people contributed to this book. We begin by
acknowledging our families to whom this book is dedicated. Without their
patience and support, this project would have been impossible.
We would like to thank the current and former students of our data mining
groups at the University of Minnesota and Michigan State for their contribu-
tions. Eui-Hong (Sam) Han and Mahesh Joshi helped with the initial data min-
ing classes. Some of the exercises and presentation slides that they created can
be found in the book and its accompanying slides. Students in our data min-
ing groups who provided comments on drafts of the book or who contributed
in other ways include Shyam Boriah, Haibin Cheng, Varun Chandola, Eric
Eilertson, Levent Ertoz, Jing Gao, Rohit Gupta, Sridhar Iyer, Jung-Eun Lee,
Benjamin Mayer, Aysel Ozgur, Uygar Oztekin, Gaurav Pandey, Kashif Riaz,
Jerry Scripps, Gyorgy Simon, Hui Xiong, Jieping Ye, and Pusheng Zhang. We
would also like to thank the students of our data mining classes at the University
of Minnesota and Michigan State University who worked with early drafts
of the book and provided invaluable feedback. We specifically note the helpful
suggestions of Bernardo Craemer, Arifin Ruslim, Jamshid Vayghan, and Yu
Wei.
Joydeep Ghosh (University of Texas) and Sanjay Ranka (University of
Florida) class tested early versions of the book. We also received many useful
suggestions directly from the following UT students: Pankaj Adhikari, Rajiv
Bhatia, Frederic Bosche, Arindam Chakraborty, Meghana Deodhar, Chris
Everson, David Gardner, Saad Godil, Todd Hay, Clint Jones, Ajay Joshi,
Joonsoo Lee, Yue Luo, Anuj Nanavati, Tyler Olsen, Sunyoung Park, Aashish
Phansalkar, Geoff Prewett, Michael Ryoo, Daryl Shannon, and Mei Yang.
Ronald Kostoff (ONR) read an early version of the clustering chapter and
offered numerous suggestions. George Karypis provided invaluable LaTeX assistance
in creating an author index. Irene Moulitsas also provided assistance
with LaTeX and reviewed some of the appendices. Musetta Steinbach was very
helpful in finding errors in the figures.
We would like to acknowledge our colleagues at the University of Min-
nesota and Michigan State who have helped create a positive environment for
data mining research. They include Dan Boley, Joyce Chai, Anil Jain, Ravi
Janardan, Rong Jin, George Karypis, Haesun Park, William F. Punch, Shashi
Shekhar, and Jaideep Srivastava. The collaborators on our many data mining
projects, who also have our gratitude, include Ramesh Agrawal, Steve Cannon,
Piet C. de Groen, Fran Hill, Yongdae Kim, Steve Klooster, Kerry Long,
Nihar Mahapatra, Chris Potter, Jonathan Shapiro, Kevin Silverstein, Nevin
Young, and Zhi-Li Zhang.
The departments of Computer Science and Engineering at the University of
Minnesota and Michigan State University provided computing resources and
a supportive environment for this project. ARDA, ARL, ARO, DOE, NASA,
and NSF provided research support for Pang-Ning Tan, Michael Steinbach,
and Vipin Kumar. In particular, Kamal Abdali, Dick Brackney, Jagdish Chandra,
Joe Coughlan, Michael Coyle, Stephen Davis, Frederica Darema, Richard
Hirsch, Chandrika Kamath, Raju Namburu, N. Radhakrishnan, James Sidoran,
Bhavani Thuraisingham, Walt Tiernin, Maria Zemankova, and Xiaodong
Zhang have been supportive of our research in data mining and high-performance
computing.
It was a pleasure working with the helpful staff at Pearson Education. In
particular, we would like to thank Michelle Brown, Matt Goldstein, Katherine
Harutunian, Marilyn Lloyd, Kathy Smith, and Joyce Wells. We would also
like to thank George Nichols, who helped with the art work and Paul Anagnostopoulos,
who provided LaTeX support. We are grateful to the following
Pearson reviewers: Chien-Chung Chan (University of Akron), Zhengxin Chen
(University of Nebraska at Omaha), Chris Clifton (Purdue University), Joy-
deep Ghosh (University of Texas, Austin), Nazli Goharian (Illinois Institute
of Technology), J. Michael Hardin (University of Alabama), James Hearne
(Western Washington University), Hillol Kargupta (University of Maryland,
Baltimore County and Agnik, LLC), Eamonn Keogh (University of California-
Riverside), Bing Liu (University of Illinois at Chicago), Mariofanna Milanova
(University of Arkansas at Little Rock), Srinivasan Parthasarathy (Ohio State
University), Zbigniew W. Ras (University of North Carolina at Charlotte),
Xintao Wu (University of North Carolina at Charlotte), and Mohammed J.
Zaki (Rensselaer Polytechnic Institute).
Contents
Preface
1 Introduction
1.1 What Is Data Mining?
1.2 Motivating Challenges
1.3 The Origins of Data Mining
1.4 Data Mining Tasks
1.5 Scope and Organization of the Book
1.6 Bibliographic Notes
1.7 Exercises
2 Data
2.1 Types of Data
2.1.1 Attributes and Measurement
2.1.2 Types of Data Sets
2.2 Data Quality
2.2.1 Measurement and Data Collection Issues
2.2.2 Issues Related to Applications
2.3 Data Preprocessing
2.3.1 Aggregation
2.3.2 Sampling
2.3.3 Dimensionality Reduction
2.3.4 Feature Subset Selection
2.3.5 Feature Creation
2.3.6 Discretization and Binarization
2.3.7 Variable Transformation
2.4 Measures of Similarity and Dissimilarity
2.4.1 Basics
2.4.2 Similarity and Dissimilarity between Simple Attributes
2.4.3 Dissimilarities between Data Objects
2.4.4 Similarities between Data Objects
2.4.5 Examples of Proximity Measures
2.4.6 Issues in Proximity Calculation
2.4.7 Selecting the Right Proximity Measure
2.5 Bibliographic Notes
2.6 Exercises
3 Exploring Data
3.1 The Iris Data Set
3.2 Summary Statistics
3.2.1 Frequencies and the Mode
3.2.2 Percentiles
3.2.3 Measures of Location: Mean and Median
3.2.4 Measures of Spread: Range and Variance
3.2.5 Multivariate Summary Statistics
3.2.6 Other Ways to Summarize the Data
3.3 Visualization
3.3.1 Motivations for Visualization
3.3.2 General Concepts
3.3.3 Techniques
3.3.4 Visualizing Higher-Dimensional Data
3.3.5 Do's and Don'ts
3.4 OLAP and Multidimensional Data Analysis
3.4.1 Representing Iris Data as a Multidimensional Array
3.4.2 Multidimensional Data: The General Case
3.4.3 Analyzing Multidimensional Data
3.4.4 Final Comments on Multidimensional Data Analysis
3.5 Bibliographic Notes
3.6 Exercises
4 Classification: Basic Concepts, Decision Trees, and Model Evaluation
4.1 Preliminaries
4.2 General Approach to Solving a Classification Problem
4.3 Decision Tree Induction
4.3.1 How a Decision Tree Works
4.3.2 How to Build a Decision Tree
4.3.3 Methods for Expressing Attribute Test Conditions
4.3.4 Measures for Selecting the Best Split
4.3.5 Algorithm for Decision Tree Induction
4.3.6 An Example: Web Robot Detection
4.3.7 Characteristics of Decision Tree Induction
4.4 Model Overfitting
4.4.1 Overfitting Due to Presence of Noise
4.4.2 Overfitting Due to Lack of Representative Samples
4.4.3 Overfitting and the Multiple Comparison Procedure
4.4.4 Estimation of Generalization Errors
4.4.5 Handling Overfitting in Decision Tree Induction
4.5 Evaluating the Performance of a Classifier
4.5.1 Holdout Method
4.5.2 Random Subsampling
4.5.3 Cross-Validation
4.5.4 Bootstrap
4.6 Methods for Comparing Classifiers
4.6.1 Estimating a Confidence Interval for Accuracy
4.6.2 Comparing the Performance of Two Models
4.6.3 Comparing the Performance of Two Classifiers
4.7 Bibliographic Notes
4.8 Exercises
5 Classification: Alternative Techniques
5.1 Rule-Based Classifier
5.1.1 How a Rule-Based Classifier Works
5.1.2 Rule-Ordering Schemes
5.1.3 How to Build a Rule-Based Classifier
5.1.4 Direct Methods for Rule Extraction
5.1.5 Indirect Methods for Rule Extraction
5.1.6 Characteristics of Rule-Based Classifiers
5.2 Nearest-Neighbor Classifiers
5.2.1 Algorithm
5.2.2 Characteristics of Nearest-Neighbor Classifiers
5.3 Bayesian Classifiers
5.3.1 Bayes Theorem
5.3.2 Using the Bayes Theorem for Classification
5.3.3 Naive Bayes Classifier
5.3.4 Bayes Error Rate
5.3.5 Bayesian Belief Networks
5.4 Artificial Neural Network (ANN)
5.4.1 Perceptron
5.4.2 Multilayer Artificial Neural Network
5.4.3 Characteristics of ANN
5.5 Support Vector Machine (SVM)
5.5.1 Maximum Margin Hyperplanes
5.5.2 Linear SVM: Separable Case
5.5.3 Linear SVM: Nonseparable Case
5.5.4 Nonlinear SVM
5.5.5 Characteristics of SVM
5.6 Ensemble Methods
5.6.1 Rationale for Ensemble Method
5.6.2 Methods for Constructing an Ensemble Classifier
5.6.3 Bias-Variance Decomposition
5.6.4 Bagging
5.6.5 Boosting
5.6.6 Random Forests
5.6.7 Empirical Comparison among Ensemble Methods
5.7 Class Imbalance Problem
5.7.1 Alternative Metrics
5.7.2 The Receiver Operating Characteristic Curve
5.7.3 Cost-Sensitive Learning
5.7.4 Sampling-Based Approaches
5.8 Multiclass Problem
5.9 Bibliographic Notes
5.10 Exercises
6 Association Analysis: Basic Concepts and Algorithms
6.1 Problem Definition
6.2 Frequent Itemset Generation
6.2.1 The Apriori Principle
6.2.2 Frequent Itemset Generation in the Apriori Algorithm
6.2.3 Candidate Generation and Pruning
6.2.4 Support Counting
6.2.5 Computational Complexity
6.3 Rule Generation
6.3.1 Confidence-Based Pruning
6.3.2 Rule Generation in Apriori Algorithm
6.3.3 An Example: Congressional Voting Records
6.4 Compact Representation of Frequent Itemsets
6.4.1 Maximal Frequent Itemsets
6.4.2 Closed Frequent Itemsets
6.5 Alternative Methods for Generating Frequent Itemsets
6.6 FP-Growth Algorithm
6.6.1 FP-Tree Representation
6.6.2 Frequent Itemset Generation in FP-Growth Algorithm
6.7 Evaluation of Association Patterns
6.7.1 Objective Measures of Interestingness
6.7.2 Measures beyond Pairs of Binary Variables
6.7.3 Simpson's Paradox
6.8 Effect of Skewed Support Distribution
6.9 Bibliographic Notes
6.10 Exercises
7 Association Analysis: Advanced Concepts
7.1 Handling Categorical Attributes
7.2 Handling Continuous Attributes
7.2.1 Discretization-Based Methods
7.2.2 Statistics-Based Methods
7.2.3 Non-discretization Methods
7.3 Handling a Concept Hierarchy
7.4 Sequential Patterns
7.4.1 Problem Formulation
7.4.2 Sequential Pattern Discovery
7.4.3 Timing Constraints
7.4.4 Alternative Counting Schemes
7.5 Subgraph Patterns
7.5.1 Graphs and Subgraphs
7.5.2 Frequent Subgraph Mining
7.5.3 Apriori-like Method
7.5.4 Candidate Generation
7.5.5 Candidate Pruning
7.5.6 Support Counting
7.6 Infrequent Patterns
7.6.1 Negative Patterns
7.6.2 Negatively Correlated Patterns
7.6.3 Comparisons among Infrequent Patterns, Negative Patterns, and Negatively Correlated Patterns
7.6.4 Techniques for Mining Interesting Infrequent Patterns
7.6.5 Techniques Based on Mining Negative Patterns
7.6.6 Techniques Based on Support Expectation
7.7 Bibliographic Notes
7.8 Exercises
8 Cluster Analysis: Basic Concepts and Algorithms
8.1 Overview
8.1.1 What Is Cluster Analysis?
8.1.2 Different Types of Clusterings
8.1.3 Different Types of Clusters
8.2 K-means
8.2.1 The Basic K-means Algorithm
8.2.2 K-means: Additional Issues
8.2.3 Bisecting K-means
8.2.4 K-means and Different Types of Clusters
8.2.5 Strengths and Weaknesses
8.2.6 K-means as an Optimization Problem
8.3 Agglomerative Hierarchical Clustering
8.3.1 Basic Agglomerative Hierarchical Clustering Algorithm
8.3.2 Specific Techniques
8.3.3 The Lance-Williams Formula for Cluster Proximity
8.3.4 Key Issues in Hierarchical Clustering
8.3.5 Strengths and Weaknesses
8.4 DBSCAN
8.4.1 Traditional Density: Center-Based Approach
8.4.2 The DBSCAN Algorithm
8.4.3 Strengths and Weaknesses
8.5 Cluster Evaluation
8.5.1 Overview
8.5.2 Unsupervised Cluster Evaluation Using Cohesion and Separation
8.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix
8.5.4 Unsupervised Evaluation of Hierarchical Clustering
8.5.5 Determining the Correct Number of Clusters
8.5.6 Clustering Tendency
8.5.7 Supervised Measures of Cluster Validity
8.5.8 Assessing the Significance of Cluster Validity Measures
8.6 Bibliographic Notes
8.7 Exercises
9 Cluster Analysis: Additional Issues and Algorithms
9.1 Characteristics of Data, Clusters, and Clustering Algorithms
9.1.1 Example: Comparing K-means and DBSCAN
9.1.2 Data Characteristics
9.1.3 Cluster Characteristics
9.1.4 General Characteristics of Clustering Algorithms
9.2 Prototype-Based Clustering
9.2.1 Fuzzy Clustering
9.2.2 Clustering Using Mixture Models
9.2.3 Self-Organizing Maps (SOM)
9.3 Density-Based Clustering
9.3.1 Grid-Based Clustering
9.3.2 Subspace Clustering
9.3.3 DENCLUE: A Kernel-Based Scheme for Density-Based Clustering
9.4 Graph-Based Clustering
9.4.1 Sparsification
9.4.2 Minimum Spanning Tree (MST) Clustering
9.4.3 OPOSSUM: Optimal Partitioning of Sparse Similarities Using METIS
9.4.4 Chameleon: Hierarchical Clustering with Dynamic Modeling
9.4.5 Shared Nearest Neighbor Similarity
9.4.6 The Jarvis-Patrick Clustering Algorithm
9.4.7 SNN Density
9.4.8 SNN Density-Based Clustering
9.5 Scalable Clustering Algorithms
9.5.1 Scalability: General Issues and Approaches
9.5.2 BIRCH
9.5.3 CURE
9.6 Which Clustering Algorithm?
9.7 Bibliographic Notes
9.8 Exercises
10 Anomaly Detection
10.1 Preliminaries
10.1.1 Causes of Anomalies
10.1.2 Approaches to Anomaly Detection
10.1.3 The Use of Class Labels
10.1.4 Issues
10.2 Statistical Approaches
10.2.1 Detecting Outliers in a Univariate Normal Distribution
10.2.2 Outliers in a Multivariate Normal Distribution
10.2.3 A Mixture Model Approach for Anomaly Detection
10.2.4 Strengths and Weaknesses
10.3 Proximity-Based Outlier Detection
10.3.1 Strengths and Weaknesses
10.4 Density-Based Outlier Detection
10.4.1 Detection of Outliers Using Relative Density
10.4.2 Strengths and Weaknesses
10.5 Clustering-Based Techniques
10.5.1 Assessing the Extent to Which an Object Belongs to a Cluster
10.5.2 Impact of Outliers on the Initial Clustering
10.5.3 The Number of Clusters to Use
10.5.4 Strengths and Weaknesses
10.6 Bibliographic Notes
10.7 Exercises
Appendix A Linear Algebra
A.1 Vectors
A.1.1 Definition
A.1.2 Vector Addition and Multiplication by a Scalar
A.1.3 Vector Spaces
A.1.4 The Dot Product, Orthogonality, and Orthogonal Projections
A.1.5 Vectors and Data Analysis
A.2 Matrices
A.2.1 Matrices: Definitions
A.2.2 Matrices: Addition and Multiplication by a Scalar
A.2.3 Matrices: Multiplication
A.2.4 Linear Transformations and Inverse Matrices
A.2.5 Eigenvalue and Singular Value Decomposition
A.2.6 Matrices and Data Analysis
A.3 Bibliographic Notes
Appendix B Dimensionality Reduction
B.1 PCA and SVD
B.1.1 Principal Components Analysis (PCA)
B.1.2 SVD
B.2 Other Dimensionality Reduction Techniques
B.2.1 Factor Analysis
B.2.2 Locally Linear Embedding (LLE)
B.2.3 Multidimensional Scaling, FastMap, and ISOMAP
B.2.4 Common Issues
B.3 Bibliographic Notes
Appendix C Probability and Statistics
C.1 Probability
C.1.1 Expected Values
C.2 Statistics
C.2.1 Point Estimation
C.2.2 Central Limit Theorem
C.2.3 Interval Estimation
C.3 Hypothesis Testing
Appendix D Regression
D.1 Preliminaries
D.2 Simple Linear Regression
D.2.1 Least Square Method
D.2.2 Analyzing Regression Errors
D.2.3 Analyzing Goodness of Fit
D.3 Multivariate Linear Regression
D.4 Alternative Least-Square Regression Methods
Appendix E Optimization
E.1 Unconstrained Optimization
E.1.1 Numerical Methods
E.2 Constrained Optimization
E.2.1 Equality Constraints
E.2.2 Inequality Constraints
Author Index
Subject Index
Copyright Permissions
1
Introduction
Rapid advances in data collection and storage technology have enabled or-
ganizations to accumulate vast amounts of data. However, extracting useful
information has proven extremely challenging. Often, traditional data analy-
sis tools and techniques cannot be used because of the massive size of a data
set. Sometimes, the non-traditional nature of the data means that traditional
approaches cannot be applied even if the data set is relatively small. In other
situations, the questions that need to be answered cannot be addressed using
existing data analysis techniques, and thus, new methods need to be devel-
oped.
Data mining is a technology that blends traditional data analysis methods
with sophisticated algorithms for processing large volumes of data. It has also
opened up exciting opportunities for exploring and analyzing new types of
data and for analyzing old types of data in new ways. In this introductory
chapter, we present an overview of data mining and outline the key topics
to be covered in this book. We start with a description of some well-known
applications that require new techniques for data analysis.
Business Point-of-sale data collection (bar code scanners, radio frequency
identification (RFID), and smart card technology) have allowed retailers to
collect up-to-the-minute data about customer purchases at the checkout coun-
ters of their stores. Retailers can utilize this information, along with other
business-critical data such as Web logs from e-commerce Web sites and cus-
tomer service records from call centers, to help them better understand the
needs of their customers and make more informed business decisions.
Data mining techniques can be used to support a wide range of business
intelligence applications such as customer profiling, targeted marketing, work-
flow management, store layout, and fraud detection. It can also help retailers
answer important business questions such as "Who are the most profitable
customers?", "What products can be cross-sold or up-sold?", and "What is the
revenue outlook of the company for next year?" Some of these questions …
pages):
Provide a description of an existing intervention in Canada
making the appropriate buying decisions in an ethical and professional manner.
Topic: Purchasing and Technology
You read about blockchain ledger technology. Now do some additional research out on the Internet and share your URL with the rest of the class
be aware of which features their competitors are opting to include so the product development teams can design similar or enhanced features to attract more of the market. The more unique
low (The Top Health Industry Trends to Watch in 2015) to assist you with this discussion.
https://youtu.be/fRym_jyuBc0
Next year the $2.8 trillion U.S. healthcare industry will finally begin to look and feel more like the rest of the business wo
evidence-based primary care curriculum. Throughout your nurse practitioner program
Vignette
Understanding Gender Fluidity
Providing Inclusive Quality Care
Affirming Clinical Encounters
Conclusion
References
Nurse Practitioner Knowledge
Mechanics
and word limit is unit as a guide only.
The assessment may be re-attempted on two further occasions (maximum three attempts in total). All assessments must be resubmitted 3 days within receiving your unsatisfactory grade. You must clearly indicate “Re-su
Trigonometry
Article writing
Other
5. June 29
After the components sending to the manufacturing house
1. In 1972 the Furman v. Georgia case resulted in a decision that would put action into motion. Furman was originally sentenced to death because of a murder he committed in Georgia but the court debated whether or not this was a violation of his 8th amend
One of the first conflicts that would need to be investigated would be whether the human service professional followed the responsibility to client ethical standard. While developing a relationship with client it is important to clarify that if danger or
Ethical behavior is a critical topic in the workplace because the impact of it can make or break a business
No matter which type of health care organization
With a direct sale
During the pandemic
Computers are being used to monitor the spread of outbreaks in different areas of the world and with this record
3. Furman v. Georgia is a U.S Supreme Court case that resolves around the Eighth Amendments ban on cruel and unsual punishment in death penalty cases. The Furman v. Georgia case was based on Furman being convicted of murder in Georgia. Furman was caught i
One major ethical conflict that may arise in my investigation is the Responsibility to Client in both Standard 3 and Standard 4 of the Ethical Standards for Human Service Professionals (2015). Making sure we do not disclose information without consent ev
4. Identify two examples of real world problems that you have observed in your personal
Summary & Evaluation: Reference & 188. Academic Search Ultimate
Ethics
We can mention at least one example of how the violation of ethical standards can be prevented. Many organizations promote ethical self-regulation by creating moral codes to help direct their business activities
*DDB is used for the first three years
For example
The inbound logistics for William Instrument refer to purchase components from various electronic firms. During the purchase process William need to consider the quality and price of the components. In this case
4. A U.S. Supreme Court case known as Furman v. Georgia (1972) is a landmark case that involved Eighth Amendment’s ban of unusual and cruel punishment in death penalty cases (Furman v. Georgia (1972)
With covid coming into place
In my opinion
with
Not necessarily all home buyers are the same! When you choose to work with we buy ugly houses Baltimore & nationwide USA
The ability to view ourselves from an unbiased perspective allows us to critically assess our personal strengths and weaknesses. This is an important step in the process of finding the right resources for our personal learning style. Ego and pride can be
· By Day 1 of this week
While you must form your answers to the questions below from our assigned reading material
CliftonLarsonAllen LLP (2013)
5 The family dynamic is awkward at first since the most outgoing and straight forward person in the family in Linda
Urien
The most important benefit of my statistical analysis would be the accuracy with which I interpret the data. The greatest obstacle
From a similar but larger point of view
4 In order to get the entire family to come back for another session I would suggest coming in on a day the restaurant is not open
When seeking to identify a patient’s health condition
After viewing the you tube videos on prayer
Your paper must be at least two pages in length (not counting the title and reference pages)
The word assimilate is negative to me. I believe everyone should learn about a country that they are going to live in. It doesnt mean that they have to believe that everything in America is better than where they came from. It means that they care enough
Data collection
Single Subject Chris is a social worker in a geriatric case management program located in a midsize Northeastern town. She has an MSW and is part of a team of case managers that likes to continuously improve on its practice. The team is currently using an
I would start off with Linda on repeating her options for the child and going over what she is feeling with each option. I would want to find out what she is afraid of. I would avoid asking her any “why” questions because I want her to be in the here an
Summarize the advantages and disadvantages of using an Internet site as means of collecting data for psychological research (Comp 2.1) 25.0\% Summarization of the advantages and disadvantages of using an Internet site as means of collecting data for psych
Identify the type of research used in a chosen study
Compose a 1
Optics
effect relationship becomes more difficult—as the researcher cannot enact total control of another person even in an experimental environment. Social workers serve clients in highly complex real-world environments. Clients often implement recommended inte
I think knowing more about you will allow you to be able to choose the right resources
Be 4 pages in length
soft MB-920 dumps review and documentation and high-quality listing pdf MB-920 braindumps also recommended and approved by Microsoft experts. The practical test
g
One thing you will need to do in college is learn how to find and use references. References support your ideas. College-level work must be supported by research. You are expected to do that for this paper. You will research
Elaborate on any potential confounds or ethical concerns while participating in the psychological study 20.0\% Elaboration on any potential confounds or ethical concerns while participating in the psychological study is missing. Elaboration on any potenti
3 The first thing I would do in the family’s first session is develop a genogram of the family to get an idea of all the individuals who play a major role in Linda’s life. After establishing where each member is in relation to the family
A Health in All Policies approach
Note: The requirements outlined below correspond to the grading criteria in the scoring guide. At a minimum
Chen
Read Connecting Communities and Complexity: A Case Study in Creating the Conditions for Transformational Change
Read Reflections on Cultural Humility
Read A Basic Guide to ABCD Community Organizing
Use the bolded black section and sub-section titles below to organize your paper. For each section
Losinski forwarded the article on a priority basis to Mary Scott
Losinksi wanted details on use of the ED at CGH. He asked the administrative resident