Data Mining for
Business Applications
Edited by
Longbing Cao Philip S. Yu Chengqi Zhang Huaifeng Zhang
Longbing Cao
School of Software
Faculty of Engineering and Information Technology
University of Technology, Sydney
PO Box 123
Broadway NSW 2007, Australia
[email protected]

Philip S. Yu
Department of Computer Science
University of Illinois at Chicago
851 S. Morgan St.
Chicago, IL 60607
[email protected]

Chengqi Zhang
Centre for Quantum Computation and Intelligent Systems
Faculty of Engineering and Information Technology
University of Technology, Sydney
PO Box 123
Broadway NSW 2007, Australia

Huaifeng Zhang
School of Software
Faculty of Engineering and Information Technology
University of Technology, Sydney
PO Box 123
Broadway NSW 2007, Australia
[email protected]

ISBN: 978-0-387-79419-8 e-ISBN: 978-0-387-79420-4 DOI: 10.1007/978-0-387-79420-4
Library of Congress Control Number: 2008933446
© 2009 Springer Science+Business Media, LLC
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
Printed on acid-free paper springer.com
Preface
This edited book, Data Mining for Business Applications, together with an upcoming monograph also by Springer, Domain Driven Data Mining, aims to present a full picture of the state-of-the-art research and development of actionable knowledge discovery (AKD) in real-world businesses and applications.
The book is triggered by the ubiquitous applications of data mining and knowledge discovery (KDD for short), and by the real-world challenges and complexities facing current KDD methodologies and techniques. As we have seen, and as is often addressed by panelists at SIGKDD and ICDM conferences, even though thousands of algorithms and methods have been published, very few of them have been validated in business use.
A major reason for the above situation, we believe, is the gap between academia and businesses, and the gap between academic research and real business needs.
Ubiquitous challenges and complexities from real-world complex problems can be categorized by the involvement of six types of intelligence (6Is), namely human roles and intelligence, domain knowledge and intelligence, network and web intelligence, organizational and social intelligence, in-depth data intelligence, and most importantly, the metasynthesis of the above intelligences.
It is certainly not our ambition to cover everything of the 6Is in this book. Rather, this edited book features the latest methodological, technical and practical progress on promoting the successful use of data mining in a collection of business domains.
The book consists of two parts, one on AKD methodologies and the other on novel AKD domains in business use.
In Part I, the book reports attempts and efforts in developing domain-driven workable AKD methodologies. This includes domain-driven data mining, post-processing rules for actions, domain-driven customer analytics, roles of human intelligence in AKD, maximal pattern-based clustering, and ontology mining.
Part II selects a large number of novel KDD domains and the corresponding techniques. This involves great efforts to develop effective techniques and tools for emergent areas and domains, including mining social security data, community security data, gene sequences, mental health information, traditional Chinese medicine data, cancer related data, blog data, sentiment information, web data, procedures, moving object trajectories, land use mapping, higher education, flight scheduling, and algorithmic asset management.
The intended audience of this book will mainly consist of researchers, research students and practitioners in data mining and knowledge discovery. The book is also of interest to researchers and industrial practitioners in areas such as knowledge engineering, human-computer interaction, artificial intelligence, intelligent information processing, decision support systems, knowledge management, and AKD project management.
Readers who are interested in actionable knowledge discovery in the real world may also refer to our monograph, Domain Driven Data Mining, which is scheduled to be published by Springer in 2009. The monograph will present our research outcomes on theoretical and technical issues in real-world actionable knowledge discovery, as well as working examples in financial data mining and social security mining.
We would like to convey our appreciation to all contributors, including the authors of the accepted chapters and the many other participants who submitted chapters that could not be included in the book due to space limits. Our special thanks go to Ms. Melissa Fearon and Ms. Valerie Schofield from Springer US for their kind support and great efforts in bringing the book to fruition. In addition, we appreciate all reviewers, and Ms. Shanshan Wu’s assistance in formatting the book.
Longbing Cao, Philip S. Yu, Chengqi Zhang, Huaifeng Zhang
July 2008
Contents
Part I Domain Driven KDD Methodology
1 Introduction to Domain Driven Data Mining
Longbing Cao
1.1 Why Domain Driven Data Mining
1.2 What Is Domain Driven Data Mining
1.2.1 Basic Ideas
1.2.2 D3M for Actionable Knowledge Discovery
1.3 Open Issues and Prospects
1.4 Conclusions
References
2 Post-processing Data Mining Models for Actionability
Qiang Yang
2.1 Introduction
2.2 Plan Mining for Class Transformation
2.2.1 Overview of Plan Mining
2.2.2 Problem Formulation
2.2.3 From Association Rules to State Spaces
2.2.4 Algorithm for Plan Mining
2.2.5 Summary
2.3 Extracting Actions from Decision Trees
2.3.1 Overview
2.3.2 Generating Actions from Decision Trees
2.3.3 The Limited Resources Case
2.4 Learning Relational Action Models from Frequent Action Sequences
2.4.1 Overview
2.4.2 ARMS Algorithm: From Association Rules to Actions
2.4.3 Summary of ARMS
2.5 Conclusions and Future Work
References
3 On Mining Maximal Pattern-Based Clusters
Jian Pei, Xiaoling Zhang, Moonjung Cho, Haixun Wang, and Philip S. Yu
3.1 Introduction
3.2 Problem Definition and Related Work
3.2.1 Pattern-Based Clustering
3.2.2 Maximal Pattern-Based Clustering
3.2.3 Related Work
3.3 Algorithms MaPle and MaPle+
3.3.1 An Overview of MaPle
3.3.2 Computing and Pruning MDS’s
3.3.3 Progressively Refining, Depth-first Search of Maximal pClusters
3.3.4 MaPle+: Further Improvements
3.4 Empirical Evaluation
3.4.1 The Data Sets
3.4.2 Results on Yeast Data Set
3.4.3 Results on Synthetic Data Sets
3.5 Conclusions
References
4 Role of Human Intelligence in Domain Driven Data Mining
Sumana Sharma and Kweku-Muata Osei-Bryson
4.1 Introduction
4.2 DDDM Tasks Requiring Human Intelligence
4.2.1 Formulating Business Objectives
4.2.2 Setting up Business Success Criteria
4.2.3 Translating Business Objective to Data Mining Objectives
4.2.4 Setting up of Data Mining Success Criteria
4.2.5 Assessing Similarity Between Business Objectives of New and Past Projects
4.2.6 Formulating Business, Legal and Financial Requirements
4.2.7 Narrowing down Data and Creating Derived Attributes
4.2.8 Estimating Cost of Data Collection, Implementation and Operating Costs
4.2.9 Selection of Modeling Techniques
4.2.10 Setting up Model Parameters
4.2.11 Assessing Modeling Results
4.2.12 Developing a Project Plan
4.3 Directions for Future Research
4.4 Summary
References
5 Ontology Mining for Personalized Search
Yuefeng Li and Xiaohui Tao
5.1 Introduction
5.2 Related Work
5.3 Architecture
5.4 Background Definitions
5.4.1 World Knowledge Ontology
5.4.2 Local Instance Repository
5.5 Specifying Knowledge in an Ontology
5.6 Discovery of Useful Knowledge in LIRs
5.7 Experiments
5.7.1 Experiment Design
5.7.2 Other Experiment Settings
5.8 Results and Discussions
5.9 Conclusions
References
Part II Novel KDD Domains & Techniques
6 Data Mining Applications in Social Security
Yanchang Zhao, Huaifeng Zhang, Longbing Cao, Hans Bohlscheid, Yuming Ou, and Chengqi Zhang
6.1 Introduction and Background
6.2 Case Study I: Discovering Debtor Demographic Patterns with Decision Tree and Association Rules
6.2.1 Business Problem and Data
6.2.2 Discovering Demographic Patterns of Debtors
6.3 Case Study II: Sequential Pattern Mining to Find Activity Sequences of Debt Occurrence
6.3.1 Impact-Targeted Activity Sequences
6.3.2 Experimental Results
6.4 Case Study III: Combining Association Rules from Heterogeneous Data Sources to Discover Repayment Patterns
6.4.1 Business Problem and Data
6.4.2 Mining Combined Association Rules
6.4.3 Experimental Results
6.5 Case Study IV: Using Clustering and Analysis of Variance to Verify the Effectiveness of a New Policy
6.5.1 Clustering Declarations with Contour and Clustering
6.5.2 Analysis of Variance
6.6 Conclusions and Discussion
References
7 Security Data Mining: A Survey Introducing Tamper-Resistance
Clifton Phua and Mafruz Ashrafi
7.1 Introduction
7.2 Security Data Mining
7.2.1 Definitions
7.2.2 Specific Issues
7.2.3 General Issues
7.3 Tamper-Resistance
7.3.1 Reliable Data
7.3.2 Anomaly Detection Algorithms
7.3.3 Privacy and Confidentiality Preserving Results
7.4 Conclusion
References
8 A Domain Driven Mining Algorithm on Gene Sequence Clustering
Yun Xiong, Ming Chen, and Yangyong Zhu
8.1 Introduction
8.2 Related Work
8.3 The Similarity Based on Biological Domain Knowledge
8.4 Problem Statement
8.5 A Domain-Driven Gene Sequence Clustering Algorithm
8.6 Experiments and Performance Study
8.7 Conclusion and Future Work
References
9 Domain Driven Tree Mining of Semi-structured Mental Health Information
Maja Hadzic, Fedja Hadzic, and Tharam S. Dillon
9.1 Introduction
9.2 Information Use and Management within Mental Health Domain
9.3 Tree Mining - General Considerations
9.4 Basic Tree Mining Concepts
9.5 Tree Mining of Medical Data
9.6 Illustration of the Approach
9.7 Conclusion and Future Work
References
10 Text Mining for Real-time Ontology Evolution
Jackei H.K. Wong, Tharam S. Dillon, Allan K.Y. Wong, and Wilfred W.K. Lin
10.1 Introduction
10.2 Related Text Mining Work
10.3 Terminology and Multi-representations
10.4 Master Aliases Table and OCOE Data Structures
10.5 Experimental Results
10.5.1 CAV Construction and Information Ranking
10.5.2 Real-Time CAV Expansion Supported by Text Mining
10.6 Conclusion
10.7 Acknowledgement
References
11 Microarray Data Mining: Selecting Trustworthy Genes with Gene Feature Ranking
Franco A. Ubaudi, Paul J. Kennedy, Daniel R. Catchpoole, Dachuan Guo, and Simeon J. Simoff
11.1 Introduction
11.2 Gene Feature Ranking
11.2.1 Use of Attributes and Data Samples in Gene Feature Ranking
11.2.2 Gene Feature Ranking: Feature Selection Phase 1
11.2.3 Gene Feature Ranking: Feature Selection Phase 2
11.3 Application of Gene Feature Ranking to Acute Lymphoblastic Leukemia data
11.4 Conclusion
References
12 Blog Data Mining for Cyber Security Threats
Flora S. Tsai and Kap Luk Chan
12.1 Introduction
12.2 Review of Related Work
12.2.1 Intelligence Analysis
12.2.2 Information Extraction from Blogs
12.3 Probabilistic Techniques for Blog Data Mining
12.3.1 Attributes of Blog Documents
12.3.2 Latent Dirichlet Allocation
12.3.3 Isometric Feature Mapping (Isomap)
12.4 Experiments and Results
12.4.1 Data Corpus
12.4.2 Results for Blog Topic Analysis
12.4.3 Blog Content Visualization
12.4.4 Blog Time Visualization
12.5 Conclusions
References
13 Blog Data Mining: The Predictive Power of Sentiments
Yang Liu, Xiaohui Yu, Xiangji Huang, and Aijun An
13.1 Introduction
13.2 Related Work
13.3 Characteristics of Online Discussions
13.3.1 Blog Mentions
13.3.2 Box Office Data and User Rating
13.3.3 Discussion
13.4 S-PLSA: A Probabilistic Approach to Sentiment Mining
13.4.1 Feature Selection
13.4.2 Sentiment PLSA
13.5 ARSA: A Sentiment-Aware Model
13.5.1 The Autoregressive Model
13.5.2 Incorporating Sentiments
13.6 Experiments
13.6.1 Experiment Settings
13.6.2 Parameter Selection
13.7 Conclusions and Future Work
References
14 Web Mining: Extracting Knowledge from the World Wide Web
Zhongzhi Shi, Huifang Ma, and Qing He
14.1 Overview of Web Mining Techniques
14.2 Web Content Mining
14.2.1 Classification: Multi-hierarchy Text Classification
14.2.2 Clustering Analysis: Clustering Algorithm Based on Swarm Intelligence and k-Means
14.2.3 Semantic Text Analysis: Conceptual Semantic Space
14.3 Web Structure Mining: PageRank vs. HITS
14.4 Web Event Mining
14.4.1 Preprocessing for Web Event Mining
14.4.2 Multi-document Summarization: A Way to Demonstrate Event’s Cause and Effect
14.5 Conclusions and Future Works
References
15 DAG Mining for Code Compaction
T. Werth, M. Wörlein, A. Dreweke, I. Fischer, and M. Philippsen
15.1 Introduction
15.2 Related Work
15.3 Graph and DAG Mining Basics
15.3.1 Graph–based versus Embedding–based Mining
15.3.2 Embedded versus Induced Fragments
15.3.3 DAG Mining Is NP–complete
15.4 Algorithmic Details of DAGMA
15.4.1 A Canonical Form for DAG enumeration
15.4.2 Basic Structure of the DAG Mining Algorithm
15.4.3 Expansion Rules
15.4.4 Application to Procedural Abstraction
15.5 Evaluation
15.6 Conclusion and Future Work
References
16 A Framework for Context-Aware Trajectory Data Mining
Vania Bogorny and Monica Wachowicz
16.1 Introduction
16.2 Basic Concepts
16.3 A Domain-driven Framework for Trajectory Data Mining
16.4 Case Study
16.4.1 The Selected Mobile Movement-aware Outdoor Game
16.4.2 Transportation Application
16.5 Conclusions and Future Trends
References
17 Census Data Mining for Land Use Classification
E. Roma Neto and D. S. Hamburger
17.1 Content Structure
17.2 Key Research Issues
17.3 Land Use and Remote Sensing
17.4 Census Data and Land Use Distribution
17.5 Census Data Warehouse and Spatial Data Mining
17.5.1 Concerning about Data Quality
17.5.2 Concerning about Domain Driven
17.5.3 Applying Machine Learning Tools
17.6 Data Integration
17.6.1 Area of Study and Data
17.6.2 Supported Digital Image Processing
17.6.3 Putting All Steps Together
17.7 Results and Analysis
References
18 Visual Data Mining for Developing Competitive Strategies in Higher Education
Gürdal Ertek
18.1 Introduction
18.2 Square Tiles Visualization
18.3 Related Work
18.4 Mathematical Model
18.5 Framework and Case Study
18.5.1 General Insights and Observations
18.5.2 Benchmarking
18.5.3 High School Relationship Management (HSRM)
18.6 Future Work
18.7 Conclusions
References
19 Data Mining For Robust Flight Scheduling
Ira Assent, Ralph Krieger, Petra Welter, Jörg Herbers, and Thomas Seidl
19.1 Introduction
19.2 Flight Scheduling in the Presence of Delays
19.3 Related Work
19.4 Classification of Flights
19.4.1 Subspaces for Locally Varying Relevance
19.4.2 Integrating Subspace Information for Robust Flight Classification
19.5 Algorithmic Concept
19.5.1 Monotonicity Properties of Relevant Attribute Subspaces
19.5.2 Top-down Class Entropy Algorithm: Lossless Pruning Theorem
19.5.3 Algorithm: Subspaces, Clusters, Subspace Classification
19.6 Evaluation of Flight Delay Classification in Practice
19.7 Conclusion
References
20 Data Mining for Algorithmic Asset Management
Giovanni Montana and Francesco Parrella
20.1 Introduction
20.2 Backbone of the Asset Management System
20.3 Expert-based Incremental Learning
20.4 An Application to the iShare Index Fund
References
Reviewer List
Index
List of Contributors
Longbing Cao
School of Software, University of Technology Sydney, Australia, e-mail: [email protected]
Qiang Yang
Department of Computer Science and Engineering, Hong Kong University of Science and Technology, e-mail: [email protected]
Jian Pei
Simon Fraser University, e-mail: [email protected]
Xiaoling Zhang
Boston University, e-mail: [email protected]
Moonjung Cho
Prism Health Networks, e-mail: [email protected]
Haixun Wang
IBM T.J. Watson Research Center, e-mail: [email protected]
Philip S. Yu
University of Illinois at Chicago, e-mail: [email protected]
Sumana Sharma
Virginia Commonwealth University, e-mail: [email protected]
Kweku-Muata Osei-Bryson
Virginia Commonwealth University, e-mail: [email protected]
Yuefeng Li
Information Technology, Queensland University of Technology, Australia, e-mail: [email protected]
Xiaohui Tao
Information Technology, Queensland University of Technology, Australia
Yanchang Zhao
Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia, e-mail: [email protected]
Huaifeng Zhang
Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia, e-mail: [email protected]
Yuming Ou
Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia, e-mail: [email protected]
Chengqi Zhang
Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia, e-mail: [email protected]
Hans Bohlscheid
Data Mining Section, Business Integrity Programs Branch, Centrelink, Australia, e-mail: [email protected]
Clifton Phua
A*STAR, Institute of Infocomm Research, Room 04-21 (+6568748406), 21, Heng Mui Keng Terrace, Singapore 119613, e-mail: [email protected]
Mafruz Ashrafi
A*STAR, Institute of Infocomm Research, Room 04-21 (+6568748406), 21, Heng Mui Keng Terrace, Singapore 119613, e-mail: [email protected]
Yun Xiong
Department of Computing and Information Technology, Fudan University, Shanghai 200433, China, e-mail: [email protected]
Ming Chen
Department of Computing and Information Technology, Fudan University, Shanghai 200433, China, e-mail: [email protected]
Yangyong Zhu
Department of Computing and Information Technology, Fudan University, Shanghai 200433, China, e-mail: [email protected]
Maja Hadzic
Digital Ecosystems and Business Intelligence Institute (DEBII), Curtin University of Technology, Australia, e-mail: [email protected]
Fedja Hadzic
Digital Ecosystems and Business Intelligence Institute (DEBII), Curtin University of Technology, Australia, e-mail: [email protected]
Tharam S. Dillon
Digital Ecosystems and Business Intelligence Institute (DEBII), Curtin University of Technology, Australia, e-mail: [email protected]
Jackei H.K. Wong
Department of Computing, Hong Kong Polytechnic University, Hong Kong SAR, e-mail: [email protected]
Allan K.Y. Wong
Department of Computing, Hong Kong Polytechnic University, Hong Kong SAR, e-mail: [email protected]
Wilfred W.K. Lin
Department of Computing, Hong Kong Polytechnic University, Hong Kong SAR, e-mail: [email protected]
Franco A. Ubaudi
Faculty of IT, University of Technology, Sydney, e-mail: [email protected]
Paul J. Kennedy
Faculty of IT, University of Technology, Sydney, e-mail: [email protected]
Daniel R. Catchpoole
Tumour Bank, The Children’s Hospital at Westmead, e-mail: DanielC@chw.edu.au
Dachuan Guo
Tumour Bank, The Children’s Hospital at Westmead, e-mail: dachuang@chw.edu.au
Simeon J. Simoff
University of Western Sydney, e-mail: [email protected]
Flora S. Tsai
Nanyang Technological University, Singapore, e-mail: [email protected]
Kap Luk Chan
Nanyang Technological University, Singapore, e-mail: [email protected]
Yang Liu
Department of Computer Science and Engineering, York University, Toronto, ON, Canada M3J 1P3, e-mail: [email protected]
Xiaohui Yu
School of Information Technology, York University, Toronto, ON, Canada M3J 1P3, e-mail: [email protected]
Xiangji Huang
School of Information Technology, York University, Toronto, ON, Canada M3J 1P3, e-mail: [email protected]
Aijun An
Department of Computer Science and Engineering, York University, Toronto, ON, Canada M3J 1P3, e-mail: [email protected]
Zhongzhi Shi
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, No. 6 Kexueyuan Nanlu, Beijing 100080, People’s Republic of China, e-mail: [email protected]
Huifang Ma
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, No. 6 Kexueyuan Nanlu, Beijing 100080, People’s Republic of China, e-mail: [email protected]
Qing He
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, No. 6 Kexueyuan Nanlu, Beijing 100080, People’s Republic of China, e-mail: [email protected]
T. Werth
Programming Systems Group, Computer Science Department, University of Erlangen–Nuremberg, Germany, phone: +49 9131 85-28865, e-mail: [email protected]
M. Wörlein
Programming Systems Group, Computer Science Department, University of Erlangen–Nuremberg, Germany, phone: +49 9131 85-28865, e-mail: [email protected]
A. Dreweke
Programming Systems Group, Computer Science Department, University of Erlangen–Nuremberg, Germany, phone: +49 9131 85-28865, e-mail: [email protected]
M. Philippsen
Programming Systems Group, Computer Science Department, University of Erlangen–Nuremberg, Germany, phone: +49 9131 85-28865, e-mail: [email protected]
I. Fischer
Nycomed Chair for Bioinformatics and Information Mining, University of Konstanz, Germany, phone: +49 7531 88-5016, e-mail: Ingrid.Fischer@inf.uni-konstanz.de
Vania Bogorny
Instituto de Informatica, Universidade Federal do Rio Grande do Sul (UFRGS), Av. Bento Gonçalves, 9500 - Campus do Vale - Bloco IV, Bairro Agronomia - Porto Alegre - RS - Brasil, CEP 91501-970 Caixa Postal: 15064, e-mail: [email protected]
Monica Wachowicz
ETSI Topografia, Geodesia y Cartografía, Universidad Politecnica de Madrid, KM 7,5 de la Autovia de Valencia, E-28031 Madrid - Spain, e-mail: m.wachowicz@topografia.upm.es
E. Roma Neto
Av. Eng. Eusébio Stevaux, 823 - 04696-000, São Paulo, SP, Brazil, e-mail: [email protected]
D. S. Hamburger
Av. Eng. Eusébio Stevaux, 823 - 04696-000, São Paulo, SP, Brazil, e-mail: [email protected]
Gürdal Ertek
Sabancı University, Faculty of Engineering and Natural Sciences, Orhanlı, Tuzla, 34956, Istanbul, Turkey, e-mail: [email protected]
Ira Assent
Data Management and Exploration Group, RWTH Aachen University, Germany, phone: +492418021910, e-mail: [email protected]
Ralph Krieger
Data Management and Exploration Group, RWTH Aachen University, Germany, phone: +492418021910, e-mail: [email protected]
Thomas Seidl
Data Management and Exploration Group, RWTH Aachen University, Germany, phone: +492418021910, e-mail: [email protected]
Petra Welter
Dept. of Medical Informatics, RWTH Aachen University, Germany, e-mail: [email protected]
Jörg Herbers
INFORM GmbH, Pascalstraße 23, Aachen, Germany, e-mail: joerg.herbers@inform-ac.com
Giovanni Montana
Imperial College London, Department of Mathematics, 180 Queen’s Gate, London SW7 2AZ, UK, e-mail: [email protected]
Francesco Parrella
Imperial College London, Department of Mathematics, 180 Queen’s Gate, London SW7 2AZ, UK, e-mail: [email protected]
Introduction to Domain Driven Data Mining
Longbing Cao
Abstract Mainstream data mining faces critical challenges and lacks the soft power to solve real-world complex problems when deployed. Following the paradigm shift from ‘data mining’ to ‘knowledge discovery’, we believe much more thorough efforts are essential for promoting the wide acceptance and employment of knowledge discovery in real-world smart decision making. To this end, we expect a new paradigm shift from ‘data-centered knowledge discovery’ to ‘domain-driven actionable knowledge discovery’. In domain-driven actionable knowledge discovery, ubiquitous intelligence must be involved and meta-synthesized into the mining process, and an actionable knowledge discovery-based problem-solving system is formed as the space for data mining. This is the motivation and aim of developing Domain Driven Data Mining (D3M for short). This chapter briefly presents the main reasons, ideas and open issues in D3M.
1.1 Why Domain Driven Data Mining
Data mining and knowledge discovery (data mining or KDD for short) [9] has emerged as one of the most vibrant areas in information technology in the last decade. It has driven major academic and industrial activity across many traditional areas such as machine learning, databases, and statistics, as well as emergent disciplines, for example, bioinformatics. As a result, the KDD community has published thousands of algorithms and methods, as widely seen in regular conferences and workshops at international, regional and national levels.
Compared with this boom in academia, data mining applications in the real world have not been as active, vibrant and attractive as academic research. This is evident from the extreme imbalance between the number of published algorithms and the number that are really workable in the business environment. That is to say, there is a big gap between academic objectives and business goals, and between academic outputs and business expectations. However, this runs in the opposite direction of KDD’s original intention and its nature. It is also against the value of KDD as a discipline, which lies in enabling smart businesses and developing business intelligence for smart decisions in production and living environments.
If we scrutinize the reasons for the existing gaps, we can probably point out many things. For instance, academic researchers do not really know the needs of business people, and are not familiar with the business environment. After many years of development of this promising scientific field, it is timely and worthwhile to review the major issues blocking the wide adoption of KDD in business use.
Soon after the origin of data mining, researchers with strong industrial engagement realized the need to move from ‘data mining’ to ‘knowledge discovery’ [1, 7, 8] in order to deliver useful knowledge for business decision-making. Nevertheless, many researchers, in particular early career researchers in KDD, are still only or mainly focusing on ‘data mining’, namely mining for patterns in data. The main reason for this dominant situation, whether explicit or implicit, lies in the field’s originally narrow focus and its overemphasis on innovative algorithm-driven research (unfortunately, we do not yet hold as many effective algorithms as we need in real-world applications).
Knowledge discovery is further expected to migrate into actionable knowledge discovery (AKD). AKD targets knowledge that can be delivered in the form of business-friendly and decision-making actions, and can be taken over by business people seamlessly. However, AKD is still a big challenge to current KDD research and development. Reasons surrounding the challenge of AKD include many critical aspects on both the macro-level and the micro-level.
On the macro-level, issues are related to methodological and fundamental aspects, for instance,
• An intrinsic difference existing in academic thinking and business deliverable expectation; for example, researchers usually are interested in innovative pattern types, while practitioners care about getting a problem solved;
• The paradigm of KDD, whether as a hidden pattern mining process centered by data, or an AKD-based problem-solving system; the latter emphasizes not only innovation but also impact of KDD deliverables.
The micro-level issues are more related to technical and engineering aspects, for instance,
• If KDD is an AKD-based problem-solving system, we then need to care about many issues such as system dynamics, system environment, and interaction in a system;
• If AKD is the target, we then have to cater for real-world aspects such as business processes, organizational factors, and constraints.
In scrutinizing both the macro-level and micro-level issues in AKD, we propose a new KDD methodology on top of the traditional data-centered pattern mining framework, that is, Domain Driven Data Mining (D3M) [2, 4, 5]. In the next section, we introduce the main idea of D3M.
1.2 What Is Domain Driven Data Mining
1.2.1 Basic Ideas
The motivation of D3M is to view KDD as AKD-based problem-solving systems through developing effective methodologies, methods and tools. The aim of D3M is to make the AKD system deliver business-friendly and decision-making rules and actions that are of solid technical significance as well. To this end, D3M caters for the effective involvement of the following ubiquitous intelligence surrounding AKD-based problem-solving.
• Data Intelligence tells stories hidden in the data about a business problem.
• Domain Intelligence refers to domain resources that not only wrap a problem and its target data but also assist in understanding and solving the problem. Domain intelligence consists of qualitative and quantitative intelligence. Both types of intelligence are instantiated in terms of aspects such as domain knowledge, background information, constraints, organizational factors and business processes, as well as environment intelligence, business expectation and interestingness.
• Network Intelligence refers to both web intelligence and broad-based network intelligence such as distributed information and resources, linkages, searching, and structured information from textual data.
• Human Intelligence refers to (1) explicit or direct involvement of humans such as empirical knowledge, belief, intention and expectation, run-time supervision, evaluation, and expert groups; (2) implicit or indirect involvement of human intelligence such as imaginary thinking, emotional intelligence, inspiration, brainstorming, and reasoning inputs.
• Social Intelligence consists of interpersonal intelligence, emotional intelligence, social cognition, consensus construction, group decision, as well as organizational factors, business process, workflow, project management and delivery, social network intelligence, collective interaction, business rules, law, trust and so on.
• Intelligence Metasynthesis: the above ubiquitous intelligence has to be combined for problem-solving. The methodology for combining such intelligence is called metasynthesis [10, 11], which provides a human-centered and human-machine-cooperated problem-solving process by involving, synthesizing and using ubiquitous intelligence surrounding AKD as needed for problem-solving.
1.2.2 D3M for Actionable Knowledge Discovery
Real-world data mining is a complex problem-solving system. From the view of systems and microeconomy, the endogenous character of actionable knowledge discovery (AKD) determines that it is an optimization problem with certain objectives in a particular environment. We present a formal definition of AKD in this section.
We first define several notions as follows.
Let $DB$ be a database collected from business problems ($\Psi$), and let $X = \{x_1, x_2, \cdots, x_L\}$ be the set of items in $DB$, where $x_l$ ($l = 1, \ldots, L$) is an itemset and the number of attributes ($v$) in $DB$ is $S$. Suppose $E = \{e_1, e_2, \cdots, e_K\}$ denotes the environment set, where $e_k$ represents a particular environment setting for AKD. Further, let $M = \{m_1, m_2, \cdots, m_N\}$ be the data mining method set, where $m_n$ ($n = 1, \ldots, N$) is a method. For the method $m_n$, suppose its identified pattern set $P^{m_n} = \{p_1^{m_n}, p_2^{m_n}, \cdots, p_U^{m_n}\}$ includes all patterns discovered in $DB$, where $p_u^{m_n}$ ($u = 1, \ldots, U$) denotes a pattern discovered by the method $m_n$.
In the real world, data mining is a problem-solving process from business problems ($\Psi$, with problem status $\tau$) to problem-solving solutions ($\Phi$):
\[ \Psi \rightarrow \Phi \tag{1.1} \]
From the modeling perspective, such a problem-solving process is a state transformation process from source data $DB$ ($\Psi \rightarrow DB$) to resulting pattern set $P$ ($\Phi \rightarrow P$):
\[ \Psi \rightarrow \Phi :: DB(v_1, \ldots, v_S) \rightarrow P(f_1, \ldots, f_Q) \tag{1.2} \]
where $v_s$ ($s = 1, \ldots, S$) are attributes in the source data $DB$, while $f_q$ ($q = 1, \ldots, Q$) are features used for mining the pattern set $P$.
Definition 1.1. (Actionable Patterns)
Let $\widetilde{P} = \{\tilde{p}_1, \tilde{p}_2, \cdots, \tilde{p}_Z\}$ be an Actionable Pattern Set mined by method $m_n$ for the given problem $\Psi$ (its data set is $DB$), in which each pattern $\tilde{p}_z$ is actionable for the problem-solving if it satisfies the following conditions:
1.a. $t_i(\tilde{p}_z) \geq t_{i,0}$; indicating that the pattern $\tilde{p}_z$ satisfies technical interestingness $t_i$ with threshold $t_{i,0}$;
1.b. $b_i(\tilde{p}_z) \geq b_{i,0}$; indicating that the pattern $\tilde{p}_z$ satisfies business interestingness $b_i$ with threshold $b_{i,0}$;
1.c. $R: \tau_1 \xrightarrow{A,\, m_n(\tilde{p}_z)} \tau_2$; the pattern can support business problem-solving ($R$) by taking action $A$, and correspondingly transform the problem status from the initially nonoptimal state $\tau_1$ to the greatly improved state $\tau_2$.
Therefore, the discovery of actionable knowledge (AKD) on data set $DB$ is an iterative optimization process toward the actionable pattern set $\widetilde{P}$:
\[ AKD: DB \xrightarrow{e,\tau,m_1} P_1 \xrightarrow{e,\tau,m_2} P_2 \cdots \xrightarrow{e,\tau,m_n} \widetilde{P} \tag{1.3} \]
Definition 1.2. (Actionable Knowledge Discovery)
Actionable Knowledge Discovery (AKD) is the procedure to find the Actionable Pattern Set $\widetilde{P}$ through employing all valid methods $M$. Its mathematical description is as follows:
\[ AKD^{m_i \in M} \longrightarrow O_{p \in P} Int(p), \tag{1.4} \]
where $P = P^{m_1} \cup P^{m_2} \cup \cdots \cup P^{m_n}$, $Int(.)$ is the evaluation function, and $O(.)$ is the optimization function to extract those $\tilde{p} \in P$ where $Int(\tilde{p})$ can beat a given benchmark.
For a pattern $p$, $Int(p)$ can be further measured in terms of technical interestingness ($t_i(p)$) and business interestingness ($b_i(p)$) [3]:
\[ Int(p) = I(t_i(p), b_i(p)) \tag{1.5} \]
where $I(.)$ is the function for aggregating the contributions of all particular aspects of interestingness.
Further, $Int(p)$ can be described in terms of objective ($o$) and subjective ($s$) factors from both technical ($t$) and business ($b$) perspectives:
\[ Int(p) = I(t_o(), t_s(), b_o(), b_s()) \tag{1.6} \]
where $t_o()$ is objective technical interestingness, $t_s()$ is subjective technical interestingness, $b_o()$ is objective business interestingness, and $b_s()$ is subjective business interestingness.
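We say $p$ is truly actionable (i.e., $\tilde{p}$) both to academia and business if it satisfies the following condition:
\[ Int(p) = t_o(x, p) \wedge t_s(x, p) \wedge b_o(x, p) \wedge b_s(x, p) \tag{1.7} \]
where $I \rightarrow$ '$\wedge$' indicates the 'aggregation' of the interestingness.
In general, $t_o()$, $t_s()$, $b_o()$ and $b_s()$ of practical applications can be regarded as independent of each other. With their normalization (expressed by $\hat{\ }$), we can get the following:
\[ Int(p) \rightarrow I(\hat{t}_o(), \hat{t}_s(), \hat{b}_o(), \hat{b}_s()) = \alpha \hat{t}_o() + \beta \hat{t}_s() + \gamma \hat{b}_o() + \delta \hat{b}_s() \tag{1.8} \]
So, the AKD optimization problem can be expressed as follows:
\[ AKD^{e,\tau,m \in M} \longrightarrow O_{p \in P}(Int(p)) \rightarrow O(\alpha \hat{t}_o()) + O(\beta \hat{t}_s()) + O(\gamma \hat{b}_o()) + O(\delta \hat{b}_s()) \tag{1.9} \]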
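The weighted aggregation in Eq. (1.8) can be illustrated with a small sketch. The code below is only one possible reading: the class name `Interestingness`, the helper names, the equal default weights and the sample scores are all assumptions made for illustration, and the optimization function O(.) is interpreted here simply as taking the maximum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interestingness:
    """Normalized interestingness scores of one pattern, each assumed to lie in [0, 1]."""
    t_o: float  # objective technical interestingness
    t_s: float  # subjective technical interestingness
    b_o: float  # objective business interestingness
    b_s: float  # subjective business interestingness


def aggregate(scores, alpha=0.25, beta=0.25, gamma=0.25, delta=0.25):
    """Weighted sum in the spirit of Eq. (1.8); the equal weights are an assumption."""
    return (alpha * scores.t_o + beta * scores.t_s
            + gamma * scores.b_o + delta * scores.b_s)


def best_pattern(patterns, **weights):
    """Return the pattern name with the largest aggregated score (O read as 'max')."""
    return max(patterns, key=lambda name: aggregate(patterns[name], **weights))


# Hypothetical scores: p1 is technically strong, p2 is balanced.
patterns = {
    "p1": Interestingness(t_o=0.9, t_s=0.8, b_o=0.2, b_s=0.1),
    "p2": Interestingness(t_o=0.6, t_s=0.5, b_o=0.7, b_s=0.6),
}
print(best_pattern(patterns))
```

With equal weights the balanced pattern wins; putting all the weight on `alpha` would favor the technically strong one, which shows how the choice of weights encodes the tradeoff between technical and business concerns.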
Definition 1.3. (Actionability of a Pattern)
The actionability of a pattern $p$ is measured by $act(p)$:
\[ act(p) = O_{p \in P}(Int(p)) \rightarrow O(\alpha \hat{t}_o(p)) + O(\beta \hat{t}_s(p)) + O(\gamma \hat{b}_o(p)) + O(\delta \hat{b}_s(p)) \rightarrow t_o^{act} + t_s^{act} + b_o^{act} + b_s^{act} \rightarrow t_i^{act} + b_i^{act} \tag{1.10} \]
where $t_o^{act}$, $t_s^{act}$, $b_o^{act}$ and $b_s^{act}$ measure the respective actionable performance in terms of each interestingness element.
Due to the inconsistency that often exists among different aspects, we often find that the identified patterns fit only one of the following subsets:
\[ Int(p) \rightarrow \{\{t_i^{act}, b_i^{act}\}, \{\neg t_i^{act}, b_i^{act}\}, \{t_i^{act}, \neg b_i^{act}\}, \{\neg t_i^{act}, \neg b_i^{act}\}\} \tag{1.11} \]
where '$\neg$' indicates that the corresponding element is not satisfactory.
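As a minimal sketch of the four subsets in Eq. (1.11), the code below places a pattern in a subset by comparing its technical and business interestingness against thresholds, as in conditions 1.a and 1.b of Definition 1.1. The function names, the threshold values and the sample scores are hypothetical.

```python
def classify_pattern(ti, bi, ti0, bi0):
    """Place a pattern in one of the four subsets of Eq. (1.11).

    ti, bi   : technical and business interestingness of the pattern
    ti0, bi0 : thresholds, as in conditions 1.a and 1.b of Definition 1.1
    """
    technical = "ti_act" if ti >= ti0 else "not ti_act"
    business = "bi_act" if bi >= bi0 else "not bi_act"
    return technical, business


def actionable(patterns, ti0, bi0):
    """Keep only the patterns that fall in the {ti_act, bi_act} subset."""
    return [name for name, (ti, bi) in patterns.items()
            if classify_pattern(ti, bi, ti0, bi0) == ("ti_act", "bi_act")]


# Hypothetical (ti, bi) scores for four patterns, one per subset.
scores = {"p1": (0.9, 0.2), "p2": (0.7, 0.8), "p3": (0.3, 0.9), "p4": (0.1, 0.1)}
print(actionable(scores, ti0=0.5, bi0=0.5))
```

Only the pattern that passes both thresholds is retained; the other three illustrate the technically interesting but business-irrelevant case, its converse, and the case where neither side is satisfied.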
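Ideally, we look for actionable patterns $p$ that can satisfy the following:
IF
\[ \forall p \in P, \exists x: t_o(x, p) \wedge t_s(x, p) \wedge b_o(x, p) \wedge b_s(x, p) \rightarrow act(p) \tag{1.12} \]
THEN:
\[ p \rightarrow \tilde{p}. \tag{1.13} \]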
However, in real-world mining, it is very challenging to find the most actionable patterns that are associated with both 'optimal' $t_i^{act}$ and $b_i^{act}$. Quite often a pattern with significant $t_i()$ is associated with a low-confidence $b_i()$. Conversely, it is not rare that patterns with low $t_i()$ are associated with a confident $b_i()$. Clearly, AKD targets patterns confirming the relationship $\{t_i^{act}, b_i^{act}\}$.
Therefore, it is necessary to deal with such possible conflict and uncertainty amongst the respective interestingness elements. However, this is something of an art and needs to involve domain knowledge and domain experts to tune thresholds and balance the difference between $t_i()$ and $b_i()$. Another issue is to develop techniques to balance and combine all types of interestingness metrics to generate uniform, balanced and interpretable mechanisms for measuring knowledge deliverability and for extracting and selecting resulting patterns. A reasonable way is to balance both sides toward an acceptable tradeoff. To this end, we need to develop interestingness aggregation methods, namely the I-function (or '$\wedge$'), to aggregate all elements of interestingness. In fact, each of the interestingness categories may be instantiated into more than one metric. There could be several methods of doing the aggregation, for instance, empirical methods such as business expert-based voting, or more quantitative methods such as multi-objective optimization methods.
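One simple instance of a multi-objective approach is to keep the patterns that are not dominated on both technical and business interestingness (a Pareto front), leaving the final choice among them to domain experts. The sketch below assumes each pattern has a single (ti, bi) score pair; the function name and the sample scores are hypothetical, and this is only one of many possible aggregation strategies.

```python
def pareto_front(patterns):
    """Return the names of patterns that no other pattern dominates on (ti, bi).

    A pattern is dominated when another pattern is at least as good on both
    scores and strictly better on at least one of them.
    """
    front = []
    for name, (ti, bi) in patterns.items():
        dominated = any(
            o_ti >= ti and o_bi >= bi and (o_ti > ti or o_bi > bi)
            for other, (o_ti, o_bi) in patterns.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical (ti, bi) scores; p3 is worse than p2 on both counts.
candidates = {"p1": (0.9, 0.2), "p2": (0.6, 0.7), "p3": (0.5, 0.6), "p4": (0.3, 0.9)}
print(pareto_front(candidates))
```

The front keeps every defensible tradeoff and discards only patterns that are worse on both sides, so no weights have to be fixed in advance.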
1.3 Open Issues and Prospects
To effectively synthesize the above ubiquitous intelligence in AKD-based problem-solving systems, many research issues need to be studied or revisited.
• Typical research issues and techniques in Data Intelligence include mining in-depth data patterns, and mining structured knowledge in unstructured data.
• Typical research issues and techniques in Domain Intelligence consist of representation, modeling and involvement of domain knowledge, constraints, organizational factors, and business interestingness.
• Typical research issues and techniques in Network Intelligence include information retrieval, text mining, web mining, semantic web, ontological engineering techniques, and web knowledge management.
• Typical research issues and techniques in Human Intelligence include human-machine interaction, and representation and involvement of empirical and implicit knowledge.
• Typical research issues and techniques in Social Intelligence include collective intelligence, social network analysis, and social cognition interaction.
• Typical issues in Intelligence Metasynthesis consist of building metasynthetic interaction (m-interaction) as the working mechanism, and metasynthetic space (m-space) as an AKD-based problem-solving system [6].
Typical issues in actionable knowledge discovery through m-spaces consist of:
• Mechanisms for acquiring and representing unstructured, ill-structured and uncertain knowledge, such as empirical knowledge stored in domain experts' brains, for example unstructured knowledge representation and brain informatics;
• Mechanisms for acquiring and representing expert thinking, such as imaginary thinking and creative thinking in group heuristic discussions;
• Mechanisms for acquiring and representing group/collective interaction behavior and impact emergence, such as behavior informatics and analytics;
• Mechanisms for modeling learning-of-learning, i.e., learning other participants' behavior which is the result of self-learning or ex-learning, such as learning evolution and intelligence emergence.
1.4 Conclusions
Mainstream data mining research focuses predominantly on the innovation of algorithms and tools, yet cares little for their workable capability in the real world. Consequently, data mining applications face a significant problem with the workability of deployed algorithms, tools and resulting deliverables. To fundamentally change this situation, and to empower the workable capability and performance of advanced data mining in real-world production and economy, there is an urgent need to develop next-generation data mining methodologies and techniques that target the paradigm shift from data-centered hidden pattern mining to domain-driven actionable knowledge discovery. The goal is to build KDD as an AKD-based problem-solving system.
Based on our experience in conducting large-scale data analysis for several domains, for instance, finance data mining and social security mining, we have proposed the Domain Driven Data Mining (D3M for short) methodology. D3M emphasizes the development of methodologies, techniques and tools for actionable knowledge discovery. It involves the relevant ubiquitous intelligence surrounding business problem-solving, such as human intelligence, domain intelligence, network intelligence and organizational/social intelligence, and the metasynthesis of such ubiquitous intelligence into a human-computer-cooperated closed problem-solving system.
Our current work includes theoretical studies and working case studies on a set of typical open issues in D3M. The results will appear in a monograph named Domain Driven Data Mining, which will be published by Springer in 2009.
Acknowledgements This work is sponsored in part by Australian Research Council Grants (DP0773412, LP0775041, DP0667060).
References
1. Ankerst, M.: Report on the SIGKDD-2002 Panel the Perfect Data Mining Tool: Interactive or Automated? ACM SIGKDD Explorations Newsletter, 4(2): 110-111, 2002.
2. Cao, L., Yu, P., Zhang, C., Zhao, Y., Williams, G.: DDDM2007: Domain Driven Data Mining, ACM SIGKDD Explorations Newsletter, 9(2): 84-86, 2007.
3. Cao, L., Zhang, C.: Knowledge Actionability: Satisfying Technical and Business Interestingness, International Journal of Business Intelligence and Data Mining, 2(4): 496-514, 2007.
4. Cao, L., Zhang, C.: The Evolution of KDD: Towards Domain-Driven Data Mining, International Journal of Pattern Recognition and Artificial Intelligence, 21(4): 677-692, 2007.
5. Cao, L.: Domain-Driven Actionable Knowledge Discovery, IEEE Intelligent Systems, 22(4): 78-89, 2007.
6. Cao, L., Dai, R., Zhou, M.: Metasynthesis, M-Space and M-Interaction for Open Complex Giant Systems, technical report, 2008.
7. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: From Data Mining to Knowledge Discovery in Databases, AI Magazine, 17(3): 37-54, 1996.
8. Fayyad, U., Piatetsky-Shapiro, G., Uthurusamy, R.: Summary from the KDD-03 Panel - Data Mining: The Next 10 Years, ACM SIGKDD Explorations Newsletter, 5(2): 191-196, 2003.
9. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edition, Morgan Kaufmann, 2006.
10. Qian, X.S., Yu, J.Y., Dai, R.W.: A New Scientific Field - Open Complex Giant Systems and the Methodology, Chinese Journal of Nature, 13(1): 3-10, 1990.
11. Qian, X.S. (Tsien H.S.): Revisiting issues on open complex giant systems, Pattern Recognition and Artificial Intelligence, 4(1): 5-8, 1991.
Post-processing Data Mining Models for Actionability
Qiang Yang
Abstract Data mining and machine learning algorithms are, for the most part, aimed at generating statistical models for decision making. These models are typically mathematical formulas or classification results on the test data. However, many of the output models do not themselves correspond to actions that can be executed. In this paper, we consider how to take the output of data mining algorithms as input, and produce collections of high-quality actions to perform in order to bring about the desired world states. This article gives an overview of three of our approaches in this actionable data mining framework: an algorithm that extracts actions from decision trees, a system that generates high-utility association rules, and an algorithm that can learn relational action models from frequent item sets for automatic planning. These problems and solutions highlight our novel computational framework for actionable data mining.
2.1 Introduction
In data mining and machine learning areas, much research has been done on constructing statistical models from the underlying data. These models include Bayesian probability models, decision trees, logistic and linear regression models, kernel machines and support vector machines, as well as clusters and association rules, to name a few [1,11]. Most of these techniques are what we refer to as predictive pattern-based models, in that they summarize the distributions of the training data in one way or another. Thus, they typically stop short of achieving the final objectives of data mining by maximizing utility when tested on the test data. The real action work is waiting to be done by humans, who read the patterns, interpret them and decide which ones to put into action.
Qiang Yang
Department of Computer Science and Engineering, Hong Kong University of Science and Technology, e-mail: [email protected]
In short, the predictive pattern-based models are aimed at human consumption, similar to what the World Wide Web (WWW) was originally designed for. However, similar to the movement from Web pages to XML pages, we also wish to see knowledge in the form of machine-executable patterns, which constitutes truly actionable knowledge.
In this paper, we consider how to take the output of data mining algorithms as input and produce collections of high-quality actions to perform in order to bring about the desired world states. We argue that data mining methods should not stop when a model is produced, but rather give collections of actions that can be executed either automatically or semi-automatically, to effect the final outcome of the system. The effect of the generated actions can be evaluated using the test data in a cross-validation manner. We argue that only in this way can a data mining system be truly considered as actionable.
In this paper, we consider three approaches that we have adopted in post-processing data mining models for generating actionable knowledge. We first consider in the next section how to postprocess association rules into action sets for direct marketing [14]. Then, we give an overview of a novel approach that extracts actions from decision trees in order to allow each test instance to fall in a desirable state (a detailed description is in [16]). We then describe an algorithm that can learn relational action models from frequent item sets for automatic planning [15].
2.2 Plan Mining for Class Transformation
2.2.1 Overview of Plan Mining
In this section, we first consider the following challenging problem: how to convert customers from a less desirable class to a highly desirable class. We give an overview of our approach to building an actionable plan from association mining results. More detailed algorithms and test results can be found in [14].
We start with a motivating example. A financial company might be interested in transforming some of its valuable customers from reluctant to active customers through a series of marketing actions. The objective is to find an unconditional sequence of actions, a plan, to transform as many individuals in a group as possible to a more desirable status. This problem is what we call the class-transformation problem. In this section, we describe a planning algorithm for the class-transformation problem that finds a sequence of actions that will transform an initial undesirable customer group (e.g., brand-hopping low spenders) into a desirable customer group (e.g., brand-loyal big spenders).
We consider a state as a group of customers with similar properties. We apply machine learning algorithms that take as input a database of individual customer profiles and their responses to past marketing actions, and produce the customer groups and the state space information, including the initial state and the next states after action executions. We have a set of actions with state-transition probabilities. At each state, we can identify whether we have arrived at a desired class through a classifier.
Suppose that a company is interested in marketing to a large group of customers in a financial market to promote a special loan sign-up. We start with a customer-loan database with historical customer information on past loan-marketing results in Table 2.1. Suppose that we are interested in building a 3-step plan to market to the selected group of customers in the new customer list. There are many candidate plans to consider in order to transform as many customers as possible from non-sign-up status to a sign-up one. The sign-up status corresponds to a positive class that we would like to move the customers to, and the non-sign-up status corresponds to the initial state of our customers. Our plan will choose not only low-cost actions, but also highly successful actions from past experience. For example, a candidate plan might be:
Step 1: Offer to reduce interest rate;
Step 2: Send flyer;
Step 3: Follow up with a home phone call.
Table 2.1 An example of Customer table
Customer   Interest Rate   Flyer   Salary   Signup
John       5%              Y       110K     Y
Mary       4%              N       30K      Y
...        ...             ...     ...      ...
Steve      8%              N       80K      N
This example introduces a number of interesting aspects of the problem at hand. We consider the input data source, which consists of customer information and their desirability class labels. In this database of customers, not all people should be considered as candidates for the class transformation, because for some people it is too costly or nearly impossible to convert them to the more desirable states. Our output plan is assumed to be an unconditional sequence of actions rather than a conditional plan. When these actions are executed in sequence, no intermediate state information is needed. This makes the group marketing problem fundamentally different from the direct marketing problem. In the former, the aim is to find a single sequence of actions with maximal chance of success without inserting if-branches in the plan. In contrast, for direct marketing problems, the aim is to find conditional plans such that the best decision is taken depending on the customers' intermediate state. Such problems are best suited to techniques such as Markov Decision Processes (MDP) [5, 10, 13].
2.2.2 Problem Formulation
To formulate the problem as a data mining problem, we first consider how to build a state space from a given set of customer records and a set of past plan traces. We have two datasets as input. As in any machine learning and data mining scheme, the input customer records consist of a set of attributes for each customer, along with a class attribute that describes the customer status. A second source of input is the previous plans recorded in a database. We also have the costs of actions. As an example, after a customer receives a promotional mail, the customer's response to the marketing action is obtained and recorded. As a result of the mailing, the action count for the customer in this marketing campaign is incremented by one, and the customer may have decided to respond by filling out a general information form and mailing it back to the bank. Table 2.2 shows an example of a plan trace table.
Table 2.2 A set of plan traces as input
Plan #   State0   Action0   State1   Action1   State2
Plan1    S0       A0        S1       A1        S5
Plan2    S0       A0        S1       A2        S5
Plan3    S0       A0        S1       A2        S6
Plan4    S0       A0        S1       A2        S7
Plan5    S0       A0        S2       A1        S6
Plan6    S0       A0        S2       A1        S8
Plan7    S0       A1        S3
Plan8    S0       A1        S4
2.2.3 From Association Rules to State Spaces
From the customer records, a state space can be constructed by piecing together the results of association rule mining [1]. Each state node corresponds to a state in planning, on which a classification model can be built to classify a customer falling into this state into either a positive (+) or a negative (-) class based on the training data. Between two states in this state space, an edge is defined as a state-action sequence which allows a probabilistic mapping from a state to a set of states. A cost is associated with each action.
To enable planning in this state space, we apply sequential association rule mining [1] to the plan traces. Each rule is of the form $S_1, a_1, a_2, \ldots \rightarrow S_n$, where each $a_i$ is an action, and $S_1$ and $S_n$ are the initial and end states for this sequence of actions. All actions in this rule start from $S_1$ and follow the order in the given sequence to result in $S_n$. By only keeping the sequential rules that have high enough support, we can get segments or paths that we can piece together to form a search space. In particular, in this space, we can gather the following information:
• $f_s(r_i) = s_j$ maps a customer record $r_i$ to a state $s_j$. This function is known as the customer-state mapping function. In our work, this function is obtained by applying odd-log ratio analysis [8] to perform feature selection in the customer database. Other methods such as Chi-squared methods or PCA can also be applied.
• $p(+|s)$ is the classification function, represented as a probability function. This function returns the conditional probability that state $s$ is in a desirable class. We call this function the state-classification function.
• $p(s_k|s_i, a_j)$ returns the transition probability that, after executing an action $a_j$ in state $s_i$, one ends up in state $s_k$.
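To make the transition function concrete, the sketch below estimates $p(s_k|s_i, a_j)$ from the plan traces of Table 2.2 by relative frequency. This counting scheme is an assumption made for illustration (the chapter obtains the state space from sequential association rules with sufficient support and does not spell out the estimator); the names `PLAN_TRACES` and `transition_probabilities` are hypothetical.

```python
from collections import Counter, defaultdict

# Plan traces transcribed from Table 2.2: states and actions alternate.
PLAN_TRACES = [
    ["S0", "A0", "S1", "A1", "S5"],
    ["S0", "A0", "S1", "A2", "S5"],
    ["S0", "A0", "S1", "A2", "S6"],
    ["S0", "A0", "S1", "A2", "S7"],
    ["S0", "A0", "S2", "A1", "S6"],
    ["S0", "A0", "S2", "A1", "S8"],
    ["S0", "A1", "S3"],
    ["S0", "A1", "S4"],
]


def transition_probabilities(traces):
    """Estimate p(next_state | state, action) as a relative frequency (an assumption)."""
    counts = defaultdict(Counter)
    for trace in traces:
        # Walk the trace as overlapping (state, action, next_state) triples.
        for i in range(0, len(trace) - 2, 2):
            state, action, next_state = trace[i], trace[i + 1], trace[i + 2]
            counts[(state, action)][next_state] += 1
    probabilities = {}
    for key, next_counts in counts.items():
        total = sum(next_counts.values())
        probabilities[key] = {s: c / total for s, c in next_counts.items()}
    return probabilities


probs = transition_probabilities(PLAN_TRACES)
for (state, action), outcome in sorted(probs.items()):
    print(state, action, outcome)
```

For instance, six traces apply A0 in S0, four of which reach S1 and two of which reach S2, so under this estimator the two outcomes receive probabilities 4/6 and 2/6.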
Once the customer records have been converted to states and state transitions, we are ready to consider the notion of a plan. To clarify matters, we describe the state space as an AND/OR graph. In this graph, there are two types of node. A state node represents a state. From each state node, an action links the state node to an outcome node, which represents the outcome of performing the action from the state. An outcome node then splits into multiple state nodes according to the probability distribution given by the $p(s_k|s_i, a_j)$ function. This AND/OR graph unwraps the original state space, where each state is an OR node and the actions that can be performed on the node form the OR branches. Each outcome node is an AND node, where the different arcs connecting the outcome node to the state nodes are the AND edges. Figure 2.1 is an example AND/OR graph. An example plan in this space is shown in Figure 2.2.
Fig. 2.1 An example of AND/OR graph
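The AND/OR structure can be sketched as a nested mapping from state nodes (OR nodes) to actions, and from each action's outcome node (AND node) to successor states with their probabilities. The code below is illustrative only: the transition numbers are made up, the function names are hypothetical, and the rule that a state stays unchanged when an action has no recorded outcome there is an added assumption.

```python
# Hypothetical transition table p(s_k | s_i, a_j); the numbers are illustrative only.
TRANSITIONS = {
    ("S0", "A0"): {"S1": 0.5, "S2": 0.5},
    ("S1", "A1"): {"S3": 1.0},
    ("S2", "A1"): {"S3": 0.5, "S4": 0.5},
}


def build_and_or_graph(transitions):
    """Map state (OR node) -> action -> outcome (AND node) -> {next_state: probability}."""
    graph = {}
    for (state, action), outcome in transitions.items():
        graph.setdefault(state, {})[action] = dict(outcome)
    return graph


def follow_plan(graph, start, plan):
    """Propagate probability mass through the graph for an unconditional plan."""
    distribution = {start: 1.0}
    for action in plan:
        next_distribution = {}
        for state, mass in distribution.items():
            # Assumption: an action with no recorded outcome leaves the state unchanged.
            outcome = graph.get(state, {}).get(action, {state: 1.0})
            for next_state, p in outcome.items():
                next_distribution[next_state] = (
                    next_distribution.get(next_state, 0.0) + mass * p
                )
        distribution = next_distribution
    return distribution


graph = build_and_or_graph(TRANSITIONS)
print(follow_plan(graph, "S0", ["A0", "A1"]))
```

Because the plan is unconditional, the same action is applied in every state reached at a given step, so the result is a single distribution over end states rather than a branching policy.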
We define the utility $U(s, P)$ of the plan $P = a_1 a_2 \ldots a_n$ from an initial state $s$ as follows. Let $P'$ be the subplan of $P$ after taking out the first action $a_1$; that is, $P = a_1 P'$. Let $S$ be a set of states. Then the utility of the plan $P$ is defined recursively