1 3

(1)

(2)

Data Mining for

Business Applications

Edited by

Longbing Cao Philip S. Yu Chengqi Zhang Huaifeng Zhang

1 3

(3)

Longbing Cao Philip S.Yu

School of Software Department of Computer Science Faculty of Engineering and University of Illinois at Chicago Information Technology 851 S. Morgan St.

University of Technology, Sydney Chicago, IL 60607

PO Box 123 [email protected]

Broadway NSW 2007, Australia [email protected]

Chengqi Zhang Huaifeng Zhang

Centre for Quantum Computation and School of Software Intelligent Systems Faculty of Engineering and Faculty of Engineering and Information Technology Information Technology University of Technology, Sydney University of Technology, Sydney PO Box 123

PO Box 123 Broadway NSW 2007, Australia Broadway NSW 2007, Australia [email protected]

[email protected]

ISBN: 978-0-387-79419-8 e-ISBN: 978-0-387-79420-4 DOI: 10.1007/978-0-387-79420-4

Library of Congress Control Number: 2008933446

¤ 2009 Springer Science+Business Media, LLC

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper springer.com

(4)

Preface

This edited book,Data Mining for Business Applications, together with an up- coming monograph also by Springer,Domain Driven Data Mining, aims to present a full picture of the state-of-the-art research and development ofactionable knowl- edge discovery(AKD) in real-world businesses and applications.

The book is triggered by ubiquitous applications of data mining and knowledge discovery (KDD for short), and the real-world challenges and complexities to the current KDD methodologies and techniques. As we have seen, and as is often ad- dressed by panelists of SIGKDD and ICDM conferences, even though thousands of algorithms and methods have been published, very few of them have been validated in business use.

A major reason for the above situation, we believe, is the gap between academia and businesses, and the gap between academic research and real business needs.

Ubiquitous challenges and complexities from the real-world complex problems can be categorized by the involvement of six types of intelligence (6Is), namelyhuman roles and intelligence,domain knowledge and intelligence,network and web intel- ligence,organizational and social intelligence,in-depth data intelligence, and most importantly, themetasynthesis of the above intelligences.

It is certainly not our ambition to cover everything of the 6Isin this book. Rather, this edited book features the latest methodological, technical and practical progress on promoting the successful use of data mining in a collection of business domains.

The book consists of two parts, one on AKD methodologies and the other on novel AKD domains in business use.

In Part I, the book reports attempts and efforts in developing domain-driven workable AKD methodologies. This includes domain-driven data mining, post- processing rules for actions, domain-driven customer analytics, roles of human intelligence in AKD, maximal pattern-based cluster, and ontology mining.

Part II selects a large number of novel KDD domains and the corresponding techniques. This involves great efforts to develop effective techniques and tools for emergent areas and domains, including mining social security data, community security data, gene sequences, mental health information, traditional Chinese medicine data, cancer related data, blog data, sentiment information, web data, procedures,

v

(5)

moving object trajectories, land use mapping, higher education, ﬂight scheduling, and algorithmic asset management.

The intended audience of this book will mainly consist of researchers, research students and practitioners in data mining and knowledge discovery. The book is also of interest to researchers and industrial practitioners in areas such as knowledge engineering, human-computer interaction, artiﬁcial intelligence, intelligent information processing, decision support systems, knowledge management, and AKD project management.

Readers who are interested in actionable knowledge discovery in the real world, please also refer to our monograph:Domain Driven Data Mining, which has been scheduled to be published by Springer in 2009. The monograph will present our research outcomes on theoretical and technical issues in real-world actionable knowledge discovery, as well as working examples in ﬁnancial data mining and social security mining.

We would like to convey our appreciation to all contributors including the ac- cepted chapters’ authors, and many other participants who submitted their chapters that cannot be included in the book due to space limits. Our special thanks to Ms.

Melissa Fearon and Ms. Valerie Schoﬁeld from Springer US for their kind support and great efforts in bringing the book to fruition. In addition, we also appreciate all reviewers, and Ms. Shanshan Wu’s assistance in formatting the book.

Longbing Cao, Philip S.Yu, Chengqi Zhang, Huaifeng Zhang July 2008

(6)

List of Contributors

Longbing Cao

School of Software, University of Technology Sydney, Australia, e-mail:

[email protected] Qiang Yang

Department of Computer Science and Engineering, Hong Kong University of Science and Technology, e-mail:[email protected]

Jian Pei

Simon Fraser University, e-mail:[email protected] Xiaoling Zhang

Boston University, e-mail:[email protected] Moonjung Cho

Prism Health Networks, e-mail:[email protected] Haixun Wang

IBM T.J.Watson Research Center e-mail:[email protected] Philip S.Yu

University of Illinois at Chicago, e-mail:[email protected] Sumana Sharma

Virginia Commonwealth University, e-mail:[email protected] Kweku-Muata Osei-Bryson

Virginia Commonwealth University, e-mail:[email protected] Yuefeng Li

Information Technology, Queensland University of Technology, Australia, e-mail:

[email protected]

xv

(15)

[email protected] Yanchang Zhao

Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia, e-mail:[email protected]

Huaifeng Zhang

Yuming Ou

Chengqi Zhang

Hans Bohlscheid

Data Mining Section, Business Integrity Programs Branch, Centrelink, Australia, e-mail:[email protected]

Clifton Phua

A*STAR, Institute of Infocomm Research, Room 04-21 (+6568748406), 21, Heng Mui Keng Terrace, Singapore 119613, e-mail:[email protected] Mafruz Ashraﬁ

A*STAR, Institute of Infocomm Research, Room 04-21 (+6568748406), 21, Heng Yun Xiong

Department of Computing and Information Technology, Fudan University, Shanghai 200433, China, e-mail:[email protected]

Ming Chen

Yangyong Zhu

Maja Hadzic

Digital Ecosystems and Business Intelligence Institute (DEBII), Curtin University of Technology, Australia, e-mail:[email protected]

Fedja Hadzic

Xiaohui Tao

Information Technology, Queensland University of Technology, Australia, e-mail:

Mui Keng Terrace, Singapore 119613, e-mail:[email protected]

(16)

Jackei H.K. Wong

Department of Computing, Hong Kong Polytechnic University, Hong Kong SAR, e-mail:[email protected]

Allan K.Y. Wong

Wilfred W.K. Lin

Franco A. Ubaudi

Faculty of IT, University of Technology, Sydney, e-mail:[email protected].

edu.au Paul J. Kennedy

Faculty of IT, University of Technology, Sydney, e-mail:[email protected].

au

Daniel R. Catchpoole

Tumour Bank, The Childrens Hospital at Westmead, e-mail:DanielC@chw.

edu.au Dachuan Guo

Tumour Bank, The Childrens Hospital at Westmead, e-mail:dachuang@chw.

edu.au Simeon J. Simoff

University of Western Sydney, e-mail:[email protected] Flora S. Tsai

Nanyang Technological University, Singapore, e-mail:[email protected] Kap Luk Chan

Nanyang Technological University, Singapore e-mail:[email protected] Yang Liu

Department of Computer Science and Engineering, York University, Toronto, ON, Canada M3J 1P3, e-mail:[email protected]

Xiaohui Yu

School of Information Technology, York University, Toronto, ON, Canada M3J 1P3, e-mail:[email protected]

Xiangji Huang

School of Information Technology, York University, Toronto, ON, Canada M3J 1P3, e-mail:[email protected]

Tharam S. Dillon

(17)

Department of Computer Science and Engineering, York University, Toronto, ON, Canada M3J 1P3, e-mail:[email protected]

Zhongzhi Shi

Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, No. 6 Kexueyuan Nanlu, Beijing 100080, People’s Republic of China, e-mail:[email protected] Huifang Ma

Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, No. 6 Kexueyuan Nanlu, Beijing 100080, People’s Republic of China,e-mail:[email protected] Qing He

Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, No. 6 Kexueyuan Nanlu, Beijing 100080, People’s Republic of China, e-mail:[email protected] T. Werth

Programming Systems Group, Computer Science Department, University of Erlangen–Nuremberg, Germany, phone: +49 9131 85-28865, e-mail:

[email protected] M. Wörlein

[email protected] A. Dreweke

[email protected] M. Philippsen

[email protected] I. Fischer

Nycomed Chair for Bioinformatics and Information Mining, University of Konstanz, Germany, phone: +49 7531 88-5016, e-mail:Ingrid.Fischer@

inf.uni-konstanz.de Vania Bogorny

Instituto de Informatica, Universidade Federal do Rio Grande do Sul (UFRGS), Av. Bento Gonalves, 9500 - Campus do Vale - Bloco IV, Bairro Agronomia - Porto Alegre - RS -Brasil, CEP 91501-970 Caixa Postal: 15064, e-mail:

[email protected] Aijun An

(18)

ETSI Topograﬁa, Geodesia y Cartografa, Universidad Politecnica de Madrid, KM 7,5 de la Autovia de Valencia, E-28031 Madrid - Spain, e-mail:m.wachowicz@

topografia.upm.es E.Roma Neto

Av. Eng. Euséio Stevaux, 823 - 04696-000, São Paulo, SP, Brazil, e-mail:

[email protected] D. S. Hamburger

Av. Eng. Euséio Stevaux, 823 - 04696-000, São Paulo, SP, Brazil, e-mail:

[email protected] Gürdal Ertek

Sabancı University, Faculty of Engineering and Natural Sciences, Orhanlı, Tuzla, 34956, Istanbul, Turkey, e-mail:[email protected]

Ira Assent

Data Management and Exploration Group, RWTH Aachen University, Germany, phone: +492418021910, e-mail:[email protected]

Ralph Krieger

Thomas Seidl

Petra Welter

Dept. of Medical Informatics, RWTH Aachen University, Germany, e-mail:

[email protected] Jörg Herbers

INFORM GmbH, Pascalstraße 23, Aachen, Germany, e-mail:joerg.herbers@

inform-ac.com Giovanni Montana

Imperial College London, Department of Mathematics, 180 Queen’s Gate, London SW7 2AZ, UK, e-mail:[email protected]

Francesco Parrella

Imperial College London, Department of Mathematics, 180 Queen’s Gate, London SW7 2AZ, UK, e-mail:[email protected]

Monica Wachowicz

(19)

Introduction to Domain Driven Data Mining

Longbing Cao

AbstractThe mainstream data mining faces critical challenges and lacks of soft power in solving real-world complex problems when deployed. Following the paradigm shift from ‘data mining’ to ‘knowledge discovery’, we believe much more thorough efforts are essential for promoting the wide acceptance and employment of knowledge discovery in real-world smart decision making. To this end, we expect a new paradigm shift from ‘data-centered knowledge discovery’ to ‘domain-driven actionable knowledge discovery’. In the domain-driven actionable knowledge discovery, ubiquitous intelligence must be involved and meta-synthesized into the mining process, and an actionable knowledge discovery-based problem-solving system is formed as the space for data mining. This is the motivation and aim of developing Domain Driven Data Mining(D³Mfor short). This chapter briefs the main reasons, ideas and open issues inD³M.

1.1 Why Domain Driven Data Mining

Data mining and knowledge discovery (data mining or KDD for short) [9] has emerged to be one of the most vivacious areas in information technology in the last decade. It has boosted a major academic and industrial campaign crossing many traditional areas such as machine learning, database, statistics, as well as emergent disciplines, for example, bioinformatics. As a result, KDD has published thousands of algorithms and methods, as widely seen in regular conferences and workshops crossing international, regional and national levels.

Compared with the booming fact in academia, data mining applications in the real world has not been as active, vivacious and charming as that of academic research. This can be easily found from the extremely imbalanced numbers of pub- Longbing Cao

School of Software, University of Technology Sydney, Australia, e-mail:[email protected].

au

3

(20)

lished algorithms versus those really workable in the business environment. That is to say, there is a big gap between academic objectives and business goals, and between academic outputs and business expectations. However, this runs in the op- posite direction of KDD’s original intention and its nature. It is also against the value of KDD as a discipline, which generates the power of enabling smart businesses and developing business intelligence for smart decisions in production and living environment.

If we scrutinize the reasons of the existing gaps, we probably can point out many things. For instance, academic researchers do not really know the needs of business people, and are not familiar with the business environment. With many years of development of this promising scientiﬁc ﬁeld, it is time and worthwhile to review the major issues blocking the step of KDD into business use widely.

While after the origin ofdata mining, researchers with strong industrial engage- ment realized the need from ‘data mining’ to ‘knowledge discovery’ [1, 7, 8] to deliver useful knowledge for the business decision-making . Many researchers, in particular early career researchers in KDD, are still only or mainly focusing on

‘data mining’, namely mining for patterns in data. The main reason for such a dom- inant situation, either explicitly or implicitly, is on its originally narrow focus and overemphasized by innovative algorithm-driven research (unfortunately we are not at the stage of holding as many effective algorithms as we need in the real world applications).

Knowledge discovery is further expected to migrate intoactionable knowledge discovery (AKD) . AKD targets knowledge that can be delivered in the form of business-friendly and decision-making actions, and can be taken over by business people seamlessly. However, AKD is still a big challenge to the current KDD research and development. Reasons surrounding the challenge of AKD include many critical aspects on both macro-level and micro-level.

On the macro-level, issues are related to methodological and fundamental aspects, for instance,

• An intrinsic difference existing in academic thinking and business deliverable expectation; for example, researchers usually are interested in innovative pattern types, while practitioners care about getting a problem solved;

• The paradigm of KDD, whether as a hidden pattern mining process centered by data, or an AKD-based problem-solving system ; the latter emphasizes not only innovation but also impact of KDD deliverables.

The micro-level issues are more related to technical and engineering aspects, for instance,

• If KDD is an AKD-based problem-solving system, we then need to care about many issues such as system dynamics, system environment, and interaction in a system;

• If AKD is the target, we then have to cater for real-world aspects such as business processes, organizational factors, and constraints.

In scrutinizing both macro-level and micro-level of issues in AKD, we propose a new KDD methodology on top of the traditional data-centered pattern mining

(21)

framework , that isDomain Driven Data Mining(D³M) [2,4,5]. In the next section, we introduce the main idea ofD³M.

1.2 What Is Domain Driven Data Mining 1.2.1 Basic Ideas

The motivation ofD³Mis to view KDD as AKD-based problem-solving systems through developing effective methodologies, methods and tools. The aim ofD³M is to make AKD system deliver business-friendly and decision-making rules and actions that are of solid technical signiﬁcance as well. To this end,D³Mcaters for the effective involvement of the following ubiquitous intelligence surrounding AKD- based problem-solving.

• Data Intelligence, tells stories hidden in the data about a business problem.

• Domain Intelligence, refers to domain resources that not only wrap a problem and its target data but also assist in the understanding and problem-solving of the problem. Domain intelligence consists of qualitative and quantitative intelligence. Both types of intelligence are instantiated in terms of aspects such as domain knowledge, background information, constraints, organization factors and business process, as well as environment intelligence, business expectation and interestingness.

• Network Intelligence, refers to both web intelligence and broad-based network intelligence such as distributed information and resources, linkages, searching, and structured information from textual data.

• Human Intelligence, refers to (1) explicit or direct involvement of humans such as empirical knowledge, belief, intention and expectation, run-time supervision, evaluating, and expert group; (2) implicit or indirect involvement of human intelligence such as imaginary thinking, emotional intelligence, inspiration, brain- storm, and reasoning inputs.

• Social Intelligence , consists of interpersonal intelligence, emotional intelligence, social cognition, consensus construction, group decision, as well as organizational factors, business process, workﬂow, project management and deliv- ery, social network intelligence, collective interaction, business rules, law, trust and so on.

• Intelligence Metasynthesis , the above ubiquitous intelligence has to be com- bined for the problem-solving. The methodology for combining such intelligence is calledmetasynthesis[10, 11], which provides a human-centered and human-machine-cooperated problem-solving process by involving, synthesiz- ing and using ubiquitous intelligence surrounding AKD as need for problem- solving.

(22)

1.2.2 D

³

M for Actionable Knowledge Discovery

Real-world data mining is a complex problem-solving system. From the view of systems and microeconomy, the endogenous character of actionable knowledge discovery (AKD) determines that it is an optimization problem with certain objectives in a particular environment. We present a formal deﬁnition of AKD in this section.

We ﬁrst deﬁne several notions as follows.

LetDBbe a database collected from business problems (Ψ^),^X⁼{x₁,x₂,···, x_L} be the set of items in theDB, wherex_l (l=1,...,L) be an itemset, and the number of attributes (v) inDBbeS. SupposeE={e₁,e₂,···,e_K}denotes the environment set, wheree_k represents a particular environment setting for AKD. Fur- ther, let M ={m1,m2,···,m_N} be the data mining method set, where m_n (n= 1,...,N) is a method. For the methodmn, suppose its identiﬁed pattern setP^mⁿ = {p^m₁ⁿ,p^m₂ⁿ,···,p_U^mⁿ}includes all patterns discovered inDB, wherep^m_uⁿ(u=1,...,U) denotes a pattern discovered by the methodm_n.

In the real world, data mining is a problem-solving process from business problems (Ψ, with problem statusτ) to problem-solving solutions (Φ^):

Ψ→Φ ^(1.1)

From the modeling perspective, such a problem-solving process is a state transformation process from source dataDB(Ψ→DB) to resulting pattern setP(Φ→P).

Ψ→Φ^::^DB(v1,...,v_S)→P(f₁,...,f_Q) (1.2)

wherev_s(s=1,...,S) are attributes in the source dataDB, while f_q(q=1,...,Q) are features used for mining the pattern setP.

Deﬁnition 1.1.(Actionable Patterns)

LetP={p˜₁,p˜₂,···,p˜_Z}be anActionable Pattern Setmined by methodm_nfor the given problemΨ (its data set isDB), in which each pattern ˜p_zisactionablefor the problem-solving if it satisﬁes the following conditions:

1.a.t_i(p˜_z)≥t_i,0; indicating the pattern ˜p_zsatisfying technical interestingnesst_iwith thresholdt_i,0;

1.b.b_i(p˜_z)≥b_i,0; indicating the pattern ˜p_zsatisfying business interestingnessb_iwith thresholdb_i,0;

1.c.R:τ¹^A^,−→^mⁿ⁽^p^˜^z⁾τ²; the pattern can support business problem-solving (R) by taking actionA, and correspondingly transform the problem status from initially nonoptimal stateτ1to greatly improved stateτ2.

Therefore, the discovery of actionable knowledge (AKD) on data setDBis an iterative optimization process toward the actionable pattern setP.

AKD:DBê,τ,m−→¹P₁ê,τ,m−→²P₂···ê−→^,τ,^mⁿP (1.3)

(23)

Deﬁnition 1.2.(Actionable Knowledge Discovery)

TheActionable Knowledge Discovery(AKD) is the procedure to ﬁnd theActionable Pattern SetPthrough employing all valid methodsM. Its mathematical description is as follows:

AKD^mⁱ^∈M−→Op∈PInt(p), (1.4) whereP=P^m¹U P^m²,···,U P^mⁿ,Int(.)is the evaluation function,O(.)is the optimization function to extract those ˜p∈PwhereInt(p)˜ can beat a given benchmark.

For a patternp,Int(p)can be further measured in terms oftechnical interesting- ness(ti(p)) andbusiness interestingness(bi(p)) [3].

Int(p) =I(t_i(p),b_i(p)) (1.5) whereI(.)is the function for aggregating the contributions of all particular aspects of interestingness.

Further,Int(p)can be described in terms ofobjective(o)andsubjective(s)factors from bothtechnical(t)andbusiness(b)perspectives.

Int(p) =I(to(),t_s(),b_o(),b_s()) (1.6) wheret_o()is objective technical interestingness,t_s()is subjective technical interestingness,b_o()is objective business interestingness, andb_s()is subjective business interestingness.

We saypis trulyactionable(i.e.,p) both to academia and business if it satisﬁes the following condition:

Int(p) =t_o(x,p) ∧t_s(x,p) ∧b_o(x,p)∧b_s(x,p) (1.7) whereI→‘∧indicates the ‘aggregation’ of the interestingness.

In general,t_o(),t_s(),b_o()andb_s()of practical applications can be regarded as independent of each other. With their normalization (expressed by ˆ), we can get the following:

Int(p)→I(ˆtˆ_o(),tˆ_s(),bˆ_o(),bˆ_s())

=α^t^ˆo() +β^t^ˆs() +γ^b^ˆo() +δ^b^ˆs() (1.8) So, the AKD optimization problem can be expressed as follows:

AKD^e,τ,m∈M−→O_p∈P(Int(p))

→O(α^t^ˆo()) +O(β^t^ˆs()) +

O(γ^b^ˆo()) +O(δ^b^ˆs()) (1.9)

Deﬁnition 1.3.(Actionability of a Pattern)

Theactionabilityof a patternpis measured byact(p):

(24)

act(p) =O_p∈P(Int(p))

→O(α^t^ˆo(p)) +O(β^t^ˆs(p)) + O(γ^b^ˆo(p)) +O(δ^b^ˆs(p))

→t_oâct+t_sâct+bâct_o +bâct_s

→t_i^act+b^act_i (1.10)

wheret_oâct,t_sâct,bâct_o andbâct_s measure the respective actionable performance in terms of each interestingness element.

Due to the inconsistency often existing at different aspects, we often find the identified patterns only fitting in one of the following sub-sets:

Int(p)→ {{t_iâct,bâct_i },{¬t_iâct,bâct_i },

{t_iâct,¬bâct_i },{¬t_iâct,¬bâct_i }} (1.11) where ’¬’ indicates the corresponding element is not satisfactory.

Ideally, we look for actionable patternspthat can satisfy the following:

IF

∀p∈P,∃x:t_o(x,p)∧t_s(x,p)∧b_o(x,p)

∧b_s(x,p)→act(p) (1.12) THEN:

p→p. (1.13)

However, in real-world mining, as we know, it is very challenging to find the most actionable patterns that are associated with both ‘optimal’t_iâct andbâct_i . Quite often a pattern with significantt_i()is associated with unconfidentb_i(). Contrarily, it is not rare that patterns with lowti()are associated with confidentbi(). Clearly, AKD targets patterns confirming the relationship{t_iâct,bâct_i }.

Therefore, it is necessary to deal with such possible conﬂict and uncertainty amongst respective interestingness elements. However, it is a kind of artwork and needs to involve domain knowledge and domain experts to tune thresholds and balance difference betweent_i() andb_i(). Another issue is to develop techniques to balance and combine all types of interestingness metrics to generate uniform, bal- anced and interpretable mechanisms for measuring knowledge deliverability and ex- tracting and selecting resulting patterns. A reasonable way is to balance both sides toward an acceptable tradeoff. To this end, we need to develop interestingness aggregation methods, namely theI−f unction(or ‘∧‘) to aggregate all elements of interestingness. In fact, each of the interestingness categories may be instantiated into more than one metric. There could be several methods of doing the aggregation, for instance, empirical methods such as business expert-based voting, or more quantitative methods such as multi-objective optimization methods.

(25)

1.3 Open Issues and Prospects

solving systems, many research issues need to be studied or revisited.

• Typical research issues and techniques inData Intelligenceinclude mining in- depth data patterns, and mining structured knowledge in unstructured data.

• Typical research issues and techniques inDomain Intelligenceconsist of representation, modeling and involvement of domain knowledge, constraints, organizational factors, and business interestingness.

• Typical research issues and techniques inNetwork Intelligenceinclude information retrieval, text mining, web mining, semantic web, ontological engineering techniques, and web knowledge management.

• Typical research issues and techniques inHuman Intelligenceinclude human- machine interaction, representation and involvement of empirical and implicit knowledge.

• Typical research issues and techniques inSocial Intelligenceinclude collective intelligence, social network analysis, and social cognition interaction.

• Typical issues in intelligence metasynthesisconsist of building metasynthetic interaction (m-interaction) as working mechanism, and metasynthetic space (m- space) as an AKD-based problem-solving system [6].

Typical issues in actionable knowledge discovery through m-spaces consist of

• Mechanisms for acquiring and representing unstructured and ill-structured, un- certain knowledge such as empirical knowledge stored in domain experts’

brains, such as unstructured knowledge representation and brain informatics;

• Mechanisms for acquiring and representing expert thinking such as imaginary thinking and creative thinking in group heuristic discussions;

• Mechanisms for acquiring and representing group/collective interaction behavior and impact emergence, such as behavior informatics and analytics;

• Mechanisms for modeling learning-of-learning, i.e., learning other participants’

behavior which is the result of self-learning or ex-learning, such as learning evolution and intelligence emergence.

1.4 Conclusions

The mainstream data mining research features its dominating focus on the innovation of algorithms and tools yet caring little for their workable capability in the real world. Consequently, data mining applications face signiﬁcant problem of the workability of deployed algorithms, tools and resulting deliverables. To fundamentally change such situations, and empower the workable capability and performance of advanced data mining in real-world production and economy, there is an urgent need to develop next-generation data mining methodologies and techniques To effectively synthesize the above ubiquitous intelligence in AKD-based problem-

(26)

that target the paradigm shift from data-centered hidden pattern mining to domain- driven actionable knowledge discovery. Its goal is to build KDD as an AKD-based problem-solving system.

Based on our experience in conducting large-scale data analysis for several domains, for instance, ﬁnance data mining and social security mining, we have pro- posed theDomain Driven Data Mining(D³M for short) methodology.D³M emphasizes the development of methodologies, techniques and tools foractionable knowledge discovery. It involves relevantly ubiquitous intelligence surrounding the business problem-solving, such as human intelligence, domain intelligence, network intelligence and organizational/social intelligence, and the meta-synthesis of such ubiquitous intelligence into a human-computer-cooperated closed problem-solving system.

Our current work includes an attempt on theoretical studies and working case studies on a set of typically open issues inD³M. The results will come into a mono- graph namedDomain Driven Data Mining, which will be published by Springer in 2009.

Acknowledgements This work is sponsored in part by Australian Research Council Grants (DP0773412, LP0775041, DP0667060).

References

1. Ankerst, M.: Report on the SIGKDD-2002 Panel the Perfect Pata Mining Tool: Interactive or Automated? ACM SIGKDD Explorations Newsletter, 4(2):110-111, 2002.

2. Cao, L., Yu, P., Zhang, C., Zhao, Y., Williams, G.: DDDM2007: Domain Driven Data Mining, ACM SIGKDD Explorations Newsletter, 9(2): 84-86, 2007.

3. Cao, L., Zhang, C.: Knowledge Actionability: Satisfying Technical and Business Interesting- ness, International Journal of Business Intelligence and Data Mining, 2(4): 496-514, 2007.

4. Cao, L., Zhang, C.: The Evolution of KDD: Towards Domain-Driven Data Mining, Interna- tional Journal of Pattern Recognition and Artiﬁcial Intelligence, 21(4): 677-692, 2007.

5. Cao, L.: Domain-Driven Actionable Knowledge Discovery, IEEE Intelligent Systems, 22(4):

78-89, 2007.

6. Cao, L., Dai, R., Zhou, M.: Metasynthesis, M-Space and M-Interaction for Open Complex Giant Systems, technical report, 2008.

7. Fayyad, U., Shapiro, G., Smyth, P.: From Data Mining to Knowledge Discovery in Databases, AI Magazine, 37-54, 1996.

8. Fayyad, U., Shapiro, G., Uthurusamy, R.: Summary from the KDD-03 Panel - Data mining:

The Next 10 Years, ACM SIGKDD Explorations Newsletter, 5(2): 191-196, 2003.

9. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edition, Morgan Kauf- mann, 2006.

10. Qian, X.S., Yu, J.Y., Dai, R.W.: A New Scientiﬁc Field–Open Complex Giant Systems and the Methodology, Chinese Journal of Nature, 13(1) 3-10, 1990.

11. Qian, X.S. (Tsien H.S.): Revisiting issues on open complex giant systems, Pattern Recogni- tion and Artiﬁcial Intelligence, 4(1): 5-8, 1991.

(27)

Post-processing Data Mining Models for Actionability

Qiang Yang

AbstractData mining and machine learning algorithms are, in the most part, aimed at generating statistical models for decision making. These models are typically mathematical formulas or classiﬁcation results on the test data. However, many of the output models do not themselves correspond to actions that can be executed.

In this paper, we consider how to take the output of data mining algorithms as input, and produce collections of high-quality actions to perform in order to bring out the desired world states. This article gives an overview on two of our approaches in this actionable data mining framework, including an algorithm that extracts actions from decision trees and a system that generates high-utility association rules and an algorithm that can learn relational action models from frequent item sets for automatic planning. These two problems and solutions highlight our novel compu- tational framework for actionable data mining.

2.1 Introduction

In data mining and machine learning areas, much research has been done on constructing statistical models from the underlying data. These models include Bayesian probability models, decision trees, logistic and linear regression models, kernel machines and support vector machines as well as clusters and association rules, to name a few [1,11]. Most of these techniques are what we refer to as predictive pattern-based models, in that they summarize the distributions of the training data in one way or another. Thus, they typically stop short of achieving the ﬁnal objectives of data mining by maximizing utility when tested on the test data. The real action work is waiting to be done by humans, who read the patterns, interpret them and decide which ones to select to put into actions.

Qiang Yang

Department of Computer Science and Engineering, Hong Kong University of Science and Tech- nology, e-mail:[email protected]

11

(28)

In short, the predictive pattern-based models are aimed for human consumption, similar to what the World Wide Web (WWW) was originally designed for. However, similar to the movement from Web pages to XML pages, we also wish to see knowledge in the form of machine-executable patterns, which constitutes truly actionable knowledge.

In this paper, we consider how to take the output of data mining algorithms as input and produce collections of high-quality actions to perform in order to bring out the desired world states. We argue that the data mining methods should not stop when a model is produced, but rather give collections of actions that can be executed either automatically or semi-automatically, to effect the ﬁnal outcome of the system.

The effect of the generated actions can be evaluated using the test data in a cross- validation manner. We argue that only in this way can a data mining system be truly considered asactionable.

In this paper, we consider three approaches that we have adopted in post- processing data mining models for generation actionable knowledge . We ﬁrst consider in the next section how to postprocess association rules into action sets for direct marketing [14]. Then, we give an overview of a novel approach that extracts actions from decision trees in order to allow each test instance to fall in a desirable state (a detailed description is in [16]). We then describe an algorithm that can learn relational action models from frequent item sets for automatic planning [15].

2.2 Plan Mining for Class Transformation 2.2.1 Overview of Plan Mining

In this section, we ﬁrst consider the following challenging problem: how to convert customers from a less desirable class to a highly desirable class. In this section, we give an overview of our approach in building an actionable plan from association mining results. More detailed algorithms and test results can be found in [14].

We start with a motivating example. A financial company might be interested in transforming some of the valuable customers from reluctant to active customers through a series of marketing actions. The objective is find an unconditional sequence of actions, a plan, to transform as many from a group of individuals as possible to a more desirable status. This problem is what we call the class- transformation problem. In this section, we describe a planning algorithm for the class-transformation problem that finds a sequence of actions that will transform an initialundesirablecustomer group (e.g., brand-hopping low spenders) into adesir- ablecustomer group (e.g., brand-loyal big spenders).

We consider a state as a group of customers with similar properties. We apply machine learning algorithms that take as input a database of individual customer proﬁles and their responses to past marketing actions and produce the customer groups and the state space information including initial state and the next states

(29)

after action executions. We have a set of actions with state-transition probabilities.

At each state, we can identify whether we have arrived at adesired classthrough a classiﬁer.

Suppose that a company is interested in marketing to a large group of customers in a ﬁnancial market to promote a special loan sign-up. We start with a customer- loan database with historical customer information on past loan-marketing results in Table 2.1. Suppose that we are interested in building a 3-step plan to market to the selected group of customers in the new customer list. There are many candidate plans to consider in order to transform as many customers as possible from non- sign-up status to a sign-up one. The sign-up status corresponds to a positive class that we would like to move the customers to, and the non-signup status corresponds to the initial state of our customers. Our plan will choose not only low-cost actions, but also highly successful actions from the past experience. For example, a candidate plan might be:

Step 1: Offer to reduce interest rate;

Step 2: Send ﬂyer;

Step 3: Follow up with a home phone call.

Table 2.1 An example ofCustomertable

Customer Interest Rate Flyer Salary Signup

John 5% Y 110K Y

Mary 4% N 30K Y

... ... ... ... ...

Steve 8% N 80K N

This example introduces a number of interesting aspects for the problem at hand.

We consider the input data source, which consists of customer information and their desirability class labels. In this database of customers, not all people should be considered as candidates for the class transformation, because for some people it is too costly or nearly impossible to convert them to the more desirable states. Our output plan is assumed to be anunconditionalsequence of actions rather than conditional plans. When these actions are executed in sequence, no intermediate state information is needed. This makes thegroup marketing problem fundamentally different from the direct marketing problem. In the former, the aim is to ﬁnd a single sequence of actions with maximal chance of success without inserting if-branches in the plan. In contrast, for direct marketing problems, the aim is to ﬁnd conditional plans such that a best decision is taken depending on the customers’ intermediate state. These are best suited for techniques such as the Markov Decision Processes (MDP) [5, 10, 13].

(30)

2.2.2 Problem Formulation

To formulate the problem as a data mining problem, we ﬁrst consider how to build a state space from a given set of customer records and a set of plan traces in the past. We have two datasets as input. As in any machine learning and data mining schemes, the input customer records consist of a set of attributes for each customer, along with a class attribute that describes the customer status. A second source of input is the previous plans recorded in a database. We also have the costs of actions. As an example, after a customer receives a promotional mail, the customer’s response to the marketing action is obtained and recorded. As a result of the mailing, the action count for the customer in this marketing campaign is incremented by one, and the customer may have decided to respond by ﬁlling out a general information form and mailing it back to the bank. Table 2.2 shows an example ofplan trace table.

Table 2.2 A set of plan traces as input

Plan # State0 Action0 State1 Action1 State2 Plan₁ S₀ A₀ S₁ A₁ S₅ Plan₂ S₀ A₀ S₁ A₂ S₅ Plan₃ S₀ A₀ S₁ A₂ S₆ Plan₄ S₀ A₀ S₁ A₂ S₇ Plan₅ S₀ A₀ S₂ A₁ S₆ Plan₆ S₀ A₀ S₂ A₁ S₈ Plan₇ S₀ A₁ S₃

Plan₈ S₀ A₁ S₄

2.2.3 From Association Rules to State Spaces

From the customer records, a can be constructed by piecing together the association rule mining [1]. Each state node corresponds to a state in planning, on which a classiﬁcation model can be built to classify a customer falling onto this state into either a positive (+) or a negative (-) class based on the training data. Between two states in this state space, an edge is deﬁned as a state-action sequence which allows a probabilistic mapping from a state to a set of states. A cost is associated with each action.

To enable planning in this state space, we apply sequential association rule mining [1] to the plan traces. Each rule is of the form:S₁,a₁,a₂,...,→S_n, where each a_i is an action,S₁andS_nare the initial and end states for this sequence of actions.

All actions in this rule start fromS₁and follow the order in the given sequence to result inS_n. By only keeping the sequential rules that have high enough support,

(31)

we can get segments or paths that we can piece together to form a search space. In particular, in this space, we can gather the following information:

• f_s(r_i) =s_j maps a customer recordr_i to a states_j. This function is known as the customer-state mapping function. In our work, this function is obtained by applying odd-log ratio analysis [8] to perform a feature selection in the customer database. Other methods such as Chi-squared methods or PCA can also be applied.

• p(+|s)is the classiﬁcation function that is represented as a probability function.

This function returns the conditional probability that state sis in a desirable class. We call this function the state-classiﬁcation function;

• p(s_k|s_i,a_j)returns the transition probability that, after executing an actiona_jin states_i, one ends up in states_k.

Once the customer records have been converted to states and the state transitions, we are now ready to consider the notion of a plan. To clarify matters, we describe the state space as an AND/OR graph. In this graph, there are two types of node. Astate noderepresents a state. From each state node, an action links the state node to an outcome node, which represents the outcome of performing the action from the state.

An outcome node then splits into multiple state nodes according to the probability distribution given by the p(s_k|si,aj) function. This AND/OR graph unwraps the original state space, where each state is an OR node and the actions that can be performed on the node form the OR branches. Each outcome node is an AND node, where the different arcs connecting the outcome node to the state nodes are the AND edges. Figure 2.1 is an example AND/OR graph. An example plan in this space is shown in Figure 2.2.

6

$ $

6 6

$ $

6 6 6 6

6 6

$

Fig. 2.1 An example of AND/OR graph

We define the utilityU(s,P)of the planP=a1a2...a_nfrom an initial state s as follows. LetP be the subplan ofPafter taking out the first actiona1; that is, P=a₁P. LetSbe a set of states. Then the utility of the planPis defined recursively

1 3