Text analytics in industry: Challenges, desiderata and trends

(1)

Review

Text

analytics

in

industry:

Challenges,

desiderata

and

trends

Ashwin

Ittoo

a,

*

,

Le

Minh

Nguyen

b,c,d,

*

,

Antal

van

den

Bosch

e

a_Department_of_Operations,_University_of_Lie`ge,_Lie`ge,_Belgium

b_Division_of_Data_Science,_Ton_Duc_Thang_University,_Ho_Chi_Minh_City,_Viet_Nam c_Faculty_of_Information_Technology,_Ton_Duc_Thang_University,_Ho_Chi_Minh_City,_Viet_Nam d

SchoolofInformationScience,JAIST,Japan

e

CentreforLanguageStudies,RadboudUniversity,Nijmegen,TheNetherlands

Contents

1. Introduction... 97

2. Populartextanalyticsapplicationsinindustry... 97

3. Textanalyticsresearchandapplicationsinindustry... 98

3.1. Textanalyticsinindustrialapplicationsreview... 98

3.1.1. Adistributionalapproachtoopenquestionsinmarketresearch... 98

3.1.2. Analysingandevaluatingthetaskofautomatictweetgeneration:knowledgetobusiness... 99

3.1.3. Amethodologyfortrafﬁc-relatedtwittermessagesinterpretation... 99

3.1.4. Textclassificationbasedfiltersforadomain-specificsearchengine... 100

3.1.5. Naturallanguageprocessingforaviationsafetyreports:fromclassiﬁcationtointeractiveanalysis ... 100

3.1.6. Integratingasemantic-basedretrievalagentintocase-basedreasoningsystems:acasestudyofanonlinebookstore... 101

3.1.7. Turningusergeneratedhealth-relatedcontentintoactionableknowledgethroughtextanalyticsservices... 101

3.2. Overallreview... 102

3.2.1. Methods,techniquesandapplications... 102

3.2.2. Implementation... 102

3.2.3. Evaluationandmetrics... 103

4. Challengesinindustrialapplications ... 103

4.1. Challenge1:Variety/handlingheterogeneousdatasources... 103

4.2. Challenge2:Textandgenreartifacts ... 103

4.3. Challenge3:Lackofgold-standardsandannotateddata... 103

4.4. Challenge4:Qualityofresults... 103

ARTICLE INFO

Articlehistory:

Received27November2015 Accepted1December2015 Availableonline30December2015 Keywords:

Textanalytics Textdatamining

Naturallanguageprocessing Industrialapplications Trends

ABSTRACT

Therecentdecadeshavewitnessedanunprecedentedexpansioninthevolumeofunstructureddatain

digitaltextualformats.Companiesarenowstartingtorecognizethepotentialeconomicvaluelying

untappedintheirtextdatarepositoriesandsources, includingexternalones,suchassocial media

platforms,andinternalones,suchassafetyreportsandothercompany-speciﬁcdocumentcollections.

Informationextractedfromthesetextualdatasourcesisvaluableforarangeofenterpriseapplication

andforinformeddecisionmaking.Inthisarticleweprovideasystematicreviewofthecurrentstateof

theartintheapplicationoftextanalyticsinindustry.Ourreviewisstructuredalongthreedimensions:

theapplicationcontext,themethodsandtechniquesutilized,andtheevaluationprocedure.Basedonthe

review,weidentifythedifferentchallengesandconstraintsthatanreal-world,industrialenvironment

imposesontextanalyticstechniques,asopposedto theirdeploymentinmorecontrolled,research

environments.Inaddition,weformulateasetofdesideratathattextanalyticstechniquesshouldsatisfy

in order to alleviate these challenges and to ensure their successful deployment in industry.

Furthermore,wediscussfuturetrendsintextanalyticsandtheirpotentialapplicationinindustry.

ß2015ElsevierB.V.Allrightsreserved.

* Correspondingauthors.

E-mailaddresses:[email protected](A.Ittoo),[email protected],[email protected](L.M.Nguyen). ContentslistsavailableatScienceDirect

Computers

in

Industry

j ou rna l h ome p a ge : w ww . e l se v i e r. co m/ l oc a te / c om pi nd

http://dx.doi.org/10.1016/j.compind.2015.12.001

(2)

4.5. Challenge5:Velocity... 104

5. Desiderataofindustrialapplications ... 104

5.1. Desideratum1:Flexibility–data... 104

5.2. Desideratum2:Flexibility–language... 104

5.3. Desideratum3:Robustness... 104

5.4. Desideratum3:Minimalsupervision... 104

5.5. Desideratum4:Humanintervention... 104

5.6. Desideratum5:Easeofuse... 105

5.7. Desideratum6:Velocity–fastanalysis... 105

5.8. Desideratum7:Extrinsicevaluation... 105

6. Futuretrends... 105

References... 106

1. Introduction

Inrecentyears,theworldhaswitnessedasurgeinthevolume ofdatainunstructured,textualformat.Thisproliferationcanbe attributedtoseveralfactors,includingtheubiquityandcontinuous growthofICTandsocialmediaplatforms.Forinstance,societyat largecannowexpressitselfviamessagespostedonline;similarly, individualuserscandescribetheirpositiveornegativeexperiences withproductsorservicesindedicatedforums.

Thisrapidexpansionintextdatahasalsobeenwitnessedinthe enterprise ICT realm, which has evolved from a landscape dominatedby structured data,organizedin relational database managementsystemsanddatawarehouses,toonedominatedby semi-structuredandunstructureddata,includingtextdata.Buried within these huge volumes of texts are valuable information nuggets, which, if correctly identified and extracted, can be exploitedforinformeddecision-makingandforsupportingawide range of enterprise activities. These include mining customer opinions(sentiments)onproductsinordertoenhancecustomer satisfaction and improve product quality, and increasing the efficiency of certain tasks to optimize workflows, such as document classification. The need to mine useful information from these textual data sources led to the emergence of text miningtechniques,which canbeconsideredasanextensionof classicaldatamining(knowledgediscoveryindatabases)intended fortraditionalstructuredandunstructurednon-textualdata.

ArecentreportoftheMcKinseyGlobalInstitute[1]highlights theeconomicpotentialoftext(anddata)miningtechniques.For instance,abetterexploitationofdatausingappropriatedataand textminingtechniques would enableUS Health Care tocreate more than $300 billion in value. Similarly, the efﬁcient useof insightsgarneredfromdatatoimproveoperationsandtodetect fraudscouldgenerateupto$250millionofpotentialannualvalue toEurope’spublicsectoradministration.

Thus,textminingispoisedtoplayanincreasinglyimportant roleinindustryinthecomingyears.Infact,manycompanieshave acquired a competitive advantage by exploiting text mining technologies.AcaseinpointisNetﬂix:accordingtoitsdirectorof StreamingScienceand Algorithms,oneof the mostinteresting waystoexploitthehugevolumesofdatacollectedfrommembers isintheuseoftextminingtoanalyzethesedatatoimprovethe actualqualityofthemoviesandshows[2,3].Inasimilarvein,the IBMWatsonQuestion-Answeringsystem[4–6]hasopenedupnew opportunitiestoimprovethequalityofserviceinvariousdomains andindustries,includinghealthcareandﬁnance.

Over the years, text mining techniques have evolved in sophistication,incorporatingmoreandmoreelementsoriginating from computational linguistics research. This evolution has nuancedthedistinction betweentextminingand otherrelated researchdisciplines,suchasnaturallanguageprocessing. Conse-quently,inthisarticle,weusethetermtextanalytics(TA)toreferto bothtextminingandnatural languageprocessing(NLP), unless

otherwisestated.AnothermotivationforpreferringTAstemsfrom itbeingmorepopularinindustrythanTMorNLP.

Ouraimwiththisarticleistoprovideanoverviewofindustrial TAapplications.Speciﬁcally,wewillreviewseveral state-of-the-artTAtechniques,developedandevaluatedwithinthecontextof academicresearch, and appliedtoaddressreal-lifeproblems in industry. Thus,an importantpointtonoteis thatourarticle is neitherareviewofcommercialsolutions,suchasTApackagesfrom vendorslikeSASorIBM,norareviewofcurrentTA/NLPresearch efforts, suchasthose publishedin scientiﬁc outletsas theACL conferencesandjournals.1

Our article is structured as follows. In Section2 we give examplesofpopularcompaniesthataredevelopingoradoptingTA solutionsforvariousapplications.Then,inSection3,wepresent our actual review and describe state-of-the-art TA techniques developedinthescientiﬁccommunityandappliedtosolve real-worldindustrialproblems.Wepresenteachtechnique systemati-callybystructuringourreviewalongthreedimensions: Overall Context,MethodsandTechniques,andEvaluation.Basedonour reviewinSection3weidentifyasetofchallengesandconstraints that industrial applications pose to TA techniques in Section4.Next,inSection5weformulateasetofdesideratathat TA techniques should satisfy in order to ensuretheir efﬁcient deployment within industrial environments. Finally, Section6

discussesfuturetrends.Wediscusscurrentinnovationsintheﬁeld ofTA,suchasDeepLearning,WordEmbeddings,andTAandVision, and theirapplicationsin industrialcontextstosolvereal-world problems.

Anoteontheterminology:asmentionedbefore,wewillusethe termtextanalytics(TA)torefertotextmining(TM)andnatural languageprocessing(NLP).Also,wewillusethetermtechniqueand algorithminterchangeably,whereverdeemedmoreappropriate. 2. Populartextanalyticsapplicationsinindustry

Severalwell-knowncompaniesareadoptingTAtosupportand improve their core activities.We give 3examples, which were chosen to reﬂect the added value that TA brings to diverse applicationdomains.

Netﬂix:Netﬂixmakesheavyuseof TAtechniquestoanalyze feedback and comments that its members post on various onlinesources,includingsocialmedianetworks.The informa-tionminedfromthesesourcesarethenusedforaplethoraof activities.Forinstance,informationonmembers’preferencesis useful in recommendation systems and in providing more personalizedcontents.Thisinformationalsoprovidesvaluable informationoncontentsthatneedtobereplacedastheyfailto satisfythedemands ofthemembers,andon inaccuracies in captions and subtitles. Thus, at a more strategic level, TA

1

(3)

techniques are at the core of powerful models that enable Netflixtodeliverbetterandenrichedcontents,closingtheloop onqualityandenhancingthemembers’overallexperience[2]. Bank of England: The bank’s Advanced Analytics division is adoptingTA techniquestoaddress severalpotential applica-tionsacrosscentralbankingandtobuildmoreagileand wide-ranging data analysis capabilities for the future. In one exemplar application, they searched for tweets containing phrases or terms that could indicate that depositors were preparing to withdraw their money from Scottish financial institutions in the run up and immediate aftermath of the ScottishindependencereferenduminSeptember2014[7]. IBMWatsonQuestion-AnsweringSystem:TheWatson question-answeringsystemisoftentoutedasaflagshipTAtechnology fromIBM.Inessence,aquestion-answeringsystem(QA)system takes asinput questions expressed in natural language and looks up or reasons about the answers based on relevant information extractedfromvarious sources.Theinput ques-tionsrangefromsimpledefinitionor‘‘whatis’’questions,such as ‘‘what is the capital of Belgium?’’ to more complex formulations,suchasriddles or‘‘why’’and‘‘how’’questions. Healthcareisoftenconsideredasoneofthemostpromising applicationdomainsforQAsystems[5].Forinstance,Watson’s QAcapabilitieshavebeenaugmentedwithspeechrecognition technologies foruseasanaidtomedicaldiagnosisina joint project involving IBM, Nuance Communications, Columbia UniversityMedicalCenter,andUniversityofMaryland Univer-sityofMedicine[5].Alsointhehealthcaredomain,Wellpoint insurersandMemorialSloan-KetteringCancerCenter(MSKCC) relied on Watson technology toread and interpret massive volumesof datain textformat,includingmedical literature, patients’ treatments and family histories and clinical trials. Information fromthese textdocuments wereused to assist oncologisttorecommendcoursesoftreatments[8].According toIBM,atamoregenerallevel,WatsonanditsTAcapabilities willchangetheparadigminwhichweworkwithcomputers andwilltransformmanyindustries.Watson’sabilitytoingest, process and understand large volumes of data faster than human counterparts extends its applications over a whole rangeofinformation-sensitiveindustries,suchasinthelegal domain where lawyers have to sift through large piles of evidence materials aswell as in theinsurance and banking domains, which requireprocessingmassive amounts of text dataintheformofclaims,emailsandfinancialstatementsand reports[9–11].

TheaforementionedapplicationstendtorelyoncommercialTA solutions,oftenmarketedaspartofenterpriseapplicationsoftware. Asmentionedearlier,our aimwith this articleis toprovidean overviewofstate-of-the-artTAtechniques,developedinthecontext ofscientiﬁcresearch,andappliedtoaddressreal-world,industrial problems.Thiswillbethethemeofthenextsection.

3. Textanalyticsresearchandapplicationsinindustry Ourreviewofthecurrentstate-of-the-artwillbebasedonthe scientiﬁcarticlespublishedinthisspecialissueoftheComputers inIndustryjournal2_on_text_analytics_in_industry,_namely_[12–18]_. Ourrationaleforfocusingonthesearticlesareasfollows.

Recency: by virtueof their recency, thesearticlesreﬂect the currentstate-of-the-artandtrendsintheﬁeld.Thus,theyserve asanup-to-datereferenceforgroundingouranalysis. Heterogeneity of Application Domain: the TA techniques presented in the articles are applied to a range of diverse

domains,fromclassicalapplicationareassuchashealthcare/ bio-medical to more novel ones such as road trafﬁc and aviation.Thus,theyhighlighttheversatilityofTAtechniques anddemonstratetheirpotentialandthepromisingroletheyare expectedtoplayinvirtuallyallindustrysegments.

ScientificallySoundMethodsAppliedtoIndustry:theapplications presented inthe articlesillustratehow scientific techniques couldbeappliedtoaddressreal-life,industrialproblems.Thus, theyreflectagoodmixof‘‘theoryandpractice’’.

Quality/JournalReputation:theComputersinIndustry journal has a good reputation in publishing original, high-quality, application-oriented research papers that demonstrate the industrial useof neworexisting technologies in knowledge intensivedomains.

WewillstartinSection3.1byreviewingtheTAtechniquesand applications in each article individually.3 _To _facilitate _our discussion, each review will be structured along three main dimensions,viz:

1.Overallcontext:providinggeneralinformation(fore.g. applica-tiondomain).

2.Methods and techniques: describing the main methods and techniquesemployedaswellastheirimplementations(fore.g. machinelearningandnaturallanguageprocessinglibrariesor toolkitsthatwereused).

3.Evaluation:describingtheexperimentsperformedtoassessthe performanceoftheproposedtechniques.

Afterwards,inSection3.2,weprovideamoregeneralreview, encompassingthedifferentTAtechniques, andwehighlightthe salientaspectsacrossthem.

3.1. Textanalyticsinindustrialapplicationsreview

3.1.1. Adistributionalapproachtoopenquestionsinmarketresearch 3.1.1.1. Overallcontext. ThisarticleproposesasystemcalledThe Klugator Engine (TKE), for analyzing responses to open-ended surveyquestions. Theresponsesareexpressed infreelyformed natural language texts in English and German. The proposed systemwasdevelopedinacollaborativeeffortbetweenacademic andindustrialpartners.Thesystemisalreadyincommercialuseas thecoreengineoftheRogatorTextClusteringSolution(RogTCS), distributedbyGermanmarketresearchcompanyRogatorAG. 3.1.1.2. Methodsandtechniques. Theproposedsystemstartswith the classical step of pre-processing, in which the input texts (documents)aresuccessivelyanalyzedbydifferentcomponentsof a standard (NLP)pipeline. This involves sentence splittingand tokenizationwitha customizedversionofNTLK[19],and word stemmingwiththeSnowballstemmer[20].

Documents are represented according to the basic bag-of-words(BoW)model,yieldingaterm-documentmatrix(vectors). To alleviatethe issue of sparsity in the matrix, Singular Value Decomposition(SVD)[21]isappliedinordertoidentifythemost importantmatrix‘‘dimensions’’.

Then, clusters of similar documents are identified using the traditionalK-Meansalgorithms.Similaritybetweendocumentsis computedasthecosineanglebetweentheirrespectivevectors.The numberofclusterstobediscovered,i.e.theparameterK,issetto20,a valuefoundtoyieldadequatelyinformativeclustersforthedomain. Thesentencesarealsoassignedasentimentscore,rangingfrom 1(extremelynegative)to+1(extremelypositive),reflectingtheir polarity,withasimplifiedversionoftheSentiKlue[22]system.

2

http://www.journals.elsevier.com/computers-in-industry/. 3

(4)

Finally,theresultsofthetextanalysisaredisplayedintheform ofasemanticmap.Inthismap,clustersofsimilardocumentsare depictedasnodes,whoseareaisproportionaltotheclusters’size. Node are labeled with corresponding topics, which are salient wordsidentiﬁedfromthedocumentsusingthelog-likelihoodratio

[23].Thenodesarecoloredaccordingtotheoverallsentimentof the documents in the clusters. Clusters with documents that expressed a positive sentiment are displayed as green nodes. Conversely,rednodesdenotedanegativesentiment,whileyellow is used for neutral sentiment. Furthermore, the nodes were positioned such that their distances reﬂected the similarity betweentheirtopics.

3.1.1.3. Evaluation. The evaluation focuses on the systems’ two main components, namely, text clustering and sentiment analysis.Evaluation provedto bea ratherchallenging process given the lack of clearly deﬁned gold-standards in industrial applications,whichisfurthercompoundedbythehighlevelof subjectivityinvolvedintaskssuchasclusteringandsentiment analysis.Theexperimentsshowthatevenhumanjudgesfoundit difﬁculttoassessandagreeonthecorrectnessoftheclustersand sentiments.

Concerning the text clustering task, the performance is evaluated using the standard purity measure on different ‘‘configurations’’. The best configuration was one in which the 20clustersgeneratedautomaticallybyK-Meanswererefinedby human intervention (for e.g. merging and splitting clusters), resultingin17clusters,withanoverallpurityscoreof0.66.

Concerningtheevaluationofthesentimentanalysistask,the systemfavorsprecisionoverrecall.Therationaleindoingsoisto ensurethatasentenceislabeledaspositiveornegativeonlyifits polarity could be inferred (as positive or negative) with high conﬁdence (i.e. precision), and to avoid giving misleading impressionsonthesentimentpolarity.Thebestprecisionscores reported are 0.79 (forpositive polarity) and 0.90 (for negative polarity).Itisalsoworthnotingthatthesystemachievesanoverall F1-scoreof0.69, comparingrelativelywellwiththeF1-scoreof 0.76ofahumanjudge.

3.1.2. Analysingandevaluatingthetaskofautomatictweet generation:knowledgetobusiness

3.1.2.1. Overall context. This article deals with the issue of automatictweetgenerationinEnglishandSpanish.Aninteresting aspectis thattweetgenerationisaddressednot froma natural language generation (NLG) perspective, but rather as a text summarizationproblem.Inessence,textsummarization techni-quesareappliedtonewsarticles,yieldingultra-concise summa-ries,whichthenserveastweets.Apotentialapplicationofthese automaticallygeneratedtweetsisinsupportingthemarketingand communicationactivitiesofcompanies.Therefore,itisimportant toassessthetweets’quality,especiallytakingintoaccountthefact thattheyarecreatedwithouthumanintervention.

Thetweets’qualityisassessedintermsof2dimensions,viz: 1.Interestingness: towhat extent users would be interested in

knowingmoreaboutthetweets’contents.

2.Informativeness: whether the tweets accurately reﬂected the newsarticlesfromwhichtheyweregenerated.

Furthermore, this study also attempted at identifying linguistic featurescontributingtotheinterestingnessandinformativenessof tweets.

3.1.2.2. Methodsandtechniques. Sevenstate-of-the-arttext sum-marizationtechniques,describedin[24],wereappliedto201news articles (101 in English and 100 in Spanish), resulting into

1407summaries, which wereconsideredas tweets.In addition, 201tweetswerealsocreatedmanually,i.e.bymanually summariz-ing201newsarticles.ThetweetswereprocessedusingFreeLing

[25],anTAtoolkitforEnglishandSpanish,toextractvarioustypesof linguisticfeaturesfromtheircontents,including:

Lexicalfeatures,suchasParts-of-Speech(POS)and numberof words

Syntacticfeatures,suchasnounphrasesandverbphrases Semantic features, such as named entities (e.g. persons,

locations)

3.1.2.3. Evaluation. Thetweets’quality(interestingness, informa-tiveness)wasevaluatedmanuallybyEnglishandSpanishnative speakers. These judges were not informed on the tweets’ provenance,i.e.whethertheyhadbeenautomaticallyormanually generated. They were, however, given the news articles from whichthetweetshadbeengenerated.Eachtweetwasevaluated bythejudgesona3-levelLikertscalefortheirinterestingnessand informativeness.Thesescoresweresubsequentlyusedtocalculate ameaninterestingnessandmeaninformativenessscore.

AnanalysisoftheevaluationresultsrevealedthatfortheEnglish tweets,therewasnosigniﬁcantdifferenceintheinformativeness scoreof automaticand manualtweets,i.e.usersconsideredthe automatically generated tweets to be as informative as their manually generatedcounterparts. Conversely, for Spanish,users consideredthemanualtweetsasbeingmoreinformative.

Asfor theinterestingness,usersperceivedtheautomatically generated tweetsin Englishasbeingmoreinteresting thanthe manuallycreatedones.Conversely,forSpanish,manually gener-atedtweetswereconsideredasmoreinteresting.

Withregardstothelinguisticfeatures,syntacticandsemantic featureswerefoundtocontributethemosttothequalityofEnglish tweets.However, forSpanishtweets,no particularfeatures that werearchetypalofgoodqualitytweetscouldbeidentiﬁed. 3.1.3. Amethodologyfortrafﬁc-relatedtwittermessages interpretation

3.1.3.1. Overall context. This article presentsa systemfor auto-matically interpreting trafﬁc-related Twitter messages in the Portuguese language. Interpretation in this case refers to the transformationofthetweets’unstructuredtextualcontentsintoa morestructuredformalism,namelyasRDF-triples.

Thesystemisdeployedasaprototypeformonitoringthetruck fleet of a gas transportation and fuel transportation company. Severalbenefitsarereportedasaresultofthesystem’sapplication in thesecompanies, suchascost reduction,more efficientfleet management,andimprovedcustomersatisfactionduetoamore accuratepredictionofdeliverytime.

3.1.3.2. Methodsandtechniques. Anontology,calledTEDO(Traffic Event Domain Ontology), specific to the traffic domain was manuallyconstructed.Itisarelativelysimpleontology, compris-ingofnineclasses, sevenobjectproperties,andeightdatatypes properties(relations).ExampleconceptsinTEDOareLocationand WeatherCondition,whileexamplerelationsinclude‘‘causes’’and ‘‘hasPrimaryLocation’’. The ontology was subsequently used to converttweetsintocorrespondingRDFtriples.

Theﬁrststepofthisconversionwastopreprocessthetweets’ textualcontentsusingastandardpipeline,whichinvolved tokeniz-ing the textsand annotating the tokens with morpho-syntactic information,suchasPOSandstems.Thesepre-processingtaskswere executedusingtheF-EXTweb-service[26].

Thetokensarethenannotatedwithparticulartagsusingnamed entity recognition, implemented with a sequential minimal

(5)

optimizationsupportvectormachine(SMOSVM)[27]fromthe Weka machine learning library [28]. Example tags in the application include ‘‘interdiction’’, ‘‘collision’’. Each tag corre-sponds to one ontology term. Furthermore, the coordinates of entitiestaggedasLocationinthetweetsaredeterminedusingthe SmartGeocodealgorithm[29]forgeocoding.

Thenexttaskintheinterpretationoftweetsisthatofdetecting relationsbetween entities.Itinvolves inferringthedependency treebetweentheentitiesandlearningthepossiblerelationsthat could exist between them using a large margin structured perceptron[30].

Finally, thetweets are converted into RDF-triples.This step reliedonthetaggedtweets,theTEDOontology(eachtag/entityina tweetcorresponded toanontology term), and on therelations betweenthe entitiesinorder tosimplifythetranslation of the naturallanguagetextsintoRDF-triples.

3.1.3.3. Evaluation. Theevaluationisperformedon690Portuguese tweetsdescribingtrafﬁc-relatedevents,andfocusesonthetasksof namedentityrecognitionandrelationextraction.Therespective performancesofthese2lattertasks,measuredwiththeF1metric, arebotharound0.7.

3.1.4. Textclassificationbasedfiltersforadomain-specificsearch engine

3.1.4.1. Overall context. The system presented in this article attemptsto automaticallypredictfilters for refiningthesearch resultsofdomain-specificsearch-engines.Thetargeteddomainis thatofjobsearch.Examplefiltersforthisdomainarethe‘‘region’’ towhichajobadvertisementisapplicable,orthejobtype (full-timeorpart-time).Theissueofpredictingfiltersisaddressedasa classificationtask,wheretheaimistoassignthedocumentsto predefinedcategories(e.g.‘‘full-time’’,‘‘part-time’’),which actual-lyactasfilters.Theproposedsystemforfilterpredictionhasbeen deployedtheinKimetaapplication,4areal-life(domain-specific) search-engineforjobsearchesintheGermanlanguage.

3.1.4.2. Methods and techniques. Filter prediction (document categorization)isperformedwiththeCENFAsystem[31],which incorporatesanensembleofsupportvectormachines(SVM)inan activelearningconfiguration.First, abaseclassifier attempts to predict the filters to be assigned to the documents. Those documentsthatcouldnotbeconclusivelycategorized,basedon athresholdscore,arecategorizedbyhumanannotators(domain experts).Thesemanuallycategorizeddocumentsarethenusedto train another classifier ensemble, known as the specialized classifier. This iterative procedure of manually annotating ambiguous documents, and then using them to train the specialized classifier is repeated until the final classification resultsaredeemedtobesatisfactorybyhumanexperts.In this activelearningapproach,thebaseclassifieristrainedonlyonce, whilethe specializedclassifier is trained iterativelybut with a smallernumberofdocumentsineachsubsequentiterations.Such aconfigurationconsiderablyreducestheoveralltrainingtime. 3.1.4.3. Evaluation. The system is evaluated by measuring its performanceinpredicting3specificfilters,namely‘‘R&D’’, ‘‘Full-Time’’and‘‘Service/CustomerSupport’’,onacorpusof300 docu-ments.Theperformancemetricsusedwerethe(i)areaundercurve (AUC), which provides a more robust estimate when the distributionof documentsper category is imbalanced, and (ii) weightedaverageprecisionsincetheaimofasearch-enginewasto provide results with the highest accuracy and recall was

consideredaslessimportant.ThereportedAUCachievedbythe systeminpredictingdocumentcategoriesrangesfrom0.7to0.8 (fordifferentcategories),whilethemeanaverageprecisionranges from 0.8 to 0.85. The experiments also show that the active learning approach is useful in gradually reducing the size of thetrainingset,andthus,thetrainingtime.Theperformanceofthe systemalsoincreasedduringeachiterationoftheactivelearning process.

3.1.5. Naturallanguageprocessingforaviationsafetyreports:from classiﬁcationtointeractiveanalysis

3.1.5.1. Overallcontext. Inthisarticle,asystemforautomatically analyzing safety reports in the aviation domain is proposed. ReportsareexpressedinFrenchandin English.Analysisinthis application involves classifyingthereports accordingto safety-relatedcategories,deﬁnedintheADREP.5_The_categorized_reports canthenbeusedforvariousactivities,e.g.theycanbequeriedand the evolution of existing safety events (categories) can be monitored over time. However, text categorization does not enabletheidentiﬁcationofemergingtrendsandtopics.Forthis task,theproposedsystemreliesontopicmodeling.Inaddition,the systemalso incorporatesaninformation retrieval (IR)tool that enables users to explore the report corpus interactively by searchingforreportsdescribingsimilarincidents.

The system is developed by a joint academia-industrial collaboration. Some modules have been deployed for real-life usagebytheFrenchDirectionGnraledelAviationCivile6_and_are_at anadvancedtestingphaseataFrenchairlinecompanybeforetheir potentialintegrationintheirsafetymanagementsystem. 3.1.5.2. Methods and techniques. Analysis of the aviation safety reportsstartswithapre-processingphase.Duringpre-processing, terminologicalvariations(e.g.‘‘take-off’’,‘‘T/O’’)areconﬂatedtoa single representation (e.g. ‘‘take off’’). This normalization is achievedusingarule-basedapproach.Then,wordsarestemmed with the Snowball stemmer [20] and annotated with their respectivePOStagsandlemmatizedusingTreeTagger[32]. Rele-vant words are identiﬁed using the term frequency-inverse documentfrequencyscore(TF-IDF).

As mentionedearlier, the categories used for classifying the reportsarefromtheADREPtaxonomy,which,with37categories, offers a fine-grained scheme for incident classification. In the proposedsystem,thereportsclassificationisperformedusingSVM, implementedin theLiblinearJava Library[33]. Featuresinclude wordstems,charactern-grams,andstem-grams.Oneofthemain challenges inthe classificationtask isthat asingle reportcould belongtomultiplecategories.Thisissueisaddressedbytrainingone independentbinarySVMperADREPcategory(i.e.37inall)topredict whetherareportbelongstoaparticularcategoryornot.

Thereportsarealsoanalyzedinordertodetectlatenttopics fromtheircontents.Thisis achievedusingtopicmodelingwith Gibbs sampling, implemented as part of the Gensim library

[34].Thenumberoftopicstobediscoveredissetto50,chosenasa valueasitenablesasatisfactoryinterpretationofthediscovered topics.

The ﬁnal analysis task is that of information retrieval, implementedintheformofasearchengine,whichenablesusers toidentifyreportsdescribingsimilarsafetyevents.Thereportsare represented using the traditional BoW model as vectors, and similarityiscomputedasthecosineoftheanglebetweenvectors.

4

http://www.kimeta.de/.

5

Accident/incident Data REPorting; http://www.icao.int/safety/airnavigation/ aig/pages/adrep-taxonomies.aspx.

6

DGAC,FrenchCivilAviationAuthority,http://www.developpement-durable. gouv.fr/-Secteur-Aerien,1633-.html.

(6)

Searchresults,i.e.similardocuments,areplottedalongatemporal axis on a 2-D scatter-plot for easier visualization. Reports are displayedadotsontheplot;thehigherthedotsareontheplot,the moresimilaritytheyexhibitwiththeinputquery/sourcereport. Suchatemporaldisplayisusefultomonitorwhetherincidentsare isolatedorconstituteanemergingthreat,andtodeterminetherisk levelassociatedwithincidents.

The search-engine solution is used together with the text categorizationmoduleinanactivelearningconfiguration.Inthis configuration,humanexpertsstartwitharoughestimationofthe reportsbelongingtoaparticularincidentcategory.Thisinitialset ofreportscanbeobtainedviathesearchengine,andisusedtotrain anSVM.Borderlinecases,i.e.reportsthatcouldnotbeconclusively classifiedby theSVM,are presentedtothe experts formanual classification,andareusedtotraintheSVM.Thisiterativeprocess ofmanuallyannotatingborderlinecasesandusingthemtotrain theclassifier isrepeateduntiltheexpertsaresatisfiedwiththe results.

3.1.5.3. Evaluation. The differentcomponents of thesystemare evaluatedseparately.Theperformanceonreportclassificationis estimatedusing10-foldcross-validationoveracorpusof136,681 reports.TheoverallF1-score,computedoverallthecategories,is 0.79.Asexpected,somevariationsareobservedintheperformance acrossdifferentcategories,withhigherperformancefornarrowly definedcategoriesandlowerperformanceforlooselydefinedones. Concerningthetopicmodelingtask,15representativewords foreachofthe50topicsgeneratedwerepresentedtoadomain expert.Thelatterthenassessedwhethereachofthe15word-set (pertopic)accuratelydescribedauniquetheme(topic).Basedon thisevaluationprocedure,43(outofthe50)word-setcouldindeed beassignedtoaparticulartheme.Asanothermeanstoevaluatethe topics’ quality, the Pearson correlation statistics is computed between the15 word-sets and themeta-dataof corresponding reports.Theresultsaremitigatedwithahighdegreeofcorrelation of0.7forsometopics,andamuchlowercorrelationof0.11for others.

3.1.6. Integratingasemantic-basedretrievalagentintocase-based reasoningsystems:acasestudyofanonlinebookstore

3.1.6.1. Overall context. The system presented in this article employs Case-Based-Reasoning (CBR) to enhance the search experience of end users in business-to-consumer websites. Speciﬁcally,itacceptsqueriesasinputs,whichcouldbeformulated either as natural language questions or as keywords. It then consultsacasebaseconsistingoftargetproblems(e.g.questions) andtheir correspondingsolutions(answers).Thebestmatching solutioninthecaseisthenreturnedasanswertotheinputquery. 3.1.6.2. Methodsandtechniques. Atwo-stepmechanismisusedto ﬁndmatchingcasesinresponsetoaninputquery.First,techniques basedonrecognizingtextualentailment(RTE)determinewhich casesinthecase-basearemostrepresentativeoftheinputcase.Ifa representative case is found, based on a threshold score, its solution is returned in response to the input. Else, Short-Text Semantic Similarity (STSS) techniques are used to identify potentiallymatchingcases,whichmayprovideasolutiontothe targetproblem.

Similaritybetweentheinputqueryandthecasesinthe case-baseiscomputedbyvarioussemanticsimilaritymeasures,suchas the PATH method [35] and other traditional WordNet-based measuressuchas[36]and[37].

3.1.6.3. Evaluation. Two types of evaluations are performed, namely(1)ascientiﬁcevaluationagainststandardbaselinesand

on standardcorpora,astypicallydone inTAresearch,and(2)a case-studyevaluationinanonlinebookstore.

For the scientiﬁc evaluation, the performance of the RTE component wasassessedover theMicrosoftParaphrase Corpus

[38],consistingof4076trainingand1725testingsentencepairs, with each pair labeledas 0 or 1 to indicate whether one is a paraphraseoftheother.Theproposedsystemachievesanaverage similarityscore (betweensentences)of 0.73,comparing wellto otherbenchmarkssuchas[39]and[40].Theperformanceonthe STSStaskwasassessedoverthecorpuspresentedinLietal.[41], consistingof65sentencepairs.Theproposedsystemachievesa score of 0.83, as measured using the Pearson correlation coefﬁcient.Thus,it outperformsotherSTSSapproaches,suchas

[42]and [43].Itshould benotedthat among all thesimilarity measuresevaluatedintheproposedsystem,thebestperformance forboththeRTEandSTSStasksisachievedbythePATHmethod

[35].

Concerning the application-type evaluation, the system’s performanceisassessedoveracase-baseconsistingof1200books fromtheAmazononlinebookstore.Queriestothecase-basewere thenissued,andtheperformancewasmeasuredusingthemean averageprecision(MAP).Thisexperimentshowsthattheproposed system achieves an MAPof 0.93, markedly outperforming two baselinesagainstwhichthesystemwascompared.

3.1.7. Turningusergeneratedhealth-relatedcontentintoactionable knowledgethroughtextanalyticsservices

3.1.7.1. Overall context. This article proposes a text analytics systemfor pharmacovigilance. Inessence, theproposedsystem extractsinformationpertainingtoAdverseDrugReactions(ADR) andtheirrelationshipstomedicationsfromSpanishsocialmedia websites.

3.1.7.2. Methodsandtechniques. ThesystemminesADR informa-tionfromtwomainsocialmediasources,namely:

1.Twitter:tweetswereharvestedbasedonspeciﬁckeywordsin Spanish

2.Saluspot7_: _an_online _forum_that _allows _public_users_to _make inquiries on health-related issues and receive answers from medicalpractitioners.

Texts from these sources are harvested and stored in a data warehouse,implementedusingElasticSearch.8_These_texts_are_then pre-processedbylemmatizationandPOStaggingusingrelevantAPIs fromMeaningCloud.9WordlemmasandPOSareemployedina rule-basedapproachtowordsensedisambiguation.Next,relevantentities areidentifiedfromthetextswithNER(whichthearticlereferstoas TopicExtraction).TheNERstepreliesonseveraltoolsandresources, including APIs from MeaningCloud,the annotation pipelinefrom GATE [44]anddomain-specific dictionaries, suchasthe Spanish-DrugEffectDBandUMLS-SNOWMED.Namedentitiesinthedomain typically correspond to drugs, effects, and diseases. Finally, the semanticrelationshipsbetweentheseentitiesarelabeledasoneof three types, viz, (1) ‘‘adverse effect’’, (2) ‘‘indication’’ and (3) ‘‘possible’’. The first two relationships are detectedbased on the SpanishDrugEffectDB, whilethe 3rd type corresponds to undocu-mentedrelationships.

3.1.7.3. Evaluation. The differentcomponentsof thesystem are evaluatedseparately.FortheNERtaskpertainingtothedetection ofdrugs,themainsourceoffalsenegativesisabbreviateddrug

7 http://www.saluspot.com. 8 http://elasticsearch.org/. 9 https://www.meaningcloud.com/.

(7)

names,while falsepositives aremostly adjectivesthat arealso usedasdrugnames(fore.g.‘‘sedatives’’).ThereportedF1-scoreis 0.76.Abaseline NER thatrelies on a gazetteer populatedfrom domain-speciﬁcdictionaries achievesa muchlowerF1-scoreof 0.43.ConcerningtheNERtaskpertainingtothedetectionofan effect,theF1-scoreoftheproposedsystemisestimatedat0.6.In thiscase,falsenegativesaremostlyduetocolloquialexpressions (fore.g.‘‘myheadisringing’’),whilefalsepositivesaremostlydue toambiguousterms.Withregardstotherelationextractiontask, thesystemachievesanF1-scoreof0.72;falsenegativesaremainly attributedtothelimitedcoverageofdictionariesforlabelingthe relations, while false positives are due to the difﬁculties in identifyingrelationsfromcomplexsentences.

3.2. Overallreview

After havingreviewedtheapplicationsindividually, wenow provide a more overall review, encompassing the different applicationsandhighlightingtheircommoncharacteristicsalong thefollowingdimensions:

1.Methods,techniquesandapplications; 2.Implementation;

3.Evaluationandmetrics.

Wefocusonthesedimensionsastheyarecommonlyexplicated acrossthedifferentapplicationsdescribedintheprevioussection. Also, these dimensions correspond to core issues that require considerationwhendevelopingandimplementingtextanalytics (TA)applications,bothinthescientiﬁcandcorporaterealms. 3.2.1. Methods,techniquesandapplications

The articles we reviewed cover a wide range of application domains,rangingfromdocumentclassificationtotrafficmonitoring. Wecandistinguishthreemaingroupsofapplications.Firstarethose aimedatalleviatinghumaneffortintediousandtime-consuming tasks,therebyenablinghumanexpertstofocusonmorecriticalor value-addedactivities.Atypicalexampleofsuchapplicationsisthat of text classification, as presented in [15,16]. Second are those applicationsthataimatfacilitatingandenhancingthe qualityof theinformation-seekingprocessofendusersandtheresults,suchas theCBRapplicationthatacceptsnaturallanguagequestionfrom users [17] and the information retrieval component of the application in [16]. Finally, applications in our third group are those that assist in decision making. These include the traffic monitoringapplicationdeveloped in [14] forfleet management. Naturally,suchawidesetofapplicationscallsforanequallydiverse setofTAtechniques(andmethods),aswillbebrieflydescribednext. 3.2.1.1. Text classification. Two of the applications rely on text classificationtechniquesfor:

1.Categorizingincidentreportsintheaviationsectoraccordingto thetypesofsafetyeventstheydescribed(i.e.safetycategories)

[16].

2.Categorizingjoboffersaccordingtothejobs’characteristics(e.g. part-time,full-time)tosupportadomain-speciﬁcsearchengine

[15].

Inbothapplications,supportvectormachinesareusedinanactive learning conﬁguration in order to enhance the classiﬁcation performance.

3.2.1.2. Text similarity. Methods for estimating text similarity constituteanothersetoftechniquesatthecoreofmanyapplications inthearticlesreviewed.Thesetechniquesmeasurethedegreeof similaritybetweentextsatthelexicallevelaswellasatthesemantic

level.Inalltheapplicationssurveyed,similarityatthelexicallevelis computed by projecting the texts (documents) into a multi-dimensional(vector)spaceandcomputingthecosineanglebetween the corresponding document vectors, commonly referred to as cosinesimilarity.Forinstance,theIRtoolof[16]reliesonthecosine similaritymeasuretoretrievesimilardocumentsinresponsetoa query.Likewise,theK-Meansclusteringalgorithmof[12]identiﬁes groups of similar documents based on their cosine similarity. Similarityatthesemanticlevelisestimatedusingthetraditional WordNet-basedmeasures[17],i.e.thedistance(e.g.pathlength) separatingtwowordsintheWordNetlexico-semanticdictionary. 3.2.1.3. Textsummarization. Varioustechniquesfortext summari-zationwereemployedin[13].

3.2.1.4. Sentimentanalysis,topicmodeling. Ourreviewrevealsthat more recent TA trends are also being deployed in industrial applications.Inparticular,sentimentanalysisisemployedin[12]to determinethepolarityofsurveyresponse,andtopicmodelingis used in [16] to automatically infer topics from aviation safety reports.

3.2.2. Implementation

Thevariousapplicationsreviewedareimplementedfollowinga modulararchitecture; i.e.the applicationis decomposed intoa sequenceoftasksandeachtaskisdeployedasaseparatemodule. Inessence,themodulesconstitutedapipeline(commonlyknown asthe‘‘NLPPipeline’’), composedofupstreamanddownstream tasks.Upstreamtasksaretypicallyconcernedwithpre-processing, suchassentencesplitting,tokenization,stemmingor lemmatiza-tion,NERandPoStagging.Theaimofthesepre-processingtasksis tomake theinput texts moreamenable to further analysis by downstreamtasks.Downstreamtasksoftencorrespondedtothe actualapplication itself, fore.g. textclassification or sentiment analysis.Suchamodularapproachisoftenadoptedincommercial softwareengineeringprojects,wherebythedifferent functionali-tiesofthefinalsoftwarearedevelopedseparatelybypotentially disparatedevelopmentteams.Amodularimplementationbrings aboutnumerousbenefits.Mostnotably,forindustrialapplications, modularizationfacilitatesmaintenance;selectedmodulesofthe applicationcanbereplacedwithminimaldisruptiontotheoverall functionality and without the need to re-write and re-test substantialportionsofthesource-code.

From our survey, we also noted that the various industrial applications leverage extensively upon open-source tools (and technologies) to implement the functionalities of the different modules.Fore.g.thesentencesplitteroftheNLTKtoolkit[19]is employedinthetextclusteringandsentimentanalysisapplication of [12]. The application for analyzing aviation safety reports presentedin [16]relies on theTreeTaggertoolkit [32]for PoS-taggingand lemmatization, whilethe Freelingtoolkit [25]was used in [13] to process tweets. The application for processing healthmessagesonsocialmediain[18]madeextensiveuseofthe annotationpipelinefromtheGATEtoolkit[44].Toclassifytexts,

[16]and[14]respectivelyusetheSVMimplementationprovided bytheLiblinear[33]andWeka[28]libraries.Theseopen-source toolsaremoretraditionallyassociatedwiththedevelopment of proof-of-concepts and experimental systems in the research community.

Anotherinterestingobservationistheexploitationof webser-vicesforperformingvarioustasks.Fore.g.[14]reliedontheF-EXT webservices [26] for processing tweets, while [18] relied on webservicesfromMeaningCloudtoprocesssocialmediamessages. TheadoptionofwebservicesintheTAcommunityreﬂectsasimilar trendintheenterprisearena,wherewebservicesareexpectedto playanincreasinglyimportantrole.Thisincreasingtrendcanbe

(8)

attributed to the growing number of enterprises realizing the beneﬁtsofcloudcomputingandadheringtothe Software-as-a-Service(SaaS)philosophy[45,46].

3.2.3. Evaluationandmetrics

Theperformanceofthevariousapplicationsisestimatedusing severalevaluationmetrics,traditionallyemployedinthecontext of TA research. For instance, the clustering quality in [12] is evaluated using the purity metric. In [17], the accuracy of the solutions,retrieved froma case-base,in responseto aquery is estimatedaccordingtotheaveragewordsimilarityscoreandthe meanaverageprecision(MAP).Theperformanceofother applica-tions,suchastextcategorization[16,15],sentimentanalysis[12], named entityrecognition [14],is estimated using thestandard measures of precision, recall and F1-score, the latter being a harmonicmeanthatbalancesprecisionandrecall[47].However,it isworthnotingthatinsomeapplications,fore.g.thesentiment analysiscomponentof[12],precisionisfavoredoverrecall.Also, theareaundercurve(AUC)waspreferredtotheF1-scorein[14]as itprovidesamoreunbiasedmeasureoftheperformance(inthis case,textcategorization)whentheclassesareimbalanced.

Theevaluationprocedureof[13]deservesparticularattention. Inthisstudy,theauthorsevaluatetheapplication’sperformance bymeasuringhowusersperceiveitsendresults.Specifically,they gaugethedegreetowhichusersperceiveautomaticallygenerated tweetsasinformativeandinteresting.Thisevaluationprocedureis somewhat atypical of the TA community, whereby evaluation tends to focus on the efficiency with which the proposed TA techniques performed a given task, such as text classification. Users’perceptionsoftheresultsandtheusefulnessare,inmost cases,overlooked,butsee[48]foranexampleofarelatednatural languagegeneration task in an academic experimental setting, usingtestsubjectstoratetheoutputofrivalingsystems. 4. Challengesinindustrialapplications

Wenowdiscussthedifferentchallengesandconstraintsthat industrialenvironmentsimposeonTAtechniques,asopposedto deployingthese techniquesin a more controlled, experimental researchenvironment.Presentinganexhaustiveorcomprehensive list of constraints and challenges is a formidable, almost impossible, endeavor, requiring a survey of all TA techniques deployedinindustry.Therefore,wewillbaseourdiscussiononthe applicationsdescribedintheprevioussection.

4.1. Challenge1:Variety/handlingheterogeneousdatasources TextstobeprocessedinTAapplicationswilltypicallyoriginate from a variety of sources, resulting in corpora of largely heterogeneousdocuments.Thisheterogeneitycanbemanifested intextsinmanydifferentways,e.g.

The texts canbeencoded in differentformatsor renderedin differentlayouts,suchaswebpages;

Thetexts’lengthcanalsobedifferent;tweetsforinstancewillbe muchshorterthanothertypesoftexts;

Theycan beexpressedinmultiplelanguages. Forinstance,in many companies, documents for the higher management is often expressed in English. Conversely, at the operational level, communication predominantly takes place in thelocal language.

4.2. Challenge2:Textandgenreartifacts

Textsinreal-life,industrialenvironmentsoftenexhibitcertain peculiarities that complicate their automatic processing by TA

techniques. Anoverview of themain peculiarities is presented below.

TAtechniquesinmanyapplicationdomainshavetodealwith short and sparse texts. These texts do not provide sufﬁcient statistical redundancy,asopposed tolarger,standard corpora employed in academic research efforts. Consequently, TA techniques are unable to acquire reliable statistical evidence fromthecontents ofsuchtexts, hinderingtheir analysis.The issueofsparsity isfurtherexacerbated bytheheavyusageof variants,suchas ‘‘TO’’‘‘takeoff’’ and‘‘take-off’’ torepresenta singletermorconcept;

Itisalsocommonforthesetextstobeexpressedinaninformal style,rifewithill-formedgrammaticalconstructs.Thus,itishard forTAtechniquestoacquirereliablelinguisticinformationfrom thecontentsofsuchtexts,impedingtheirautomaticanalysis; Another challenge, especially for text categorization

applica-tions,isthatofclassimbalance,i.e.thetraininginstances(text documents)areunevenlydistributedacrossthedifferentclasses. Fore.g.theaimoftheapplicationpresentedin[16]istoclassify aviationincidentreportsusingaclassiﬁcationschemeconsisting of 37 categories. However, thetraining data distributionwas suchthatasinglecategorycontainedaround41%ofallreports, while25categoriestogethercoveredlessthan1%ofallreports. Thedifﬁcultywhendealingwithsuchclassimbalanceproblems isthatoflearningamodelthataccuratelypredictsthemajority aswellastheminorityclasses;

Asrelatedissueasdescribedin[16]isthatsomeclassification schemesinreal-lifeapplicationsaretoounwieldyandcomplex forTAtechniques.Forinstance,aclassificationscheme consist-ing of 1600 categories, hierarchically organized across three levels, was deemed ‘‘out of reach’’ by the text classification applicationof[16].

4.3. Challenge3:Lackofgold-standardsandannotateddata Evaluating the results of TA techniques can be particularly challenging due tosubjectivityinherentin certainapplications, suchassentimentanalysis,textclassificationandtextclustering. Inthecontextofacademicresearch,suchdifficultiesarealleviated byrelyingongold-standarddatasetsandalgorithmsfromprevious researchforcomparing,benchmarkingandassessingtheresults. However, gold-standard data are rarely available in industrial settings, compounding thedifficulties in evaluatingthe perfor-manceofthedevelopedTAtechniques.Arelatedandmoregeneral issueisthelackofannotateddatafortrainingsupervisedmachine learningalgorithmsinapplicationssuchastextcategorizationand sentimentanalysis.

Thecreationofgold-standarddatasetsandtheannotationof trainingdataisatedious,time-consumingandcomplexprocess. For instance,humanexperts tookaroundfivehoursmerely for definingtheannotationguidelinestocreatetrainingdataforthe textclassificationapplicationin[15].Furthermore,asdescribedin

[12],itishard evenforhumanexpertstoreacha consensusin deﬁningannotationguidelines.

4.4. Challenge4:Qualityofresults

Giventhecurrentstate-of-the-artinresearch,itisunreasonable toexpectTAtechniquestooutperformhumansoreventoyield results that are close to perfection, i.e. 100% accuracy. This is particularlyrelevantforindustrialapplications, whereextensive domain-knowledge and expertise are crucial for the successful execution of certain tasks,suchas classifying aviation incident reportsintosafetycategoriesorﬁndingclustersofsimilartexts. Furthermore, as discussed before, these tasks prove to be

(9)

extremely challenging even for human experts due to the subjectivity involved. For instance, one of the classification schemesintheaviationsectorconsistedofmorethan1600 cate-gories [16]. As can be expected, it was considered to be too overwhelmingforusebyTA(textcategorization)applicationsfor several reasons. Intuitively, accurately discriminating between suchalarge number of classes ischallenging for automaticTA techniquesandevenforhumanexperts.Then,therearethereare theissueofsparseness,i.e.insufficienttrainingexamplesperclass, aswellasthatofclassimbalance.Anotherissuepertainstothefact thatsomecategoriescorrespondtobroad,vagueconcepts.They are inherently hard to formalize. For example, the category ‘‘systemcomponentfailure’’representsallfailuresthataffectthe largenumberofcomponentsinanaircraft.Preciselydefiningsuch acategoryishard.Furthermore,suchacategoryisoftendescribed interms ofextraneousinformation,suchascrewsdeclaringan emergencyortroubleshooting.Consequently,TAtechniquesare unabletolearnthesalientfeaturesthataccuratelydiscriminatethe categoryforclassifyingincidentreports[16].

ThelimitationsofTAtechniquesforcertaindomain-speciﬁctasks andtheadded-valueofexpertknowledgewerealsohighlightedin the study of [18]. Speciﬁcally,the proposedapplication fails in detectingdrugnamesthatarelexicallyrealizedasadjectives,fore.g. ‘‘antidepressant’’.Itwasalsounabletodetectdrugeffectsrealized usingcolloquialexpressions,fore.g.‘‘myheadisringing’’.

AgrowingtrendinmanyTAapplications,particularlythosefor textclassification [15,16], is tomake use of active learning to improvetheoverall TAprocess and theresults.However, even thoughactivelearninghasbeenshowntobeapromisingstrategy, its usage in industrial applications also gives rise to several pertinentquestions.Forinstance,animportantissueisthechoice oftheinitialdocumentsubset,whichtendstohaveasignificant impactonthefinalclassificationaccuracy.Anotherissueisthatof determininganoptimalthresholdforidentifyingthoseborderline/ ambiguous cases (documents) to be submitted to the experts’ judgments. Related to the previous issue is the problem of determininganappropriatenumberofdocumentstobepresented totheexpertsforfurtherevaluationineachiteration.

4.5. Challenge5:Velocity

‘‘Velocity’’(speed)isamajorfactorinindustrialapplications.It is manifested in two primary ways. First, some industrial applications,particularlythose thatsupportdecision makingin mission-criticalactivities,arerequiredtoprocessdata(texts)and generateresultsinatimelyandefﬁcientmanner.Second,isthe rateatwhich certainsources,especiallysocial medianetworks, producedata(textstreams).Forinstance,around6000tweetsare generated on average every second.10 _Such _high-paced _text streamsdemandatimelyanalysissothatrelevantandmeaningful informationareextractedfromtheircontents.

5. Desiderataofindustrialapplications

WenowformulateasetofdesideratathatTAtechniquesshould satisfyinordertoensuretheirsuccessfuldeploymentinindustrial applications. As will be seen in our discussion, some of the desiderata stem directly from the aforementioned challenges, whileothersaremoregeneral.

5.1. Desideratum1:Flexibility–data

As mentioned earlier,texts tobe processedin industrial TA applicationswilltypicallyoriginatefromavarietyofsources.For

instance,documents(joboffers)thatarefedtothedomain-specific search-enginein[15]areinheterogeneousformatsandlayouts. Thesearch-engineshouldbeabletoreliablypre-processallthese differenttypesofdocumentsandindextheircontents.Similarly, theapplicationfordetectingADRpresentedin[18]hastomine relevantinformationfromdifferentsocialmediasources,suchas Twitterandwebforums.Thus,industrialTAapplicationsshouldbe sufficientlyflexiblesoastosupportandprocesstextsinawide varietyofformatsandfrommultiplesources.

5.2. Desideratum2:Flexibility–language

Inaddition,industrialTAapplicationsshouldbeﬂexiblewith regardstothelanguage.Severaloftheapplicationswereviewed aretargetedattextsinlanguagesotherthanEnglish.Forinstance, the text clustering and sentiment analysis application in [12]

process texts in German and English; the traffic monitoring applicationin [14]analyzestweets in Portuguese.The domain-specific search-engine and text classification system in [15]

operates on German text, the application in [18] extracts informationonADRfromSpanishtexts,while[16]performtopic modeling, text classiﬁcation and information retrieval on texts predominantly expressed in French. This reﬂects a trend also observedinresearch,characterizedbyanincreasingnumberofTA systemsforarangeofwidelyspokenlanguagesotherthanEnglish, suchasChinese,Arabic,Russian,andTurkish.

5.3. Desideratum3:Robustness

Asdiscussedintheprevioussection,textsgeneratedinreal-life, industrial environments exhibit several intricacies, such as sparsity and ill-formed grammatical constructs. Classical TA techniques, developed within the realm of scientific/academic research, often face several difficulties and are too brittle for analyzing these types of texts [49–51]. Therefore, in industrial applications,therobustnessofTAtechniquestendstobefavored overtheirsophistication.Forinstance,latentDirichletallocation (LDA)isconsideredasthecurrentstateoftheartandoneofthe most successful techniques for topic detection from corpora. However,theexperimentsin[16]and[12],conductedover real-lifetexts,revealthatLDAdoesnotperformwell.Consequently, K-Meansclusteringwasemployedfortopicdetectionasitwasmore robustandachievedahighclusteringqualityandhighefficiency

[12]. Similarly, many of the applications reviewed [12,14,16]

preferred stemming over lemmatization as the former was consideredtobemorerobustandlessdependentonthelanguage. 5.4. Desideratum3:Minimalsupervision

Annotatedcorporafortrainingsupervisedtechniquesareoften unavailableordifﬁculttoobtaininindustrialenvironments,and their creationentails several challengesasdiscussed before. To overcome this difﬁculty, industrial applications could rely on minimally supervised techniques or on techniques based on distantsupervision.Thesetechniqueshavealreadybeenappliedto variousTAtasks,suchastermandrelationextraction[49–51]and sentiment analysis [52]. They have been shown to achieve promisingresultswhileeschewingtheneedforannotatedtraining data.

5.5. Desideratum4:Humanintervention

ItisunreasonabletoexpectTAtechniquestoproduceresultsof the same quality as human experts, especially in domains requiringhumanknowledgeandexpertise.Despitetheincreasing adoption of TA techniques, there is still a large number of

10

(10)

applicationsinwhichhumaninterventioniscrucial.Asdescribed in[16],humansshouldbeplacedattheheartofthe(TA)process.In this regard, the study of [12] reveals that users preferred applicationsthatallowedthemtointeractivelyexperimentwith andexploretheresultsanddatasets.

Basedontheapplicationsthatwereviewed,wecouldclassify ‘‘userinteraction’’intothreemaintypes,viz.

1.ForﬁnetuningtheTAprocessandresults,suchasmergingand splittingclustersandvaryingthresholdstocontrolthenumber oftopics andnumberofclustersaswellastoselectrelevant keywords;

2.Forvalidatingresults,especiallyinmissioncriticalapplications, suchasthosepertainingtoaviation safety[16].For instance,

[16]proposesaconfigurationwherebynohumaninterventionis required for aviation safety reports that could be classified (automatically) with a precision of at least 0.95 by the application. Conversely,those reports witha lowerprecision couldbedispatchedtoa humanexpertforfurtherevaluation andvalidation.Suchaconfigurationenablestheexpertstofocus onspecificcasesthatcannotbesatisfactorilyclassifiedbythe application,therebymakingefficientuseoftheirtime; 3.For active learning in textclassification applications.In this

configuration,documentsthatcannotbeconclusivelyclassified bytheapplicationareannotatedbyahumanexpertandusedfor training thetext classification application in later iterations. There are several benefitsassociated withan activelearning strategy.First,itoptimizestheexperts’timebysubmittingto theirjudgmentonlythoseambiguousdocumentsthatcannotbe accurately classified. Second,it alleviates the need for large amounts of training data, which are often not available in industrialcontexts.

Acommon methodtorealize thedesiredinteractivity isvia parameters[12,16].Furthermore,sinceusersappreciatedhavinga degree of control to experiment with parameter values, no attemptsweremadeintheapplicationstolearn,predictoradjust thesevaluesautomatically.

5.6. Desideratum5:Easeofuse

Animportantdesiderata,fundamentaltoachievethedesired levelofhumanintervention,isthatTAapplicationsshouldbeeasy touse.Thisdesiderataisparticularlyimportantconsideringthat layusersinindustrialsettingswillnotbeTAspecialists.Thus,they shouldbepresentedwithapplicationsthataresimpleandeasyto use, without requiring them to tamper with the applications’ internal mechanics.This is not likely in academic TA research, wherebyconsiderationspertaining touserinterface design and ease ofuse are oftensecondary. Research in human–computer interactionandergonomicsproposesseveralmethodsfor enhanc-ing the easeof useof systems [53,54]. In the applicationswe reviewed, we noted 2 main ways to make the proposed TA applicationseasiertouse:

1.ViaaGraphicalUserInterface(GUI),enablinguserstointeract withtheapplicationsin astraightforwardmanner, fore.g.in tuningandexperimentingwithvariousparametersettings; 2.Throughvisualization(ofresults),enablinguserstoanalyze(for

e.g. drill-down)the informationand resultsgeneratedby the applications.Forinstance,intheapplicationdevelopedin[12], clustersarerenderedasnodesinasemanticmap,andarelabeled with their corresponding topics. Also, the nodes’ size is proportionaltothenumberofdocumentscontainedwithinthe clusters,andthenodesarepositionedonthemapsuchthatthe distancebetweenthemcorrespondstothesimilarity between

theirrespectivetopics.Furthermore,nodesarecoloredaccording tothesentimentsexpressedbythetextstheycontain.Similarly, theIRtoolin[16]displaysgroupsofsimilarreports chronologi-callyalongatemporalaxistofacilitatethemonitoringoftrends.

5.7. Desideratum6:Velocity–fastanalysis

TA is often employed in industrial applications to support decision-makingatalllevels,i.e.strategic,tacticalandoperational, ofanenterprise.Therefore,anessentialdesideratumistheability toanalyzepotentiallylargevolumesofdataandgeneratereliable resultsinatimelymanner.Theneedfortimelyresponseiscritical to ensurea high degreeof information freshness, which hasa significant impact on the practical usefulness, relevancy and veracityoftheresultsproducedbytheapplication[55,56].Failure tosatisfythisdesideratumcanleadtotheproductionofinaccurate andoutdatedresults,whicharenotinterestingtousersandcan evenbedetrimentaltodecision-making.Fore.g.aviationincident reportsin[16]havetobeanalyzedintheleasttimepossiblesothat safetyissuescanbedetectedandcorrectivemeasures implemen-tedquickly.Otherapplicationsmayevencallforreal-timeornear real-time processing. For e.g. the search-engine in [15] has to analyze,indexandclassifyjoboffersfromvarioussourcesinanear real-timemannerinordertoprovidethemostaccurateand up-to-date information to users. Similarly, the fleet management applicationin[14]hastoprocesstweetsinreal-timeinorderto acquirethelatesttrafficinformation.

5.8. Desideratum7:Extrinsicevaluation

Thevariousapplicationsthatwereviewedareevaluatedusing widelyacceptedstandardmetrics,originatingfromTA(andmachine learning) research. Examples included the traditional precision, recallandF-score[47].Thesemetrics,nodoubt,provideareliable estimateofperformance.Theymeasuretheintrinsicperformance, i.e. how accurate the TA application is or howefficient it is in performingitstask,e.g.textclassification?However,measuringthe extrinsicperformancehasbeenanunderratedissueinTA,bothin research andpractice. Unliketheir intrinsiccounterparts, which focus on efficiency, extrinsic evaluation measures focus on the effectivenessorusefulnessoftheapplicationinitsusage environ-ment.Thus,amajordesideratumforindustrialapplicationswould bethedefinitionofmetricsforextrinsicevaluation.Onesolutionto fillthislacunacouldbetheadoptionofinformationsystems(IS) success frameworks proposed in organizational management (informationsystems)research.Onesuchframeworkisthe well-knownDeLoneandMcLeanISmodel[57].ItpositsthatISsuccesscan be measured using six interdependent variables as proxies, namely: System Quality, Information Quality, Service Quality, (Intentionto)Use,UserSatisfactionandNetBenefits.Thechallenge is now how toaccurately measure these variables, given their subjectiveandqualitativenature,andtodeterminewhetherthey shouldallbegiventhesameimportance(weights)indifferentTA applications.

6. Futuretrends

Our review has provided some evidence that classical text analytics techniques are still at the heart of many industrial applications.Theseincludetextclassiﬁcation,textclustering,text summarization,andmeasuresfortextsimilarity.Themainstrength ofthesetechniquesforindustrialapplicationsisthattheyarenow consideredasmaturetechnologiesbyvirtueoftheirlongstanding historyintheTAcommunityandofthesigniﬁcantattentionthat they have received over the years in various research efforts.

(11)

Furthermore,theyareconsideredsufﬁcientlyrobusttodealwiththe intricaciesoftextsgeneratedinreal-worldindustrialapplicationsas discussedinSection4.Inaddition,anotherstrengthoftheseclassical techniquesisthattheygenerateresultsthatareeasytointerpretand thatcandirectlybeexploitedforinformeddecision-making.

However,besidethedevelopmentoftheclassicaltechniques, someofwhichwerealreadywellunderstoodanddevelopedtwo decadesago,many newdevelopments haveemergedthat have startedtoplayaroleinindustrialapplications,orcouldplayarole inthenear future. In particular,recent years havewitnessed a revivalinneural-network-basedmethods,suchasdeep-learning neuralnetworks[58,59]thathaveshowntremendouspotentialin theprocessingofstreamingmedia(speech,video)aswellasstill images. All majorcommercial speech recognition systems (e.g. MicrosoftCortana,Xbox,SkypeTranslator,GoogleNow,AppleSiri, Baiduand iFlyTek voice search,and a range of Nuance speech products,etc.)nowadaysarebasedondeeplearning.

Work on deeplearningof textprocessing islaggingbehind, possibly because deep-learning methods are less suited to symbolic data than they are for sub-symbolic input streams. Initialsuccesseshavebeenreported,however.Forexample,Chen et al. [60] propose a recurrent deep learning method for dependency parsing. Sutskever et al. [61] demonstrate how sequence-to-sequenceproblems(suchasmachinetranslationor paraphrasing)inNLPcanbeformulatedandlearnedthroughdeep learning methods. Schocher et al. [62] show that sentiment classificationcanbelearnedwithdeeplearningmodelsbyusing wordembeddingsasaninputlayertoarecursiveneuralnetwork (RNN).Inthefieldofnaturallanguagegeneration,deeplearning hasalsobeenshowntobesuccessfulthroughthemodellingof semanticconstraints[63].Maetal.[64]proposeadeeplearning approachforsentenceembedding,andshowthattheirmethodcan achievestate-of-the-art resultsformany sentenceclassification problemssuchassentimentclassificationandquestion classifica-tion.Alsointhedomainofinformationretrieval,recentworkhas demonstrated how deep learning models can be applied to learningasimilarityscorebetweentwotextsegments[65,66].

AnotherrecentdevelopmentintheNLPfieldistoimproveNLPor text analytics systems by using so-called word embeddings, distributedwordrepresentationproducedbydimension-reduction techniquesmostofwhichhaveaneural-networkinterpretationor implementation,suchasword2vec[67]orGloVe[68].Therelative easeofusingwordembeddingssoftwareandoutput,incontrastto therelativelydemandinginfrastructurerequiredfordeeplearning (e.g. high-performance computing, GPU hardware), makes the methodeasytointegrateintoindustrialapplications;intermsofour desiderata,theyareeasytouse,requireminimalsupervision,are robust,and are flexiblewithrespect tolanguage anddata.One exampleisprovidedbyTosiketal.[69]inthecontextofinformation extraction from curriculum vitae documents in German. Word embeddings,replacinglexicalfeaturesinamachine-learning-based sequenceparser,causetheparsertobemoreaccurateindetecting domain-specificnamedentities,especiallyinout-of-sampledata.

Wordembeddingsanddeep-learningmethodsarealsocentral toanewtrendinNLPtojointlymodelobjectrecognitioninimages and generating texts that describe the scene in those images

[71]. Asmany industriesworkwithimages and video streams, oftencombinedwithspeechortextinsimultaneousstreams,the automatization of generating keywords, captions, or subtitles along with image data offers interesting outlooks for serving enrichedcontenttousers.

Yet, in industry theuptake of word embeddings and deep-learningmethods is still in itsinfancy. Early successessuchas thosereportedin[69]arelikely toseedfurther andpotentially widespreaddeploymentinreal-lifeindustrialapplications.Onthe other hand, typical industrial demands for ease of use of the

methodsmaylimittheuptakeofthesemethods,whichrelyonbig data and partly on high-performance computing,of which the operationisfarfromtrivial,andwhicharenottransparentintheir internal workings. Furthermore, classical methods offer strong baselinesthatthesenewmethodsarenotguaranteedtosurpass. Key in this development will likely be the research and developmentdepartmentsofthegrowingtextanalyticsindustry. References

[1]S.Filippov,MappingTextandDataMininginAcademicandResearch CommunitiesinEurope,TechnicalReport, 2014(accessed19.08.15).

[2]D.Harris,Netﬂixusesdataforalotmorethanjustrecommendations, 2014,

https://gigaom.com/2014/06/12/

netﬂix-uses-data-for-a-lot-more-than-just-recommendations/ (accessed 19.08.15).

[3]X.Amatriain,NetflixRecommendations:Beyondthe5Stars, 2012,http:// techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html/

(accessed19.08.15).

[4]J.Best,IBMWatson:theinsidestoryofhowtheJeopardy-winning supercomputerwasborn,andwhatitwantstodonext, 2014,http://www. techrepublic.com/article/ibm-watson-the-inside-story-of-how-the-jeopardy-winning-supercomputer-was-born-and-what-it-wants-to-do-next/ (accessed 19.08.15).

[5]L.Dignan,IBMWatson’snextadventure:healthcarewithNuance, 2011,

http://www.zdnet.com/article/

ibm-watsons-next-adventure-healthcare-with-nuance/ (accessed19.08.15). [6]E.Guizzo,IBM’sWatsonJeopardyComputerShutsDownHumansinFinal

Game, 2011,http://spectrum.ieee.org/automaton/robotics/

artiﬁcial-intelligence/ibm-watson-jeopardy-computer-shuts-down-humans/

(accessed19.08.15).

[7]TheBankofEnglandminessocialmediainbidtopredicteconomicevents aheadoftime, 2015,http://www.out-law.com/en/articles/2015/august/ the-bank-of-england-mines-social-media-in-bid-to-predict-economic-events-ahead-of-time/ (accessed19.08.15).

[8]IBMWatsonHardAtWork:NewBreakthroughsTransformQualityCarefor Patients, 2013,https://www.mskcc.org/press-releases/

ibm-watson-hard-work-new-breakthroughs-transform-quality-care-patients/

(accessed19.08.15).

[9]WatsoninFinances,http://www-05.ibm.com/innovation/uk/watson/ watson_in_ﬁnance.shtml/(accessed19.08.15).

[10]DBSBankEngagesIBM’sWatsontoAchieveNextGenerationClient Experience,https://www-03.ibm.com/press/us/en/pressrelease/42868.wss/

(accessed19.08.15).

[11]StandardBankdeploysIBM’sWatsontocrunchcustomerdata, 2014,http:// www.ﬁnextra.com/news/fullstory.aspx?newsitemid=26624/ (accessed19.08.15). [12]P.Greiner,S.Evert,F.Baigger,B.Lang,Adistributionalapproachtoopen

questionsinmarketresearch,Comput.Ind.(2015).

[13]E.Lloret,M.Palomar,Analysingandevaluatingthetaskofautomatictweet generation:knowledgetobusiness,Comput.Ind.(2015).

[14]F.C.Albuquerque,M.A.Casanova,H.Lopes,L.R.Redlich,J.A.F.deMacedo,M. Lemos,M.T.M.deCarvalho,C.Renso,Amethodologyfortrafﬁc-relatedtwitter messagesinterpretation,Comput.Ind.(2015).

[15]S.Schmidt,S.Schnitzer,C.Rensing,Textclassificationbasedfiltersfora domain-specificsearchengine,Comput.Ind.(2015).

[16]L.Tanguy,N.Tulechki,A.Urieli,E.Hermann,C.Raynal,Naturallanguage processingforaviationsafetyreports:fromclassiﬁcationtointeractive analysis,Comput.Ind.(2015).

[17]J.W.Chang,M.C.Lee,T.I.Wang,Integratingasemantic-basedretrievalagent intocase-basedreasoningsystems:acasestudyofanonlinebookstore, Comput.Ind.(2015).

[18]P.Martı´nez,J.L.Martı´nez,I.Segura-Bedmar,J.Moreno-Schneider,A.Luna,R. Revert,Turningusergeneratedhealth-relatedcontentintoactionable knowledgethroughtextanalyticsservices,Comput.Ind.(2015).

[19]S.Bird,NLTK:thenaturallanguagetoolkit,in:ProceedingsoftheCOLING/ACL onInteractivePresentationSessions,AssociationforComputationalLinguistics, 2006,pp.69–72.

[20]M.F.Porter,Snowball:ALanguageforStemmingAlgorithms, 2001.

[21]J.R.Bellegarda,Latentsemanticmapping:principles&applications,Synthesis LecturesonSpeechandAudioProcessing,vol.3, 2007,pp.1–101.

[22]S.Evert,T.Proisl,P.Greiner,B.Kabashi,SentiKLUE:updatingapolarity classiﬁerin48hours,in:SemEval2014,2014,p.551.

[23]T.Dunning,Accuratemethodsforthestatisticsofsurpriseandcoincidence, Comput.Linguist.19(1993)61–74.

[24]E.Lloret,M.Palomar,Towardsautomatictweetgeneration:acomparative studyfromthetextsummarizationperspectiveinthejournalismgenre, ExpertSyst.Appl.40(2013)6624–6630.

[25]FreeLing3.1,http://nlp.lsi.upc.edu/freeling/(accessed19.08.15). [26]E.Motta,E.Fernandes,R.Milidiu´,F-ext-2.0:awebservicefornatural

languageprocessing,in:PROPOR,2010,27–30.

[27]J.Platt,etal.,Fasttrainingofsupportvectormachinesusingsequential minimaloptimization,AdvancesinKernelMethods:SupportVectorLearning, vol.3, 1999.