HAL Id: hal-02390399
https://hal.archives-ouvertes.fr/hal-02390399
Submitted on 4 Dec 2019
Distributed under a Creative Commons Attribution - NonCommercial - NoDerivatives 4.0 International License
Code2vect: An efficient heterogenous data classifier and
nonlinear regression technique
Clara Argerich Martin, Rubén Ibáñez Pinillo, Anaïs Barasinski, Francisco
Chinesta
Clara Argerich Martín a, Ruben Ibáñez Pinillo a, Anais Barasinski b, Francisco Chinesta c,∗
a PIMM, Arts et Métiers Institute of Technology, CNRS, CNAM, HESAM University, 151, boulevard de l’Hôpital, 75013 Paris, France
b University of Pau & Pays Adour, E2S UPPA, IPREM UMR 5254, 64000 Pau, France
c ESI GROUP Chair @ PIMM, Arts et Métiers Institute of Technology, 151, boulevard de l’Hôpital, 75013 Paris, France
Article info
Article history: Received 20 May 2019; Accepted 1 August 2019; Available online 15 November 2019
Keywords: Machine learning; Data representation; Classification; Categorical data; Neural network; High-dimensional data; Regression

Abstract
This paper presents a new classification and regression algorithm based on artificial intelligence. The main feature of this algorithm, called Code2Vect, is the nature of the data it treats: qualitative or quantitative, continuous or discrete. Contrary to other artificial intelligence techniques based on “Big Data”, this approach works with a reduced amount of data, within the so-called “Smart Data” paradigm. Its main purpose is to enable the representation of high-dimensional data and, more specifically, to group and visualize these data according to a given target. To that end, the data are projected into a vector space equipped with an appropriate metric, able to group data according to their affinity (with respect to a given output of interest). A second application of the algorithm lies in its prediction capability: as with most common data-mining techniques, such as regression trees, the output is inferred from a given input, here for data of the heterogeneous nature described above. To illustrate its potential, two applications are addressed, one concerning the representation of high-dimensional categorical data and another featuring the prediction capabilities of the algorithm.
©2019 Académie des sciences. Published by Elsevier Masson SAS. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
1. Introduction
The use of artificial intelligence to classify, represent, or analyze data is a field that is attracting more and more interest from both the scientific and technological viewpoints. Many techniques for this purpose have been developed in recent years, such as Locally Linear Embedding, presented in [1] and widely used in diverse applications [2]. t-SNE (t-distributed Stochastic Neighbor Embedding), presented in [3], is widely used for visualizing high-dimensional data and discovering the latent manifold in which the data are embedded, whose dimensionality indicates the number of uncorrelated explanatory parameters. Other algorithms were designed for extracting hidden regressions from data, and then inferring a response (output) for a given input. Linear and nonlinear regressions have been widely considered in a variety of works and applications, and decision trees and random forests were proposed for operating in high-dimensional spaces. A review of these techniques is given in [4], and an application using random forests is addressed in [5]. Regression techniques operate quite well even in the low-data limit, in particular the ones based on sparse sensing combined with the use of separated representation formats. When data are abundant, the most powerful learning techniques are the ones based on neural networks, giving rise to so-called deep learning (see [6] and [7] for general knowledge on deep learning).

∗ Corresponding author.
E-mail addresses: [email protected] (C. Argerich Martín), [email protected] (R. Ibáñez Pinillo), [email protected] (A. Barasinski), [email protected] (F. Chinesta).
https://doi.org/10.1016/j.crme.2019.11.002

Fig. 1. Input space (left) and target vector space (right).
One of the most common applications of deep learning is visual recognition and image classification [8,9]. Deep learning techniques are also used for probabilistic language modeling [10]. Interested readers can refer to [11] and the numerous references therein.
One of the most recent algorithms developed for heterogeneous data classification is the so-called Word2Vec, developed by Google Inc. This algorithm, presented in [12], is based on the use of neural networks [13] and on the representation of each word as a vector representing the probability of having a given neighboring word. This approach tries to circumvent the difficulty of defining distances between qualitative data: one could consider that yellow and red are quite close because both represent colors; however, from the spelling of both words such a proximity is difficult to quantify.
In fields like manufacturing engineering such an approach is particularly relevant, because the performance of, for example, a manufactured plastic part depends on the resin used (defined by its commercial name or by its chemical nature, also codified but differently), the oven temperature (in a particular unit system), the curing time, and the name of the employee (e.g., Peter, Paul, ...).
Now, a process could be characterized by these four parameters: (i) resin, (ii) temperature, (iii) process time, and (iv) employee name; and with each process we could associate a label corresponding to a particular output of interest, a value quantifying a particular performance of the formed part.
Thus, each process could be described by a point in a four-dimensional space, each point having a color depending on the value of the output of interest. The four axes represent the heterogeneous parameters, appropriately codified. For example, a numeric code must be associated with each resin or operator name (e.g., polyester = 1, vinylester = 2, ...; Peter = 1, Paul = 2, ...) in order to represent qualitative data on the resin or operator axes. However, these choices are totally arbitrary, as are the units of the different quantitative data (temperature and process time in our example). Thus, calculating distances between processes in such a space makes no sense. In other words, two points representing two processes could be very close in the four-dimensional space but have very different outputs of interest and, conversely, remote points could imply very similar performances.
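The arbitrariness of such codings can be illustrated with a minimal sketch. The two process records, the code tables `coding_a` and `coding_b`, and the helper names are hypothetical and chosen only for illustration; the point is that the Euclidean distance between the same two processes changes by more than an order of magnitude when the numeric labels are reassigned.

```python
import math

# Hypothetical processes: (resin, temperature, process time, operator).
p1 = ("polyester", 180.0, 30.0, "Peter")
p2 = ("vinylester", 180.0, 30.0, "Paul")

def encode(process, resin_codes, operator_codes):
    """Turn a heterogeneous record into numbers using arbitrary label codes."""
    resin, temperature, time, operator = process
    return [resin_codes[resin], temperature, time, operator_codes[operator]]

def distance(a, b):
    """Plain Euclidean distance between two encoded records."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

# Two equally arbitrary codings of the same qualitative values (assumed).
coding_a = ({"polyester": 1, "vinylester": 2}, {"Peter": 1, "Paul": 2})
coding_b = ({"polyester": 1, "vinylester": 50}, {"Peter": 7, "Paul": 3})

d_a = distance(encode(p1, *coding_a), encode(p2, *coding_a))
d_b = distance(encode(p1, *coding_b), encode(p2, *coding_b))
print(d_a, d_b)
```

Nothing about the processes changed between the two computations, only the labels, which is why a distance in the raw representation space carries no meaning.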
This issue is not new. Many authors, when considering heterogeneous but quantitative data, normalize them to minimize the impact of units on the computation of distances (for example, typical stresses and strains in solid mechanics differ by about 12 orders of magnitude, the former being of order $10^{6}$ and the latter $10^{-6}$). However, normalization hides the issue without really solving it.
In what follows, we propose a technique, sketched in Fig. 1, for mapping points from a representation space to a target space equipped with a Euclidean metric, which allows distances to be quantified and all the numerical machinery based on distances to be applied, as considered later.
2. Classification and regression strategies
In this section, the methodology sketched in Section 1 is described. We assume that the points in the original space (the space of representation) consist of $P$ arrays composed of $N$ entries (four in the example considered above). They are called arrays because they cannot be considered vectors, and are denoted by $\mathbf{y}_i$. Their images in the vector space are denoted by $\mathbf{x}_i \in \mathbb{R}^d$, with $d \ll N$, this time real vectors subject to the rules of coordinate transformation. That vector space is equipped with the standard scalar product and the associated Euclidean distance. The mapping is described by the $d \times N$ matrix $\mathbf{W}$,

$$\mathbf{x} = \mathbf{W}\mathbf{y}, \tag{1}$$

where both the components of $\mathbf{W}$ and the images $\mathbf{x}_i \in \mathbb{R}^d$, $i = 1, \dots, P$, must be calculated. Each point $\mathbf{x}_i$ keeps the label, denoted by $O_i$ (the value of the output of interest, here assumed scalar), associated with its origin point $\mathbf{y}_i$. We would like to place the points $\mathbf{x}_i$ such that the Euclidean distance to each other point $\mathbf{x}_j$ scales with the difference of their outputs, i.e.

$$(\mathbf{W}(\mathbf{y}_i - \mathbf{y}_j)) \cdot (\mathbf{W}(\mathbf{y}_i - \mathbf{y}_j)) = \|\mathbf{x}_i - \mathbf{x}_j\|^2 = |O_i - O_j|, \tag{2}$$

where the coordinates of one of the points can be arbitrarily chosen. Thus, there are $\frac{P^2}{2} - P$ relations for determining the $d \times N$ unknowns (the components of $\mathbf{W}$, because the coordinates $\mathbf{x}_i$ are obtained from $\mathbf{W}$ by applying relation (1)).

Linear mappings are limited and do not allow operating in nonlinear settings. Thus, a better choice consists of the nonlinear mapping $\mathbf{W}(\mathbf{y})$, expressible in the general polynomial form

$$\mathbf{W}(\mathbf{y}) = \sum_{k=1}^{K} \mathbf{W}_k P_k(\mathbf{y}), \tag{3}$$

where the matrices $\mathbf{W}_k$ have the same dimensions as $\mathbf{W}$ and must ensure the $\mathbf{y}_j \to \mathbf{x}_j$ mapping. On the other hand, the interpolation functions $P_k(\mathbf{y})$ can be continuous or discontinuous. In the numerical applications described later, standard polynomial or radial functions are considered.
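As an illustration of the linear case, the pairwise relation between mapped distances and output differences can be fitted in a least-squares sense. The sketch below is an assumption-laden stand-in, not the authors' solver: it uses plain gradient descent with step halving instead of Newton's method, the toy data `Y` and `O` are hypothetical, and all function names are invented for the example.

```python
import itertools
import random

def pair_terms(W, Y, O):
    """For every pair i < j return (residual, W (y_i - y_j), y_i - y_j)."""
    terms = []
    for i, j in itertools.combinations(range(len(Y)), 2):
        delta = [a - b for a, b in zip(Y[i], Y[j])]
        z = [sum(w * v for w, v in zip(row, delta)) for row in W]
        residual = sum(c * c for c in z) - abs(O[i] - O[j])
        terms.append((residual, z, delta))
    return terms

def loss(W, Y, O):
    """Sum of squared residuals of the pairwise distance relation."""
    return sum(r * r for r, _, _ in pair_terms(W, Y, O))

def gradient(W, Y, O):
    """Gradient of the loss with respect to each entry of W."""
    G = [[0.0] * len(W[0]) for _ in W]
    for r, z, delta in pair_terms(W, Y, O):
        for a, z_a in enumerate(z):
            for b, delta_b in enumerate(delta):
                G[a][b] += 4.0 * r * z_a * delta_b
    return G

def fit_linear_map(Y, O, d=2, iters=200, seed=0):
    """Gradient descent with step halving (assumed stand-in for Newton's method)."""
    rng = random.Random(seed)
    W = [[rng.uniform(-0.5, 0.5) for _ in Y[0]] for _ in range(d)]
    history = [loss(W, Y, O)]
    step = 1.0
    for _ in range(iters):
        G = gradient(W, Y, O)
        while step > 1e-12:
            trial = [[w - step * g for w, g in zip(rw, rg)] for rw, rg in zip(W, G)]
            if loss(trial, Y, O) < history[-1]:
                W, step = trial, 2.0 * step
                break
            step *= 0.5
        history.append(loss(W, Y, O))
    return W, history

# Hypothetical toy data: P = 6 arrays with N = 3 codified entries, scalar outputs.
Y = [[-1.0, -1.0, 0.2], [-1.0, 1.0, -0.4], [0.0, -1.0, 0.9],
     [0.0, 1.0, -0.7], [1.0, -1.0, 0.1], [1.0, 1.0, 0.5]]
O = [0.0, 0.2, 0.5, 0.6, 1.0, 0.9]

W, history = fit_linear_map(Y, O)
X = [[sum(w * v for w, v in zip(row, y)) for row in W] for y in Y]  # x_i = W y_i
print(history[0], history[-1])
```

A linear map generally cannot satisfy every pairwise relation exactly, so the final loss stays above zero; this is one way to see why the nonlinear form is introduced.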
The associated nonlinear problem can be efficiently solved by using an adequate linearization strategy, e.g., Newton's method, as presented in Appendix A. Because of the increase in the number of unknowns to be determined, a rich enough sampling is required.
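To make the growth in unknowns concrete, the following sketch evaluates a nonlinear mapping built from radial functions and counts unknowns against available relations. The Gaussian form, the shape parameter `eps`, the centers, the value K = 3 and every function name are assumptions made for illustration; the Newton solve itself is not shown.

```python
import math

def gaussian_rbf(y, center, eps=1.0):
    """Gaussian radial function of the distance between y and a center."""
    r = math.dist(y, center)
    return math.exp(-(eps * r) ** 2)

def nonlinear_map(y, W_list, centers, eps=1.0):
    """Return W(y) = sum_k W_k P_k(y) and the image x = W(y) y."""
    d, N = len(W_list[0]), len(y)
    W = [[0.0] * N for _ in range(d)]
    for W_k, center in zip(W_list, centers):
        p_k = gaussian_rbf(y, center, eps)
        for a in range(d):
            for b in range(N):
                W[a][b] += p_k * W_k[a][b]
    x = [sum(W[a][b] * y[b] for b in range(N)) for a in range(d)]
    return W, x

def unknowns_vs_relations(N, d, K, P):
    """Count the N*d*K unknowns and the P^2/2 - P relations quoted in the text."""
    return N * d * K, P * P / 2 - P

# Hypothetical example: one radial function centered on the evaluation point.
W_y, x = nonlinear_map([1.0, 2.0], [[[1.0, 0.0], [0.0, 1.0]]], [[1.0, 2.0]])

# N = 6, d = 2 and P = 53 match the first numerical example; K = 3 is assumed.
unknowns, relations = unknowns_vs_relations(N=6, d=2, K=3, P=53)
print(x, unknowns, relations)
```

Each additional term in the expansion adds N × d unknowns, so the number of sampled processes bounds how rich the approximation can be.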
The choice of $d$ deserves some comments. The lower $d$ is, the stronger the nonlinearity of the mapping becomes; however, the visualization becomes simpler.
When the dimensionality $N$ increases, approximations suffer from the so-called curse of dimensionality. A very efficient way of alleviating it consists in using separated representations [14], from which the mapping reads

$$\mathbf{W}(\mathbf{y}) = \sum_{k=1}^{K} \mathbf{W}_k \prod_{i=1}^{N} P_i^k(y_i). \tag{4}$$

In the previous expression, $y_i$ refers to the $i$-th component of $\mathbf{y}$.
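A separated representation of this kind can be evaluated with a few lines of code. The sketch below assumes each mode is a product of one-dimensional functions, one per input component; the modes, the matrices in `W_list` and the function name `separated_map` are hypothetical.

```python
import math

def separated_map(y, W_list, basis):
    """Evaluate W(y) = sum_k W_k * prod_i P_i^k(y_i).

    basis[k][i] is the one-dimensional function P_i^k acting on component y_i.
    """
    d, N = len(W_list[0]), len(W_list[0][0])
    W = [[0.0] * N for _ in range(d)]
    for W_k, funcs in zip(W_list, basis):
        # Product of the one-dimensional functions of mode k.
        weight = math.prod(f(y_i) for f, y_i in zip(funcs, y))
        for a in range(d):
            for b in range(N):
                W[a][b] += weight * W_k[a][b]
    return W

# Hypothetical modes for N = 2 inputs and d = 1: a constant mode and a polynomial one.
W_list = [[[1.0, 0.0]], [[0.0, 2.0]]]
basis = [
    [lambda t: 1.0, lambda t: 1.0],
    [lambda t: t, lambda t: t ** 2],
]
W_at_y = separated_map([3.0, 2.0], W_list, basis)
print(W_at_y)
```

Only one-dimensional functions have to be stored and evaluated, so the cost grows linearly with the number of input components rather than exponentially.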
The use of these separated representations constitutes a work in progress.
3. Numerical examples
This section addresses two case studies, the first devoted to the classification of heterogeneous data and the second aiming at emphasizing the classification and nonlinear regression capabilities of the technique just introduced.
3.1. Representation of high-dimensional categorical data from manufacturing processes
In this section, an application with real and highly nonlinear data is presented. We consider a database provided by industry, related to a real manufacturing process. The input space consists of six manufacturing parameters collected during 53 realizations of the process: four categorical parameters (material, supplier, and two process parameters), whose quantification is shown in Table 1, and two other quantitative process parameters. The output is the stability of the process itself, also quantitatively codified as shown in Table 1. Words are codified by using numbers in the $[-1, 1]$ interval. In this particular case the input data, $\mathbf{y}_i$, $\forall i$, are grouped in a $6 \times 53$ matrix (53 realizations of the process defined by 6 input parameters), with a single scalar target. The images $\mathbf{x}_i$ in the vector space form a $2 \times 53$ matrix (53 realizations mapped into a two-dimensional space).

Fig. 2. Residue and tolerance related to the computation of mapping W.
Fig. 3. Vector space representations: output, material, and supplier.
Once the input data and the output are quantitatively codified, we run the algorithm described in the previous section, in particular its nonlinear counterpart, which reaches convergence in a few iterations (indicated by index $i$), as Fig. 2 shows, where the tolerance, defined by

$$\mathrm{Tolerance} = \log \frac{\|\mathbf{W}_i - \mathbf{W}_{i-1}\|}{\|\mathbf{W}_{i-1}\|}, \tag{5}$$

is depicted at each iteration, together with the residue of the Newton iteration (8).
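The tolerance indicator can be computed from two successive iterates of the mapping. In this sketch the Frobenius norm and the natural logarithm are assumptions, since the text specifies neither the norm nor the base of the logarithm; the function names and the sample matrices are hypothetical.

```python
import math

def frobenius(M):
    """Frobenius norm of a matrix stored as a list of rows (assumed norm)."""
    return math.sqrt(sum(v * v for row in M for v in row))

def tolerance(W_new, W_old):
    """Logarithm of the relative change between two successive iterates."""
    diff = [[a - b for a, b in zip(rn, ro)] for rn, ro in zip(W_new, W_old)]
    return math.log(frobenius(diff) / frobenius(W_old))

# Hypothetical iterates: the relative change is 0.5 / 5.
tol = tolerance([[3.0, 4.5]], [[3.0, 4.0]])
print(tol)
```

The value becomes more negative as successive iterates get closer, so a decreasing curve indicates convergence.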
Fig. 3(a) depicts the mapped samples in the vector space, that is $\mathbf{x}_i$, where the 53 points representing the realizations of the manufacturing process have been colored according to their output. As the reader can appreciate, two well-differentiated clusters are obtained, one containing all the stable processes and the other grouping non-stable conditions, with only two points outside the clusters. It is important to note that better results can be obtained by increasing both the sampling and the richness of the approximation of the nonlinear mapping, that is, $K$ in the approximation (3).

Other than clustering with respect to the output, or using the computed mapping for inferring stability for new process conditions, one could also try to extract the relevance of the different parameters for the output. For that, it suffices to color the mapped points according to the value of the considered parameter. Figs. 3(b) and 3(c) report this analysis for material and supplier, respectively. From these images it can be concluded that materials F and D and supplier D exhibit stability. On the contrary, supplier C compromises process stability.
The same analysis was performed by using a neural network (NN) with a hidden layer composed of two neurons, whose training was ensured by the data $\mathbf{y}_i$ and the associated stability outputs. The NN is shown in Fig. 4, with 53 data consisting of
Fig. 5. NN-based model.
Fig. 6. Acoustic absorbers.
superiority of one procedure with respect to the other (because of the large number of NN variants), we can conclude on the simplicity and efficiency of the Code2Vect proposal.
3.2. Model and regression construction
We consider acoustic systems composed of acoustic resonators, whose simplest structure, the Helmholtz resonator (HR), is depicted in Fig. 6a and involves three main geometric parameters: length ($L$), section ($A$), and volume ($V$). HRs can be combined to create more complex acoustic structures, as illustrated in Fig. 6b, whose behavior is governed by the dynamic system $\mathbf{M}\ddot{\boldsymbol{\xi}} + \mathbf{C}\dot{\boldsymbol{\xi}} + \mathbf{K}\boldsymbol{\xi} = \mathbf{f}$ (where $\boldsymbol{\xi}$ defines the air motion amplitude at every bottleneck), which was deeply analyzed in [15] and from which the resonance frequencies can be extracted. Without loss of generality, the acoustic system considered here has $L_1 = L_2 = L_3$, $A_1 = A_2 = A_3$ and $V_1 = V_2$, reducing the number

Fig. 7. Vector space colored with respect to the parameter values (top) and output of interest (bottom).
were considered and the two resonance frequencies computed, both constituting the output to be predicted by the sought data-based model.
Results obtained when applying the Code2Vect technique are depicted in Fig. 7, where the two images at the bottom represent the vector space with samples colored according to the associated outputs. In the present case, the scalar output considered in the mapping construction (Eq. (8)) consists of the norm of the vectors containing both frequencies, i.e., with $\mathbf{t}_i = (\omega_1^i, \omega_2^i)$, $|O_i - O_j| = \|\mathbf{t}_i - \mathbf{t}_j\|$.

From Fig. 7, we can conclude that length and volume ($L$ and $V$) are inversely correlated with the output because, for example, the yellow color concentrates at the bottom when representing the output, whereas it is located at the top when referring to these parameters. The opposite occurs when considering the cross-section area ($A$). These findings are in perfect agreement with the known physics of a Helmholtz resonator, whose resonance frequency $\omega_r$ is given by

$$\omega_r = \frac{c}{2\pi}\sqrt{\frac{A}{V L}}, \tag{6}$$

where $c$ is the speed of sound.

Now, with the model $\mathbf{W}(\mathbf{y})$ relating geometrical parameters and resonance frequencies extracted, its predictive capabilities are tested. For that purpose, the model is extracted by using only 60% of the 64 geometrical variants, and the rest of the samples (the remaining 40%, $\mathbf{y}_j$) are used for validating the predictions: the output at $\mathbf{x}_j = \mathbf{W}(\mathbf{y}_j)\mathbf{y}_j$ is interpolated from the outputs $\mathbf{t}_i$ at the neighboring samples $\mathbf{x}_i$. The predictions are depicted in Fig. 8: in yellow, the samples used for training and, in blue, the predicted ones. The deviation with respect to the diagonal represents the prediction error which is, as expected, very low for the samples used in the training and a bit larger, but lower than 4%, for the others.
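The prediction step, in which the output of a new sample is interpolated from its neighbors in the target space, can be sketched as follows. The interpolation scheme is not specified in the text, so inverse-distance weighting over the `k` nearest training points is an assumption, as are the function name and the toy data.

```python
import math

def predict(x_new, X_train, T_train, k=3):
    """Inverse-distance-weighted average of the k nearest training outputs (assumed scheme)."""
    nearest = sorted(
        (math.dist(x_new, x), t) for x, t in zip(X_train, T_train)
    )[:k]
    if nearest[0][0] == 0.0:
        # The new point coincides with a training point: return its output.
        return nearest[0][1]
    weights = [1.0 / dist for dist, _ in nearest]
    return sum(w * t for w, (_, t) in zip(weights, nearest)) / sum(weights)

# Hypothetical mapped training samples and their scalar outputs.
X_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
T_train = [1.0, 2.0, 3.0, 100.0]

on_sample = predict([0.0, 0.0], X_train, T_train)
midway = predict([0.5, 0.0], X_train, T_train, k=2)
print(on_sample, midway)
```

Such an interpolation is only meaningful because distances in the target space were constructed to reflect output differences; the same weighting applied in the raw representation space would inherit the arbitrariness of the codings.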
From the discussion above, one expects that the best model is the one relating the combined parameter $A/(VL)$ with the resonance frequencies, instead of the one considered above, which took the geometrical parameters individually, that is, $L$, $A$, and $V$. In order to prove the superiority of modeling with the best parametrization, Fig. 9 compares the previous prediction with the one obtained by using the combined parameter $A/(VL)$, which produced almost perfect predictions.

However, the physics behind the data is not always known to inform the best choice of parameters. A procedure for accessing those most relevant combined parameters constitutes one of our main research activities, still in progress. For the moment, and in the absence of a priori information, parameters will be selected and incorporated individually in the model.
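The role of the combined parameter can be checked directly on the formula for a single Helmholtz resonator. In this sketch the geometries are hypothetical, SI units are assumed, and the speed of sound `c = 343.0` m/s is an assumed value for air; it shows that two different geometries sharing the same ratio of section to volume times length resonate at the same frequency, and that each individual parameter acts in the direction discussed above.

```python
import math

def helmholtz_frequency(A, V, L, c=343.0):
    """Resonance frequency (c / 2 pi) * sqrt(A / (V L)); c is an assumed speed of sound."""
    return c / (2.0 * math.pi) * math.sqrt(A / (V * L))

# Two hypothetical geometries sharing the same combined parameter A / (V L).
f_small = helmholtz_frequency(A=1.0e-4, V=1.0e-3, L=0.05)
f_large = helmholtz_frequency(A=2.0e-4, V=2.0e-3, L=0.05)

# Changing one parameter at a time shows the direction of each dependence.
f_more_volume = helmholtz_frequency(A=1.0e-4, V=2.0e-3, L=0.05)
f_more_length = helmholtz_frequency(A=1.0e-4, V=1.0e-3, L=0.10)
f_more_section = helmholtz_frequency(A=2.0e-4, V=1.0e-3, L=0.05)
print(f_small, f_large, f_more_volume, f_more_length, f_more_section)
```

A model fed with the three parameters separately has to rediscover this one-dimensional dependence from data, which is consistent with the combined parametrization giving better predictions.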
4. Conclusions
Fig. 8. Prediction of the target.
Fig. 9. Predicting the output.
space that is kept low dimensional for facilitating visualization. It allows supervised classification but also prediction, owing to its nonlinear regression nature. Moreover, it enables a very simple representation emphasizing the relevance of the parameters involved in the model. Its performance in the low-data limit has been demonstrated.
Acknowledgements
The authors gratefully acknowledge the ANR (“Agence nationale de la recherche”, France) for its support through project AAPG2018 DataBEST, AIRBUS Group through the PhD CIFRE program, and ESI Group through the ESI-ENSAM Chair program.
Appendix A. Nonlinear modeling
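By denoting $\mathbf{W}_i = \mathbf{W}(\mathbf{y}_i)$, Newton's method proceeds at iteration $n$ according to

$$\mathbf{W}_{n+1,i} = \mathbf{W}_{n,i} + \Delta\mathbf{W}_i, \tag{7}$$

which, replaced in the nonlinear counterpart of Eq. (2),

$$(\mathbf{W}_i\mathbf{y}_i - \mathbf{W}_j\mathbf{y}_j) \cdot (\mathbf{W}_i\mathbf{y}_i - \mathbf{W}_j\mathbf{y}_j) = \|\mathbf{x}_i - \mathbf{x}_j\|^2 = |O_i - O_j|, \tag{8}$$

gives a left-hand member that reads

$$\begin{aligned}
&\left(\mathbf{y}_i^T(\mathbf{W}_{n,i} + \Delta\mathbf{W}_i)^T - \mathbf{y}_j^T(\mathbf{W}_{n,j} + \Delta\mathbf{W}_j)^T\right)\left((\mathbf{W}_{n,i} + \Delta\mathbf{W}_i)\mathbf{y}_i - (\mathbf{W}_{n,j} + \Delta\mathbf{W}_j)\mathbf{y}_j\right) \\
&\quad = \mathbf{y}_i^T(\mathbf{W}_{n,i} + \Delta\mathbf{W}_i)^T(\mathbf{W}_{n,i} + \Delta\mathbf{W}_i)\mathbf{y}_i + \mathbf{y}_j^T(\mathbf{W}_{n,j} + \Delta\mathbf{W}_j)^T(\mathbf{W}_{n,j} + \Delta\mathbf{W}_j)\mathbf{y}_j \\
&\qquad - 2\,\mathbf{y}_j^T(\mathbf{W}_{n,j} + \Delta\mathbf{W}_j)^T(\mathbf{W}_{n,i} + \Delta\mathbf{W}_i)\mathbf{y}_i.
\end{aligned} \tag{9}$$

Since each term appearing in Eq. (9) presents the same structure, we develop the linearization for one of these terms:

and similarly for the other terms in Eq. (9).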
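The linearization mentioned above can be sketched for the first term of Eq. (9). This is a reconstruction under two stated assumptions, namely that the Newton increment is written $\Delta\mathbf{W}_i$ and that the term quadratic in the increment is neglected; it is not necessarily the expression used by the authors.

```latex
\mathbf{y}_i^T (\mathbf{W}_{n,i} + \Delta\mathbf{W}_i)^T (\mathbf{W}_{n,i} + \Delta\mathbf{W}_i)\,\mathbf{y}_i
\approx
\mathbf{y}_i^T \mathbf{W}_{n,i}^T \mathbf{W}_{n,i}\,\mathbf{y}_i
+ 2\,\mathbf{y}_i^T \mathbf{W}_{n,i}^T \Delta\mathbf{W}_i\,\mathbf{y}_i
```

The factor 2 arises because the two cross terms are scalars that are transposes of each other, so they are equal. The result is linear in the unknown increment, which is what allows each pairwise relation to contribute one row to a linear system at every iteration.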
The nonlinear model is approximated according to

$$\mathbf{W}_i = \sum_{k=1}^{K}\mathbf{W}_k P_k(\mathbf{y}_i). \tag{11}$$

Eq. (8) concerns data $\mathbf{y}_i$ and $\mathbf{y}_j$. Hence, by applying the same procedure to any data pair, an algebraic system of equations results, involving the $N \times d \times K$ unknowns. Thus, the safe construction of the nonlinear mapping requires a rich enough sampling, i.e.

$$N \times d \times K < \frac{P^2}{2} - P. \tag{12}$$

The nonlinear approximation (11) was accomplished by using radial bases built on either the Gaussian or the inverse-quadratic radial kernels, both making use of the distance $r$,

$$r_k = \|\mathbf{y} - \mathbf{y}_k\|, \tag{13}$$

from which both interpolants read

$$P_k(r) = e^{-(\varepsilon r_k)^2}, \tag{14}$$

and

$$P_k(r) = \frac{1}{1 + e^{-(\varepsilon r_k)^2}}, \tag{15}$$

with $\varepsilon$ affecting the shape of the radial kernel.

References
[1] S. Roweis, L. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326.
[2] E. Lopez, D. Gonzalez, J.V. Aguado, E. Abisset-Chavanne, E. Cueto, C. Binetruy, F. Chinesta, Archives of computational methods in engineering, Int. J. Numer. Methods Eng. 25 (1) (January 2018) 59–68, https://doi.org/10.1007/s11831-016-9172-5.
[3] L. van der Maaten, G. Hinton, Visualizing data using t-SNE, J. Mach. Learn. Res. 9 (2008) 2579–2605.
[4] A. Criminisi, J. Shotton, E. Konukoglu, Decision Forests for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning, Microsoft Research technical report TR-2011-114, 2011.
[5] J. Greenhalgh, M. Mirmehdi, Traffic sign recognition using MSER and random forests, in: 20th European Signal Processing Conference, 2012.
[6] J. Schmidhuber, Deep learning in neural networks: an overview, Neural Netw. 61 (2015) 85–117.
[7] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436–444.
[8] J. Donahue, L.A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, T. Darrell, Long-term recurrent convolutional networks for visual recognition and description, IEEE Trans. Pattern Anal. Mach. Intell. 39 (4) (2017) 677–691.
[9] C. Dong, C.C. Loy, K. He, X. Tang, Image super-resolution using deep convolutional networks, IEEE Trans. Pattern Anal. Mach. Intell. 38 (2) (2016) 295–307.
[10] Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin, A neural probabilistic language model, J. Mach. Learn. Res. 3 (2003) 1137–1155.
[11] P. Norvig, S. Russell, Artificial Intelligence, a Modern Approach, Prentice-Hall, 1994.
[12] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS'13), vol. 2, 2013.
[13] M. Abadi, et al., Google Research, Large-Scale Machine Learning on Heterogeneous Distributed Systems, Technical Report, 2015.
[14] R. Ibanez, E. Abisset-Chavanne, A. Ammar, D. Gonzalez, E. Cueto, J.-L. Duval, F. Chinesta, A multidimensional data-driven sparse identification technique: the sparse proper generalized decomposition, Complexity (2018) 5608286, https://doi.org/10.1155/2018/5608286.