Science Arts & Métiers (SAM)
is an open access repository that collects the work of Arts et Métiers Institute of
Technology researchers and makes it freely available over the web where possible.
This is an author-deposited version published in:
https://sam.ensam.eu
Handle ID: http://hdl.handle.net/10985/18405
To cite this version :
Clara ARGERICH MARTÍN, Rubén IBÁÑEZ PINILLO, Anaïs BARASINSKI, Francisco CHINESTA
Code2vect: An efficient heterogenous data classifier and nonlinear regression technique
-Comptes Rendus Mecanique - Vol. 347, n°11, p.754-761 - 2019
Code2vect: An efficient heterogenous data classifier and nonlinear regression technique

Clara Argerich Martín a, Ruben Ibáñez Pinillo a, Anais Barasinski b, Francisco Chinesta c,∗

a PIMM, Arts et Métiers Institute of Technology, CNRS, CNAM, HESAM University, 151, boulevard de l’Hôpital, 75013 Paris, France
b University of Pau & Pays Adour, E2S UPPA, IPREM UMR 5254, 64000 Pau, France
c ESI GROUP Chair @ PIMM, Arts et Métiers Institute of Technology, 151, boulevard de l’Hôpital, 75013 Paris, France
Abstract

The aim of this paper is to present a new classification and regression algorithm based on Artificial Intelligence. The main feature of this algorithm, which will be called Code2Vect, is the nature of the data it treats: qualitative or quantitative, and continuous or discrete. Contrary to other artificial intelligence techniques based on “Big Data,” this new approach enables working with a reduced amount of data, within the so-called “Smart Data” paradigm. Moreover, the main purpose of this algorithm is to enable the representation of high-dimensional data and, more specifically, grouping and visualizing this data according to a given target. For that purpose, the data is projected into a vector space equipped with an appropriate metric, able to group data according to their affinity (with respect to a given output of interest). Another application of this algorithm lies in its prediction capability. As with most common data-mining techniques, such as regression trees, the output is inferred from a given input, in this case taking into account the nature of the data described above. To illustrate its potential, two different applications are addressed, one concerning the representation of high-dimensional and categorical data and another featuring the prediction capabilities of the algorithm.

Keywords: Machine learning; Data representation; Classification; Categorical data; Neural network; High-dimensional data; Regression
1. Introduction
The use of artificial intelligence to work with data in order to classify, represent or analyze it is a field that is attracting more and more interest from both the scientific and technological viewpoints. Many techniques for this purpose have been developed in recent years, such as Locally Linear Embedding, presented in [1] and widely used for diverse applications [2]. The t-SNE (t-distributed Stochastic Neighbor Embedding) technique, presented in [3], is widely used for visualizing high-dimensional data and discovering the latent manifold in which data is embedded, whose dimensionality informs on the number of explicative uncorrelated parameters. Other algorithms were designed for extracting hidden regressions from data, and then being able to infer a response (output) for a given input. Linear and nonlinear regressions were widely considered in a variety of works and diverse applications; decision trees and random forests were proposed for operating in highly dimensional spaces. In [4] the authors review these techniques, and some applications using random forests are addressed in [5]. Regression techniques operate quite well even in the low-data limit, in particular the ones based on sparse sensing combined with the use of separated representation formats. When data is abundant, the most powerful learning techniques are the ones based on the use of neural networks, giving rise to so-called deep learning (see [6] and [7] for general knowledge on deep learning).

∗ Corresponding author. E-mail addresses: [email protected] (C. Argerich Martín), [email protected] (R. Ibáñez Pinillo), [email protected] (A. Barasinski), [email protected] (F. Chinesta).

Fig. 1. Input space (left) and target vector space (right).
One of the most used applications of deep learning is visual recognition and image classification [8,9]. Deep learning techniques are also used for probabilistic language modeling [10]. Interested readers can refer to [11] and the numerous references therein.

One of the most recent algorithms developed for heterogeneous data classification is the so-called Word2Vec, developed by Google Inc. This algorithm, presented in [12], is based on the use of neural networks [13] and the representation of each word as a vector representing the probability of having a given neighboring word. This approach tries to circumvent the difficulty of defining distances between qualitative data: one could consider that yellow and red are quite close because both represent colors; however, from the spelling of both words such a proximity is difficult to quantify.

In fields like manufacturing engineering such an approach is particularly relevant because the performance of, for example, a manufactured plastic part depends on the resin used (defined by its commercial name or by its chemical nature, also codified but differently), the oven temperature (in a particular unit system), the curing time and the name of the employee (e.g., Peter, Paul...).

Now, a process could be characterized from these four parameters: (i) resin, (ii) temperature, (iii) process time and (iv) employee name; and with each process we could associate a label corresponding to a particular output of interest, a value quantifying a particular performance of the formed part.
Thus, each process could be described by a point in a four-dimensional space, each point having a color depending on the value associated with the output of interest. The four axes represent the heterogeneous parameters appropriately codified. For example, a numeric code must be associated with each resin or operator name (e.g., polyester = 1, vinylester = 2, ...; Peter = 1, Paul = 2, ...) in order to represent qualitative data on the resin or operator axes. However, these choices are totally arbitrary, as are the units of the different quantitative data (temperature and process time in our example). Thus, calculating distances between processes in such a space makes no sense. In other words, two points representing two processes could be very close in the four-dimensional space but have very different outputs of interest and, conversely, remote points could imply very similar performances.
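To make this arbitrariness concrete, the following minimal sketch encodes two hypothetical processes under two equally valid resin codings and two time units, and compares the resulting Euclidean distances. The process values, the second coding and the helper names `encode` and `distance` are illustrative assumptions, not data from the study.

```python
import numpy as np

def encode(process, resin_code, operator_code, time_scale=1.0):
    """Turn (resin, temperature, time, operator) into a numeric array."""
    resin, temperature, time, operator = process
    return np.array(
        [resin_code[resin], temperature, time * time_scale, operator_code[operator]],
        dtype=float,
    )

# Two hypothetical processes (values are illustrative only).
p1 = ("polyester", 180.0, 30.0, "Peter")
p2 = ("vinylester", 180.0, 31.0, "Paul")

operators = {"Peter": 1, "Paul": 2}
resin_a = {"polyester": 1, "vinylester": 2}   # coding used in the text
resin_b = {"polyester": 1, "vinylester": 5}   # another, equally arbitrary coding

def distance(resin_code, time_scale=1.0):
    """Euclidean distance between the two encoded processes."""
    a = encode(p1, resin_code, operators, time_scale)
    b = encode(p2, resin_code, operators, time_scale)
    return float(np.linalg.norm(a - b))

d_coding_a = distance(resin_a)                 # time in minutes
d_coding_b = distance(resin_b)                 # same processes, other resin codes
d_seconds = distance(resin_a, time_scale=60)   # same processes, time in seconds
print(d_coding_a, d_coding_b, d_seconds)
```

The two processes never change, yet the three distances differ, which is the reason a raw Euclidean distance in the coded space carries no meaning.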
This issue is not new, and many authors, when considering heterogeneous but quantitative data, normalized them to minimize the impact of units on the computation of distances (for example, usual stresses and strains in solid mechanics differ by about 12 orders of magnitude, the former being of order $10^{6}$ and the latter $10^{-6}$). However, normalization hides the issue without really solving it.

In what follows, we propose a technique, sketched in Fig. 1, for mapping points from a representation space to a target space equipped with a Euclidean metric, allowing the quantification of distances and then the application of all the numerical artillery based on distances, as considered later.
2. Classification and regression strategies
In this section, the methodology sketched in Section 1 is described. We assume that the points in the original space (space of representation) consist of $P$ arrays composed of $N$ entries (four in the example considered above). They are called arrays because they cannot be considered vectors, and are denoted by $\mathbf{y}_i$. Their images in the vector space are denoted by $\mathbf{x}_i \in \mathbb{R}^d$, with $d \ll N$, this time real vectors subjected to the rules of coordinate transformation. That vector space is equipped with the standard scalar product and the associated Euclidean distance. The mapping is described by the $d \times N$ matrix $\mathbf{W}$,

$$\mathbf{x} = \mathbf{W}\mathbf{y}, \tag{1}$$

where both the components of $\mathbf{W}$ and the images $\mathbf{x}_i \in \mathbb{R}^d$, $i = 1, \dots, P$, must be calculated. Each point $\mathbf{x}_i$ keeps the label, denoted by $O_i$ (value of the output of interest, here assumed scalar), associated with its origin point $\mathbf{y}_i$.

We would like to place the points $\mathbf{x}_i$ such that the Euclidean distance to each other point $\mathbf{x}_j$ scales with the difference of their outputs, i.e.

$$(\mathbf{W}(\mathbf{y}_i - \mathbf{y}_j)) \cdot (\mathbf{W}(\mathbf{y}_i - \mathbf{y}_j)) = \|\mathbf{x}_i - \mathbf{x}_j\|^2 = |O_i - O_j|, \tag{2}$$

where the coordinates of one of the points can be arbitrarily chosen. Thus, there are $\frac{P^2}{2} - P$ relations for determining the $d \times N$ unknowns (the components of $\mathbf{W}$, because the coordinates $\mathbf{x}_i$ are obtained from $\mathbf{W}$ by applying relation (1)).

Table 1
Codification of categorical parameters.

Parameter   Level   Code
Param. 1    A       10
Param. 1    B       11
Param. 1    C       12
Param. 1    D       13
Param. 1    E       14
Param. 1    F       15
Param. 1    G       16
Material    A       1
Material    B       2
Material    C       3
Material    D       4
Material    E       5
Material    F       6
Supplier    A       1.5
Supplier    B       2.5
Supplier    C       3.5
Supplier    D       4.5
Param. 2    A       7
Param. 2    B       8
Stable      YES     1
Stable      NO      0.1

Linear mappings are limited and do not allow operating in nonlinear settings. Thus, a better choice consists of the nonlinear mapping $\mathbf{W}(\mathbf{y})$, expressible from the general polynomial form

$$\mathbf{W}(\mathbf{y}) = \sum_{k=1}^{K} \mathbf{W}_k P_k(\mathbf{y}), \tag{3}$$

where the matrices $\mathbf{W}_k$ have the same dimensions as $\mathbf{W}$ and must ensure the $\mathbf{y}_j \to \mathbf{x}_j$ mapping. On the other hand, the interpolation functions $P_k(\mathbf{y})$ can be continuous or discontinuous. In the numerical applications described later, standard polynomial or radial functions are considered.
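As an illustration of the linear case in Eqs. (1) and (2), the sketch below fits a constant matrix W so that squared distances between mapped points approach the output differences. It is a minimal sketch under stated assumptions: a damped Gauss-Newton (Levenberg-Marquardt-style) iteration is swapped in for the Newton scheme of Appendix A, and the function names, the random initialization and the damping schedule are hypothetical choices, not the authors' implementation.

```python
import numpy as np

def pair_differences(Y, O):
    """Return y_i - y_j and |O_i - O_j| for all pairs i < j."""
    i, j = np.triu_indices(len(Y), k=1)
    return Y[i] - Y[j], np.abs(O[i] - O[j])

def residuals(W, D, T):
    """Mismatch of Eq. (2) per pair: ||W (y_i - y_j)||^2 - |O_i - O_j|."""
    X = D @ W.T
    return np.sum(X * X, axis=1) - T

def fit_linear_map(Y, O, d=2, iters=50, seed=0):
    """Fit the d x N matrix W by damped Gauss-Newton steps (assumed solver)."""
    rng = np.random.default_rng(seed)
    D, T = pair_differences(Y, O)
    N = Y.shape[1]
    W = 0.1 * rng.standard_normal((d, N))
    lam = 1.0                                   # damping parameter
    r = residuals(W, D, T)
    for _ in range(iters):
        X = D @ W.T                             # mapped differences, (pairs, d)
        # Jacobian entry for pair p and W[a, b]: 2 * X[p, a] * D[p, b].
        J = 2.0 * (X[:, :, None] * D[:, None, :]).reshape(len(D), d * N)
        step = np.linalg.solve(J.T @ J + lam * np.eye(d * N), -J.T @ r)
        W_try = W + step.reshape(d, N)
        r_try = residuals(W_try, D, T)
        if r_try @ r_try < r @ r:               # accept only improving steps
            W, r, lam = W_try, r_try, max(lam * 0.5, 1e-8)
        else:
            lam *= 10.0
    return W

# Usage on synthetic data (placeholder values, not the paper's dataset).
rng = np.random.default_rng(1)
Y_demo = rng.standard_normal((12, 4))
O_demo = Y_demo[:, 0] ** 2 + Y_demo[:, 1]
W_demo = fit_linear_map(Y_demo, O_demo, d=2)
X_demo = Y_demo @ W_demo.T                      # images x_i = W y_i
```

Because steps are accepted only when the squared residual decreases, the fit never gets worse than its starting point; how close it gets to satisfying Eq. (2) depends on how well a linear mapping can represent the data.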
The associated nonlinear problem can be efficiently solved by using an adequate linearization strategy, e.g., Newton’s method as presented in Appendix A. Because of the increase in the number of unknowns to be determined, a rich enough sampling is required.

The choice of $d$ deserves some comments. The lower $d$ is, the stronger the mapping nonlinearity becomes; however, the visualization becomes simpler.
When the dimensionality $N$ increases, approximations suffer from the so-called curse of dimensionality. A very efficient way of alleviating it consists in using separated representations [14], from which the mapping reads

$$\mathbf{W}(\mathbf{y}) = \sum_{k=1}^{K} \mathbf{W}_k \prod_{i=1}^{N} P_i^k(y_i). \tag{4}$$

In the previous expression $y_i$ refers to the $i$-th component of $\mathbf{y}$.

The use of these separated representations constitutes a work in progress.
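The structure of a separated representation can be sketched as follows: each mode contributes a matrix weighted by a product of one-dimensional functions, one per input coordinate. The monomial basis and the two mode matrices below are hypothetical placeholders chosen only to make the evaluation concrete.

```python
import numpy as np

def separated_map(y, W_modes, basis):
    """Evaluate W(y) = sum_k W_k * prod_i P_i^k(y_i), the form of Eq. (4)."""
    W_y = np.zeros(W_modes[0].shape)
    for k, W_k in enumerate(W_modes):
        # Product of one-dimensional functions, one per input coordinate.
        weight = np.prod([basis(k, i, y_i) for i, y_i in enumerate(y)])
        W_y += weight * W_k
    return W_y

def monomial_basis(k, i, value):
    """Hypothetical choice: P_i^k(y_i) = y_i**k for every coordinate i."""
    return value ** k

y = np.array([2.0, 3.0])
W_modes = [np.eye(2), np.array([[0.0, 1.0], [1.0, 0.0]])]   # placeholder modes
W_y = separated_map(y, W_modes, monomial_basis)
x = W_y @ y                                                  # image of y
```

Each one-dimensional function depends on a single coordinate, so the number of unknowns grows with the sum of the per-coordinate bases rather than with their product, which is what makes the format attractive in high dimension.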
3. Numerical examples

This section addresses two case studies, the first devoted to the classification of heterogeneous data and the second aiming at emphasizing the classification and nonlinear regression capabilities of the technique just introduced.

3.1. Representation of high-dimensional categorical data from manufacturing processes
In this section an application with real and highly nonlinear data is presented. We consider a database provided by industry and related to a real manufacturing process. The input space concerns six manufacturing parameters collected during 53 realizations of the process. These comprise four categorical parameters (material, supplier and two process parameters), whose quantification is shown in Table 1, and two other quantitative process parameters. The output is the stability of the process itself, also quantitatively codified as shown in Table 1. Words are codified by using numbers in the $[-1, 1]$ interval.

In this particular case the input data, $\mathbf{y}_i$, $\forall i$, is grouped in a $6 \times 53$ matrix (53 realizations of the process defined by 6 input parameters), with a single scalar target. The vector space $\mathbf{x}_i$ consists of a $2 \times 53$ matrix (53 realizations mapped into a two-dimensional space).

Fig. 2. Residue and tolerance related to the computation of mapping W.
Fig. 3. Vector space representations: output, material, and supplier.
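The codification step can be sketched as a simple lookup. The codes below are transcribed from Table 1; the ordering of the six inputs, the function name `encode_realization` and the sample values are assumptions made for illustration, since the dataset itself is not reproduced here.

```python
# Codes transcribed from Table 1.
CODES = {
    "param_1": {"A": 10, "B": 11, "C": 12, "D": 13, "E": 14, "F": 15, "G": 16},
    "material": {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6},
    "supplier": {"A": 1.5, "B": 2.5, "C": 3.5, "D": 4.5},
    "param_2": {"A": 7, "B": 8},
}
STABLE_CODE = {"YES": 1.0, "NO": 0.1}

def encode_realization(material, supplier, param_1, param_2, q1, q2):
    """Build one 6-entry input array y_i (input ordering is an assumption)."""
    return [
        float(CODES["material"][material]),
        float(CODES["supplier"][supplier]),
        float(CODES["param_1"][param_1]),
        float(CODES["param_2"][param_2]),
        float(q1),   # first quantitative process parameter
        float(q2),   # second quantitative process parameter
    ]

# Hypothetical realization and its codified output.
y_example = encode_realization("F", "D", "A", "B", 0.3, 1.2)
o_example = STABLE_CODE["YES"]
```

Stacking 53 such arrays column-wise would give the 6 × 53 input matrix described in the text.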
Once the input data and the output are quantitatively codified, we run the algorithm described in the previous section, in particular its nonlinear counterpart, which reaches convergence in a few iterations (indicated by index $i$), as Fig. 2 shows, where the tolerance, defined by

$$\text{Tolerance} = \log \frac{\|\mathbf{W}_i - \mathbf{W}_{i-1}\|}{\|\mathbf{W}_{i-1}\|} \tag{5}$$

is depicted at each iteration, together with the residue of the Newton iteration (8).
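A convergence indicator of this kind is straightforward to compute between two successive iterates. In the sketch below the Frobenius norm and the base-10 logarithm are assumptions, since the text specifies neither, and the function name is hypothetical.

```python
import numpy as np

def log_relative_change(W_curr, W_prev):
    """Log of the relative change between successive iterates, as in Eq. (5).

    Assumes the Frobenius norm and a base-10 logarithm.
    """
    change = np.linalg.norm(W_curr - W_prev)
    return float(np.log10(change / np.linalg.norm(W_prev)))

W_prev = np.eye(2)
W_curr = 1.01 * np.eye(2)          # a 1% relative update
tol = log_relative_change(W_curr, W_prev)
```

With these conventions a 1% relative update gives a value close to -2, and the indicator decreases as successive iterates get closer.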
Fig. 3(a) depicts the mapped samples in the vector space, that is $\mathbf{x}_i$, where the 53 points representing each realization of the manufacturing process have been colored according to their output. As the reader can appreciate, two well-differentiated clusters are obtained, one containing all the stable processes and the other grouping non-stable conditions, with only two points outside the clusters. It is important to note that better results can be obtained by increasing both the sampling and the richness of the approximation of the nonlinear mapping, that is, $K$ in the approximation (3).

Other than clustering with respect to the output, or using the computed mapping for inferring stability for new process conditions, one could also try to extract the relevance of the different parameters on the output. For that it suffices to color the mapped points according to the value of the considered parameter. Figs. 3(b) and 3(c) report this analysis for material and supplier, respectively. From these images it can be concluded that materials F and D and supplier D exhibit stability. On the contrary, supplier C compromises process stability.
The same analysis was performed by using an NN (neural network) with a hidden layer composed of two neurons, trained with the data $\mathbf{y}_i$ and the associated stability outputs. The NN is shown in Fig. 4, with 53 data consisting of 6 inputs, with a single output. The dimension is reduced from 6 to 2 by using 2 neurons in the hidden layer, to emulate the low-dimensional space generated by Code2Vect; the result is depicted in Fig. 5, where points associated with stable process conditions are colored in blue. It can be concluded that clustering is not achieved. Even if we cannot conclude on the superiority of one procedure with respect to the other (because of the large number of NN variants), we can conclude on the simplicity and efficiency of the Code2Vect proposal.

Fig. 4. Neural network.

Fig. 5. NN-based model.

Fig. 6. Acoustic absorbers.
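For readers unfamiliar with the comparison, the sketch below shows how a 6-2-1 network yields a two-dimensional representation: the two hidden activations of each sample serve as its coordinates. The tanh activation, the random weights and the placeholder data are assumptions; the text does not specify the activation or the training procedure, and no training is performed here.

```python
import numpy as np

def forward(Y, W1, b1, w2, b2):
    """Forward pass of a 6-2-1 network; returns hidden coordinates and output."""
    hidden = np.tanh(Y @ W1 + b1)   # shape (P, 2): the low-dimensional view
    output = hidden @ w2 + b2       # shape (P,): scalar prediction per sample
    return hidden, output

rng = np.random.default_rng(0)
Y = rng.standard_normal((53, 6))            # placeholder data, not the industrial set
W1, b1 = rng.standard_normal((6, 2)), np.zeros(2)
w2, b2 = rng.standard_normal(2), 0.0
hidden, output = forward(Y, W1, b1, w2, b2)
```

Plotting the rows of `hidden` colored by the stability label would give the kind of picture that Fig. 5 is compared against.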
3.2. Model and regression construction
We consider acoustic systems composed of acoustic resonators, whose simplest structure, the Helmholtz resonator (HR), is depicted in Fig. 6a and involves three main geometric parameters: length ($L$), section ($A$) and volume ($V$). HRs can be combined to create more complex acoustic structures, as illustrated in Fig. 6b, whose behavior is governed by the dynamic system $\mathbf{M}\ddot{\boldsymbol{\xi}} + \mathbf{C}\dot{\boldsymbol{\xi}} + \mathbf{K}\boldsymbol{\xi} = \mathbf{f}$ (where $\boldsymbol{\xi}$ defines the air motion amplitude at every bottleneck) and was analyzed in depth in [15], from which the resonance frequencies can be extracted. Without loss of generality, the acoustic system considers $L_1 = L_2 = L_3$, $A_1 = A_2 = A_3$ and $V_1 = V_2$, reducing the number [...]

Fig. 7. Vector space colored with respect to the parameter values (top) and output of interest (bottom).

[...] were considered and the two resonance frequencies computed, both constituting the output to be predicted by the sought data-based model.
Results obtained when applying the Code2Vect technique are depicted in Fig. 7, where the two images at the bottom represent the vector space with samples colored according to the associated outputs. In the present case, the scalar output considered in the mapping construction (Eq. (8)) consists of the norm of the vectors containing both frequencies, i.e. with $\mathbf{t}_i = (\omega_{1i}, \omega_{2i})$, $|O_i - O_j| = \|\mathbf{t}_i - \mathbf{t}_j\|$.

From Fig. 7, we can conclude that volume and length ($V$ and $L$) are inversely correlated with the output, because, for example, the yellow color concentrates at the bottom when representing the output whereas it localizes at the top when referring to these parameters. The opposite occurs when considering the cross-section area ($A$). These findings are in perfect agreement with the known physics of a Helmholtz resonator, whose resonance frequency $\omega_r$ is given by

$$\omega_r = \frac{c}{2\pi} \sqrt{\frac{A}{V L}}, \tag{6}$$

where $c$ is the speed of sound.

Now, with the model $\mathbf{W}(\mathbf{y})$ relating geometrical parameters and resonance frequencies extracted, its predictive capabilities are tested. For that purpose the model is extracted by using only 60% of the 64 geometrical variants, and the rest of the samples (the remaining 40%, $\mathbf{y}_j$) are used for validating the predictions, i.e. the output at $\mathbf{x}_j = \mathbf{W}(\mathbf{y}_j)\,\mathbf{y}_j$ interpolated from the outputs $\mathbf{t}_i$ at the neighboring samples $\mathbf{x}_i$. The predictions are depicted in Fig. 8, with the samples used for training in yellow and the predicted ones in blue. The deviation with respect to the diagonal represents the prediction error, which is, as expected, very low for the samples used in the training and a bit larger, but lower than 4%, for the others.
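The prediction step, interpolating the output of a new mapped point from its neighbors in the vector space, can be sketched as below. Inverse-distance weighting over the k nearest training points is an assumption made for illustration; the text does not state which interpolation is used, and the function name and sample data are hypothetical.

```python
import numpy as np

def predict_from_neighbors(X_train, T_train, x_new, k=3, eps=1e-12):
    """Interpolate the output at x_new from its k nearest mapped samples.

    Uses inverse-distance weights (an assumed scheme); eps avoids division
    by zero when x_new coincides with a training point.
    """
    dist = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dist)[:k]
    weights = 1.0 / (dist[nearest] + eps)
    weights /= weights.sum()
    return weights @ T_train[nearest]

# Hypothetical mapped training points and their two-frequency outputs.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
T_train = np.array([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0], [9.0, 9.0]])
t_pred = predict_from_neighbors(X_train, T_train, np.array([0.2, 0.2]), k=3)
```

Such an interpolation is only meaningful because the mapping places points with similar outputs close to each other, so neighbors in the vector space are neighbors in output as well.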
From the discussion above, one expects that the best model is the one relating the combined parameter $A/(VL)$ with the resonance frequencies, instead of the one considered above, which took the geometrical parameters individually, that is, $L$, $A$ and $V$. In order to prove the superiority of modeling with the best parametrization, Fig. 9 compares the previous prediction with the one obtained by using the combined parameter $A/(VL)$, which produced almost perfect predictions.

However, the physics behind the data is not always known to inform the best choice of parameters. A procedure for accessing the most relevant combined parameters constitutes one of our main research activities, still in progress. For the moment, and in the absence of a priori information, parameters will be selected and incorporated individually in the model.
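The role of the combined parameter can be checked directly on the single-resonator formula of Eq. (6). In this minimal sketch the speed of sound (343 m/s, air at about 20 °C) and the sample geometries are assumed values, and the function name is hypothetical.

```python
import math

def helmholtz_frequency(A, V, L, c=343.0):
    """Resonance frequency of Eq. (6): c / (2*pi) * sqrt(A / (V * L))."""
    return c / (2.0 * math.pi) * math.sqrt(A / (V * L))

f_ref = helmholtz_frequency(A=1e-4, V=1e-3, L=0.05)
f_same_ratio = helmholtz_frequency(A=2e-4, V=2e-3, L=0.05)   # same A / (V L)
f_big_volume = helmholtz_frequency(A=1e-4, V=4e-3, L=0.05)   # volume multiplied by 4
```

Geometries sharing the same value of A/(VL) give the same frequency, while multiplying the volume by four halves it, consistent with the inverse correlation observed for V and L.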
4. Conclusions
The classification algorithm presented here enables the representation of high-dimensional heterogeneous data, including continuous and discrete, quantitative and categorical. It establishes a mapping between a representation space and a vector space that is kept low-dimensional to facilitate visualization. It allows supervised classification but also prediction, thanks to its nonlinear regression nature. Moreover, it enables a very simple representation emphasizing the relevance of the parameters involved in the model. Its performance in the low-data limit has been demonstrated.

Fig. 8. Prediction of the target.

Fig. 9. Predicting the output.
Acknowledgements
The authors gratefully acknowledge the ANR (“Agence nationale de la recherche”, France) for its support through project AAPG2018 DataBEST, AIRBUS Group through the PhD CIFRE program, and ESI Group through the ESI-ENSAM Chair program.
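Appendix A. Nonlinear modeling

By denoting $\mathbf{W}_i = \mathbf{W}(\mathbf{y}_i)$, the Newton method proceeds at iteration $n$ as

$$\mathbf{W}_{n+1,i} = \mathbf{W}_{n,i} + \Delta\mathbf{W}_i, \tag{7}$$

which, when introduced into the nonlinear counterpart of Eq. (2),

$$(\mathbf{W}_i\mathbf{y}_i - \mathbf{W}_j\mathbf{y}_j) \cdot (\mathbf{W}_i\mathbf{y}_i - \mathbf{W}_j\mathbf{y}_j) = \|\mathbf{x}_i - \mathbf{x}_j\|^2 = |O_i - O_j|, \tag{8}$$

gives a left-hand member that reads

$$\begin{aligned}
&\left(\mathbf{y}_i^T(\mathbf{W}_{n,i} + \Delta\mathbf{W}_i)^T - \mathbf{y}_j^T(\mathbf{W}_{n,j} + \Delta\mathbf{W}_j)^T\right)\left((\mathbf{W}_{n,i} + \Delta\mathbf{W}_i)\mathbf{y}_i - (\mathbf{W}_{n,j} + \Delta\mathbf{W}_j)\mathbf{y}_j\right) \\
&\quad = \mathbf{y}_i^T(\mathbf{W}_{n,i} + \Delta\mathbf{W}_i)^T(\mathbf{W}_{n,i} + \Delta\mathbf{W}_i)\mathbf{y}_i + \mathbf{y}_j^T(\mathbf{W}_{n,j} + \Delta\mathbf{W}_j)^T(\mathbf{W}_{n,j} + \Delta\mathbf{W}_j)\mathbf{y}_j \\
&\qquad - 2\,\mathbf{y}_j^T(\mathbf{W}_{n,j} + \Delta\mathbf{W}_j)^T(\mathbf{W}_{n,i} + \Delta\mathbf{W}_i)\mathbf{y}_i.
\end{aligned} \tag{9}$$

Since each term appearing in Eq. (9) presents the same structure, we develop the linearization for one of these terms:

[...] (10)

and similarly for the other terms in Eq. (9).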
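The linearized expression itself is not legible in this version. As a hedged sketch, assuming the usual Newton linearization in which terms quadratic in the increment are neglected, and writing the increment as $\Delta\mathbf{W}_i$ with transposes made explicit (notation assumed here), the first term of Eq. (9) would expand as:

```latex
\begin{aligned}
\mathbf{y}_i^T(\mathbf{W}_{n,i} + \Delta\mathbf{W}_i)^T(\mathbf{W}_{n,i} + \Delta\mathbf{W}_i)\mathbf{y}_i
&= \mathbf{y}_i^T\mathbf{W}_{n,i}^T\mathbf{W}_{n,i}\mathbf{y}_i
 + \mathbf{y}_i^T\mathbf{W}_{n,i}^T\Delta\mathbf{W}_i\mathbf{y}_i
 + \mathbf{y}_i^T\Delta\mathbf{W}_i^T\mathbf{W}_{n,i}\mathbf{y}_i
 + \mathbf{y}_i^T\Delta\mathbf{W}_i^T\Delta\mathbf{W}_i\mathbf{y}_i \\
&\approx \mathbf{y}_i^T\mathbf{W}_{n,i}^T\mathbf{W}_{n,i}\mathbf{y}_i
 + 2\,\mathbf{y}_i^T\mathbf{W}_{n,i}^T\Delta\mathbf{W}_i\mathbf{y}_i .
\end{aligned}
```

The two first-order terms are scalars that are transposes of each other, hence the factor 2; the result is linear in the unknown increment, which is what allows each Newton step to be solved as a linear system.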
The nonlinear model is approximated according to

$$\mathbf{W}_i = \sum_{k=1}^{K} \mathbf{W}_k P_k(\mathbf{y}_i). \tag{11}$$

Eq. (8) concerns data $\mathbf{y}_i$ and $\mathbf{y}_j$. Hence, by applying the same procedure for any data pair, an algebraic system of equations results involving the $N \times d \times K$ unknowns. Thus, the safe construction of the nonlinear mapping requires a rich enough sampling, i.e.

$$N \times d \times K < \frac{P^2}{2} - P. \tag{12}$$

The nonlinear approximation (11) was accomplished by using radial bases based on either the Gaussian or the inverse-quadratic radial kernels, both making use of the distance $r$,

$$r_k = \|\mathbf{y} - \mathbf{y}_k\|, \tag{13}$$

from which both interpolants read

$$P_k(r) = e^{-(\epsilon r_k)^2}, \tag{14}$$

and

$$P_k(r) = \frac{1}{1 + e^{-(\epsilon r_k)^2}}, \tag{15}$$

with $\epsilon$ affecting the shape of the radial kernel.
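To show how the pieces fit together, the sketch below evaluates a nonlinear mapping built from Gaussian radial kernels and applies it to a point. The centers, the mode matrices, the shape-parameter name `eps` and the function names are hypothetical placeholders; in practice the matrices would come from the Newton iteration described in this appendix.

```python
import numpy as np

def gaussian_kernels(y, centers, eps=1.0):
    """Gaussian radial functions P_k = exp(-(eps * r_k)^2), r_k = ||y - y_k||."""
    r = np.linalg.norm(centers - y, axis=1)
    return np.exp(-(eps * r) ** 2)

def nonlinear_map(y, centers, W_modes, eps=1.0):
    """Image x = W(y) y with W(y) = sum_k W_k P_k(y)."""
    P = gaussian_kernels(y, centers, eps)
    W_y = np.tensordot(P, W_modes, axes=1)   # weighted sum over k, shape (d, N)
    return W_y @ y

centers = np.array([[1.0, 0.0], [0.0, 1.0]])            # hypothetical centers y_k
W_modes = np.stack([np.eye(2), 2.0 * np.eye(2)])        # placeholder matrices W_k
x = nonlinear_map(np.array([1.0, 0.0]), centers, W_modes)
```

Here the point coincides with the first center, so the first kernel equals one and the second decays with the squared distance to the other center.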
References
[1] S. Roweis, L. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326.
[2] E. Lopez, D. Gonzalez, J.V. Aguado, E. Abisset-Chavanne, E. Cueto, C. Binetruy, F. Chinesta, Archives of computational methods in engineering, Int. J. Numer. Methods Eng. 25 (1) (January 2018) 59–68, https://doi.org/10.1007/s11831-016-9172-5.
[3] L. van der Maaten, G. Hinton, Visualizing data using t-SNE, J. Mach. Learn. Res. 9 (2008) 2579–2605.
[4] A. Criminisi, J. Shotton, E. Konukoglu, Decision Forests for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning, Microsoft Research technical report, TR-2011-114, 2011.
[5] J. Greenhalgh, M. Mirmehdi, Traffic sign recognition using MSER and random forests, in: 20th European Signal Processing Conference, 2012.
[6] J. Schmidhuber, Deep learning in neural networks: an overview, Neural Netw. 61 (2015) 85–117.
[7] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436–444.
[8] J. Donahue, L.A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, T. Darrell, Long-term recurrent convolutional networks for visual recognition and description, IEEE Trans. Pattern Anal. Mach. Intell. 39 (4) (2017) 677–691.
[9] C. Dong, C.C. Loy, K. He, X. Tang, Image super-resolution using deep convolutional networks, IEEE Trans. Pattern Anal. Mach. Intell. 38 (2) (2016) 295–307.
[10] Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin, A neural probabilistic language model, J. Mach. Learn. Res. 3 (2003) 1137–1155.
[11] P. Norvig, S. Russell, Artificial Intelligence, a Modern Approach, Prentice-Hall, 1994.
[12] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS'13), vol. 2, 2013.
[13] M. Abadi, et al., Google Research Large-Scale Machine Learning on Heterogeneous Distributed Systems, Technical Report, 2015.
[14] R. Ibanez, E. Abisset-Chavanne, A. Ammar, D. Gonzalez, E. Cueto, J.-L. Duval, F. Chinesta, A multidimensional data-driven sparse identification technique: the sparse proper generalized decomposition, Complexity (2018) 5608286, https://doi.org/10.1155/2018/5608286.
[15] G. Quaranta, C. Argerich, R. Ibanez, J.-L. Duval, E. Cueto, F. Chinesta, From linear to nonlinear PGD-based parametric structural dynamics, C. R. Mécanique 347 (2019) 445–454.