Code2vect: An efficient heterogenous data classifier and nonlinear regression technique

(1)

Science Arts & Métiers (SAM)

is an open access repository that collects the work of Arts et Métiers Institute of

Technology researchers and makes it freely available over the web where possible.

This is an author-deposited version published in:

https://sam.ensam.eu

Handle ID: .

http://hdl.handle.net/10985/18405

To cite this version :

Clara ARGERICH MARTÍN, Rubén IBÁÑEZ PINILLO, Anaïs BARASINSKI, Francisco CHINESTA

Code2vect: An efficient heterogenous data classifier and nonlinear regression technique

-Comptes Rendus Mecanique - Vol. 347, n°11, p.754-761 - 2019

Any correspondence concerning this service should be sent to the repository

Administrator :

[email protected]

(2)

Code2vect:

An

eﬃcient

heterogenous

data

classiﬁer

and

nonlinear

regression

technique

Clara

Argerich

Martín

a

,

Ruben

Ibáñez Pinillo

a

,

Anais

Barasinski

b

,

Francisco

Chinesta

c

,

∗

a_PIMM,_Arts_et_Métiers_Institute_of_Technology,_CNRS,_CNAM,_HESAM_University,_151,_boulevard_de_{l’Hôpital,}₇₅₀₁₃_Paris,_France b_University_of_Pau_&_Pays_Adour,_E2S_UPPA,_IPREM_UMR5254,₆₄₀₀₀_Pau,_France

c_ESI_GROUP_Chair_@_PIMM,_Arts_et_Métiers_Institute_of_Technology,_151,_boulevard_de_{l’Hôpital,}₇₅₀₁₃_Paris,_France

a

b

s

t

r

a

c

t

Keywords: Machinelearning Datarepresentation Classiﬁcation Categorialdata Neuralnetwork High-dimensionaldata Regression

Theaimofthispaperistopresentanewclassificationandregressionalgorithmbasedon ArtificialIntelligence.Themainfeatureofthisalgorithm,whichwillbecalledCode2Vect, isthenature ofthe datatotreat: qualitativeorquantitativeandcontinuous ordiscrete. Contrary to other artificial intelligence techniques based on the “Big-Data,” this new approach will enable working with a reduced amount of data, within the so-called “Smart Data” paradigm.Moreover, themain purpose of thisalgorithm isto enable the representationof high-dimensional data and more specificallygrouping and visualizing thisdata according to agiventarget.For that purpose,the data will be projectedinto avectorialspaceequipped withan appropriatemetric, abletogroup data according to theiraffinity(withrespecttoagivenoutputofinterest).Furthermore,anotherapplication of this algorithm lies on its prediction capability. As it occurs with most common data-miningtechniques suchas regression trees, bygiving an input the output will be inferred,inthiscase consideringthe nature ofthedata formerlydescribed.In orderto illustrate itspotentialities, two different applications will be addressed,one concerning the representation of high-dimensional and categorical data and another featuring the predictioncapabilitiesofthealgorithm.

1. Introduction

Theuseofartificialintelligencetoworkwithdatainordertoclassify,representoranalyzeitisafieldthatisattracting the moreandmoreinterestfromboththescientific andtechnologicalviewpoints.Manytechniquesforthispurposehave beendevelopedduring thelast years,suchastheLocallyLinearEmbedding,presentedin[1] andlargelyused fordiverse applications [2]. The tSNE (t-distributed Stochastic NeighborEmbedding) presented in [3], is widely used for visualizing high-dimensional dataanddiscoverthelatentmanifold inwhichdataisembedded,whosedimensionalityinformsonthe numberofexplicativeuncorrelatedparameters.Otheralgorithmsweredesignedforextractinghiddenregressionsfromdata, andthenbeingabletoinferaresponse(output)foragiveninput.Linearandnonlinearregressionswerewidelyconsidered in a variety of worksand diverse applications,decision trees and random forest were proposed foroperating in highly

*

Correspondingauthor.

E-mailaddresses:[email protected](C. ArgerichMartín),[email protected](R. Ibáñez Pinillo), [email protected] (A. Barasinski),[email protected] (F. Chinesta).

(3)

Fig. 1. Input space (left) and target vector space (right).

dimensional spaces. In [4] authors perform a review of these techniques and some applications addressed in [5] using randomforests.Regressiontechniquesoperatequitewelleveninthelow-datalimit,inparticulartheonesbasedonsparse sensing combinedwiththe useofseparated representationformats. Whendata isabundant, the mostpowerfullearning techniquesaretheonesbasedontheuseofneuralnetworks,givenplacetotheso-calleddeep-learning(see[6] and[7] for generalknowledgeondeeplearning).

One ofthemostused applicationsofdeep Learningisvisual recognition andimage classiﬁcation[8,9]. Deep learning techniquesare alsousedforprobabilisticlanguage [10]. Interestedreaderscan referto [11] andthenumerousreferences therein.

Oneofthemostrecentalgorithmsdevelopedforheterogeneousdataclassificationistheso-calledWord2Vec,developed by Google inc. This algorithm presented in [12] is based on the use of neural networks[13] and the representation of eachwordasavectorrepresentingtheprobabilityofhavingagivenneighboringword.Thisapproachtriedtocircumvent the difficulty of defining distances between qualitative data, that is, one could consider that yellow and red are quite closebecausebotharerepresentingcolors,however,fromthespellingofbothwordssuchaproximitybecomesdifficultto quantify.

Infieldslikemanufacturingengineeringsuchanapproachisparticularlyrelevantbecausetheperformancesoffor exam-plea manufacturedplasticpartdependsontheresin used(definedby itscommercialappellation orthe chemicalnature also codified but differently), the oven temperature(in a particular unit system), the curing time and the name of the employe(e.g.,Peter,Paul...).

Now,aprocesscouldbecharacterizedfromthesefourparameters:(i)resin,(ii)temperature, (iii)processtimeand(iv) employe surname; andwitheach one we could associate a labelassociatedwith aparticular output ofinterest, a value quantifyingaparticularperformanceoftheformedpart.

Thus, each process could be described by a point in a four-dimensional space, each point having a color depending on thevalue associated withtheoutput ofinterest. The fouraxes representtheheterogeneous parameters appropriately codiﬁed.Forexample,anumericcodemustbeassociatedwitheach resinoroperatorname(e.g.,polyester

=

1,vinylester

=

2,...Peter

=

1,Paul

=

2,...)inordertorepresentqualitativedataontheresinoroperatoraxes...

However,thesechoicesaretotallyarbitrary,aswellasunitsinthedifferentquantitativedata(temperatureandprocess time in ourexample). Thus, it isevident that calculatingdistances betweenprocesses insuch a space hasnot sense. In otherwords,twopointsrepresentingtwoprocessescould beverycloseinthefour-dimensionalspacebuthaveoutputsof interestverydifferentand,oppositely,remotepointscouldimplyverysimilarperformances.

Thisissue isnot new,andmany authorswhen considering heterogeneousbutquantitative data,normalized themfor minimizing theimpactofunitsinthecomputation ofdistances(forexampleusual stressesandstraininsolid mechanics differinabout12ordersofmagnitude,theformerbeingoforder106 andthelast10−6).However,normalizationhidesthe issuewithoutreallysolvingit.

Inwhatfollows,weproposeatechnique,sketchedinFig.1,formappingpointsfromarepresentationspacetoatarget spaceequippedofanEuclideanmetricallowingthequantiﬁcationofdistancesandthenapplyingallthenumericalartillery usingdistances,asconsideredlater.

2. Classiﬁcationandregressionstrategies

In thissection, themethodology sketched inSection 1 is described. Weassume that the points in the original space (space of representation) consist of

P

arrays composed on N entries (four in the example considered above). They are assumedarraysbecausetheycannotbeconsideredvectors,andarenotedbyyi.Theirimagesinthevectorspacearenoted

by xi

∈ R

d, withd

N, thistime realvectors subjected to therules ofcoordinate transformation.That vector spaceis

equippedwiththe standardscalarproductandtheassociatedEuclideandistance.Themappingisdescribed bythed

×

N

(4)

Table 1

Codiﬁcationofcategorialparameters. Param. 1 Code A 10 B 11 C 12 D 13 E 14 F 15 G 16 Material Code A 1 B 2 C 3 D 4 E 5 F 6 Supplier Code A 1.5 B 2.5 C 3.5 D 4.5 Param. 2 Code A 7 B 8 Stable Code Y E S 1 N O 0.1 x

=

Wy

,

(1)

whereboth thecomponentsofW andtheimagesxi

∈ R

d,i

=

1

,

· · · ,

P

mustbecalculated.Eachpoint xi keepsthelabel,

denotedby

O

i(valueoftheoutputofinterest,hereassumedscalar),associatedwithitsoriginpointyi.

We wouldlike placing points xi,such that theEuclidean distancewitheach other point xj scales withtheir outputs

difference,i.e.

(

W

(

yi

−

yj

))

· (

W

(

yi

−

yj

))

=

xi

−

xj

2

= |

O

i

−

O

j

|,

(2)

wherethecoordinatesofoneofthepoints canbearbitrarilychosen. Thus,thereare P2

2

− P

relationsfordeterminingthe

d

×

N (thecomponentsofW becausethecoordinatesxi areobtainedfromW byapplyingrelation(1)).

Linear mappings are limited and do not allow operating in nonlinear settings. Thus, a better choice consists of the nonlinearmappingW

(

y

)

,expressiblefromthegeneralpolynomialform

W

(

y

)

=

K

k=1

W

k

P

k

(

y

),

(3)

wherematrices

W

khavethesamedimensionsas W andmustensuretheyj

→

xjmapping.

Ontheotherhand,theinterpolationfunctions

P

k

(

y

)

canbecontinuousordiscontinuous.Inthenumericalapplications

describedlater,standardpolynomialorradialfunctionsareconsidered.

The associated nonlinear problemcan be eﬃciently solved by usingan adequate linearization strategy, e.g., Newton’s methodaspresentedinAppendixA.Becauseoftheincrease inthenumberofunknownstobe determined,arichenough samplingisrequired.

Thechoiceofd deservessomecomments.Thelowerd is,thestrongerbecomes themappingnonlinearity,howeverthe visualizationbecomessimpler.

When thedimensionalityN increases approximationssuffertheso-calledcurseofdimensionality.Averyeﬃcientway ofalleviatingitconsistsinusingseparatedrepresentations[14] fromwhichthemappingreads

W

(

y

)

=

K

k=1

W

k N

i=1 Pi_k

(

yi

).

(4)

Inthepreviousexpression yi referstothei-componentofy.

Theuseoftheseseparatedrepresentationsconstitutesaworkinprogress.

3. Numericalexamples

Thissectionaddressestwocasestudies,thefirstdevotedtoclassificationofheterogenousdataandthesecondaimingat emphasizingtheclassificationandnonlinearregressioncapabilitiesofthetechniquejustintroduced.

3.1. Representationofhigh-dimensionalcategoricaldatafrommanufacturingprocesses

Inthissection anapplicationwithrealandhighlynon-lineardataispresented.Weconsideradatabaseprovidedbyan industry relatedtoarealmanufacturingprocess.The inputspaceconcernssixmanufacturingparameters collectedduring 53 realizationsoftheprocess.Parametersconcernfourcategoricalparameters:material,supplier,two processparameters whosequantificationisshowninTable1andtwootherquantitative processparameters.Theoutputisthestabilityofthe processitself,alsoquantitivelycodifiedasshowninTable1.Wordsarecodifiedbyusingnumbersintothe

[−

1

,

1

]

interval. Inthisparticularcasetheinput data,yi,

∀

i,isgroupedina 6

×

53 matrix(53realizationsoftheprocessdeﬁnedby 6

input parameters),withasinglescalartarget.Thevectorspacexi consistsofa2

×

53 matrix(53realizationsmappedinto

(5)

Fig. 2. Residue and tolerance related to the computation of mapping W.

Fig. 3. Vector space representations: output, material, and supplier.

Oncetheinputdataandtheoutputisquantitativelycodiﬁed,werunthealgorithmdescribedintheprevioussection,in particularitsnonlinearcounterpart,thatreachesconvergenceinfewiterations(indicatedbyindexi)asFig.2proves,where thetolerance,deﬁnedby

Tolerance

=

log

Wi

−

Wi−1

W_i−1

(5)

isdepictedateachiteration,withtheresidueoftheNewtoniteration(8).

Fig.3(a)depictsthemappedsamplesinthevectorspace,thatisxi,wherethe53 pointsrepresentingeachrealizationof

themanufacturingprocesshavebeencoloredaccordingtoitsoutput.Asthereadercanappreciate,twowelldifferentiated clustersareobtained, oneconcerningallthestableprocessesandtheothergroupingnon-stableconditions,withonlytwo points outoftheclusters.It isimportantto notethatbetter resultscanbeobtainedbyincreasing boththesamplingand theapproximationofthenonlinearmapping,thatis

K

intheapproximation(3).

Otherthanclusteringwithrespecttotheoutput,orusingthecomputingmappingforinferringstabilityfornewprocess conditions,onecouldalsotrytoextracttherelevanceofthedifferentparametersontheoutput.Forthatitsuﬃcescoloring the mapped points according to the value of the considered parameter. Figs. 3(b) and 3(c) reportthis analysis forboth materialandsuppliedrespectively.From theseimagesitcanbeconcludedthat materials F and D andsupplier D exhibit

stability.Onthecontrary,supplierC compromisesprocessstability.

ThesameanalysiswasperformedbyusingNN(neuralnetwork)withahiddenlayercomposedoftwoneuronswhose trainingwasensuredbydatayi andtheassociatedstabilityoutputs.TheNNisshowninFig.4,with53 dataconsistingof

6 inputs,withasingleoutput.Thedimensionisreducedfrom6 to2 byusing2 neuronsinthehiddenlayer,foremulating thelow-dimensionalspacegeneratedbyCode2Vect,whichisdepictedinFig.5wherepointsassociatedwithstableprocess conditions are colored in blue. It can be concluded that clustering is not achieved. Even ifwe cannot conclude on the

(6)

Fig. 4. Neural network.

Fig. 5. NN-based model.

Fig. 6. Acoustic absorbers.

superiorityofoneprocedurewithrespecttotheother (becausethelargenumberofNNvariants),wecanconcludeonthe simplicityandeﬃciencyoftheCode2Vect proposal.

3.2. Modelandregressionconstruction

Weconsideracousticsystemscomposedbyacousticresonators,whosesimpleststructure,theHelmholtzresonator–HR – isdepictedinFig.6a,andinvolvesthreemaingeometricparameters,length(L),section

(

A

)

andvolume(V ).HR canbe combinedtocreatemorecomplexacousticstructuresasillustratedinFig.6b,whosebehaviorisgovernedbythedynamic systemM

¨ξ +

C

˙ξ +

K

ξ

=

f (where

ξ

deﬁnestheairmotionamplitudeateverybottleneck)andwasdeeplyanalyzedin[15], fromwhichtheresonancefrequenciescanbeextracted.

Withoutlossofgenerality,theacousticsystemconsider:L1

=

L2

=

L3, A1

=

A2

=

A3 andV1

=

V2,reducingthenumber

(7)

Fig. 7. Vector space colored with respect to the parameter values (top) and output of interest (bottom).

wereconsideredandthetworesonancefrequenciescomputed,bothconstitutingtheoutputtobepredictedbythesearched data-basedmodel.

Results obtainedwhen applyingthe Code2Vect techniqueare depictedin Fig.7, wherethe two imagesat thebottom representthevectorspacewithsamplescoloredaccordingtotheassociatedoutputs.Inthepresentcase,thescalaroutput consideredinthemappingconstruction(Eq. (8))consistsofthenormofthevectorscontainingbothfrequencies,i.e.with

t_i

= (

ω

1

i

,

ω

2i

)

,

|

O

i

−

O

j

|

=

ti

−

tj

.

From Fig. 7, we can concludethat section and volume (V and L) are inversely correlated with the output, because forexampleyellowcolorconcentratesonthe bottomwhenrepresentingtheoutputwhereasit localizesonthetopwhen referringtotheseparameters.Theoppositeoccurswhenconsideringthecross-sectionarea( A).Theseﬁndingsareinperfect agreementwiththeknownphysicsofaHelmholtzResonator,whoseresonancefrequency

ω

risgivenby

ω

r

=

c 2

π

A V L (6)

Now,withthemodelW

(

y

)

relatinggeometricalparametersandresonancefrequencies extracted,itspredictive capabili-tiesarebeingtested.Forthatpurposethemodelisextractedbyusingonly60% ofthe64geometricalvariantsandtherest ofthesamples(theremaining40% yj)usedforvalidatingthepredictions,thatis,forcomparingthepredictions,theoutput tiatxj

=

W

(

yj

)

yjinterpolatedfromtheoutputtiattheneighboringsamplesxi.

Thepredictionsare depictedinFig.8; inyellowthesamplesusedastrainingandinbluethepredicted.Thedeviation withrespecttothediagonalrepresentsthepredictionerrorthatresults,asexpected,verylowforthesamplesusedinthe trainingandabitlarger,butlowerthan4% fortheothers.

From the discussion above, one expectsthat the best model isthe one relating thecombined parameter A

/

V L with

theresonancefrequencies, insteadoftheone that weconsidered abovethat considered thegeometricalparameters indi-vidually,thatis, L, A,andV .Inordertoprovethesuperiorityofmodelingbyconsidering thebestparametrization, Fig.9

compares the previous prediction withthe one obtained by using thecombined parameter A

/

V L that produced almost perfectpredictions.

However,thephysicsbehinddataisnotalwaysknowntoinformthebestchoiceofparameters.Aprocedurefor access-ing to thosemostrelevant combinedparameters constitutesone ofour mainresearch activities still in progress.Forthe moment,andinabsenceof“apriori”information,parameterswillbeselectedandincorporatedindividuallyinthemodel.

4. Conclusions

Theclassiﬁcationalgorithmpresentedhereenablestherepresentationofhigh-dimensionalheterogeneousdata,including continuousanddiscrete,quantitativeandcategorial.Itestablishesamappingbetweenarepresentationspaceandavector

(8)

Fig. 8. Prediction of the target.

Fig. 9. Predicting the output.

spacethatiskeptlowdimensionalforfacilitatingvisualization.Itallowssupervisedclassiﬁcationbutalsomakingprediction from itsnonlinear regressionnature. Moreover, itenablesa very simplerepresentation emphasizingthe relevance ofthe parametersinvolvedinthemodel.Itsperformancesinthelow-datalimithavebeenproved.

Acknowledgements

TheauthorsgratefullyacknowledgetheANR(“Agencenationaledelarecherche”,France)foritssupportthroughproject AAPG2018DataBEST.AIRBUSGroupthroughthePhDCIFREprogramandESIGroupthroughtheESI-ENSAMChairprogram.

Appendix A. Nonlinearmodeling

BydenotingWi

=

W

(

yi

)

,theNewtonproceedsatiterationn

Wn₊1,i

=

Wn,i

+

Wi

,

(7)

thatreplacedinthenonlinearcounterpartofEq. (2)

(

Wiyi

−

Wjyj

)

· (

Wiyi

−

Wjyj

)

=

xi

−

xj

2

= |

O

i

−

O

j

|,

(8)

thelefthandmemberreads

(

y_i

(

W_n_,_i

+

W_i

)

−

y_j

(

W_n_,_j

+

W_j

))((

Wn,i

+

Wi

)

yi

− (

Wn,j

+

Wj

)

yj

)

=

y_i

(

W_n_,_i

+

W_i

)(

Wn,i

+

Wi

)

yi

+

y_j

(

W_n_,_j

+

W_j

)(

Wn,j

+

Wj

)

yj

−

2y_j

(

W_n_,_j

+

W_j

)(

Wn,i

+

Wi

)

yi

.

(9)

SinceeachtermappearinginEq. (9) presentsthesamestructure,wedevelopthelinearizationforoneoftheseterms:

(9)

andsimilarlyfortheothertermsinEq. (9).

Thenonlinearmodelisapproximatedaccordingto

Wi

=

K

k=1

W

k

P

k

(

yi

).

(11)

Eq. (8) concernsdatayiandyj.Hence,byapplyingthesameprocedureforanydatapair,analgebraicsystemofequations

resultsinvolvingthe N

×

d

×

K unknowns. Thus,the safeconstructionofthe non-linear mapping requiresarich enough sampling,i.e.

N

×

d

×

K

<

P

2

−

P (12)

Thenonlinearapproximation (11) wasaccomplishedbyusingradial basisbasedoneitherthegaussian orthe inverse-quadraticradialkernels,bothmakinguseofthedistancer

rk

= ||

y

−

yk

||,

(13)

fromwhichbothinterpolantsread

P

k

(

r

)

=

e−(rk) 2

,

(14) and

P

k

(

r

)

=

1

+

e−(rk)2

,

(15)

with

affectingtheshapeoftheradialkernel.

References

[1]S.Roweis,L.Saul,Nonlineardimensionalityreductionbylocallylinearembedding,Science290 (5500)(2000)2323–2326.

[2] E.Lopez,D.Gonzalez,J.V.Aguado,E.Abisset-Chavanne,E.Cueto,C.Binetruy,F.Chinesta,Archivesofcomputationalmethodsinengineering,Int.J. Numer.MethodsEng.25 (1)(January2018)59–68,https://doi.org/10.1007/s11831-016-9172-5.

[3]L.VansDerMaaten,G.Hinton,Visualizingdatausingt-SNE,J.Mach.Learn.Res.9(2008)2579–2605.

[4]A.Criminisi,J.Shotton,E.Konukoglu,DecisionForestsforClassiﬁcation,Regression,DensityEstimation,ManifoldLearningandSemi-Supervised Learn-ing.MicrosoftResearchtechnicalreport,TR-2011-114,2011.

[5]J.Greenhalgh,M.Miermehdi,TraﬃcsignrecognitionusingMserandRandomforests,in:20thEuropeanSignalProcessingConference,2012. [6]J.Schmidhuber,Deeplearninginneuralnetworks:anoverview,NeuralNetw.61(2015)85–117.

[7]L.Yann,B.Yoshua,H.Geoffrey,Deeplearning,Nature521(2015)436–444.

[8]J.Donahue,L.A.Hendricks,M.Rohrbach,S.Venugopalan,S.Guadarrama,K.Saenko,T.Darrell,Long-termrecurrentconvolutionalnetworksforvisual recognitionanddescription,IEEETrans.PatternAnal.Mach.Intell.39 (4)(2017)677–691.

[9]C.Dong,C.C.Loy,K.He,X.Tang,Imagesuper-resolutionusingdeepconvolutionalnetworks,IEEETrans.PatternAnal.Mach.Intell.38 (2)(2016) 295–307.

[10]Y.Bengio,R.Ducharme,P.Vincent,C.Jauvin,Aneuralprobabilisticlanguagemodel,J.Mach.Learn.Res.3(2003)1137–1155. [11]P.Norvig,S.Russell,ArtiﬁcialIntelligence,aModernApproach,Prentice-Hall,1994.

[12]T.Mikolov,I.Sutskever,K.Chen,G.Corrado,J.Dean,Distributedrepresentationofwordsandphrasestheir compositionality,in:ProceedingNIPS’13 Proceedingsofthe26thInternationalConferenceonNeuralInformationProcessingSystems,vol. 2,2013.

[13]M.Abadi,etal.,GoogleResearchLarge-ScaleMachineLearningonHeterogeneousDistributedSystems,TechnicalReport,2015.

[14] R.Ibanez,E.Abbisset-Chavanne,A.Ammar,D.Gonzalze,E.Cueto,J.-L.Duval,F.Chinesta,Amultidimensionaldata-drivensparseidentiﬁcation tech-nique:thesparsepropergeneralizeddecomposition,Complexity(2018)5608286,https://doi.org/10.1155/2018/5608286.

[15]G.Quaranta,C.Argerich,R.Ibanez,J.-L.Duval,E.Cueto,F.Chinesta,FromlineartononlinearPGD-basedparametricstructuraldynamics,C.R.Mécanique 347(2019)445–454.

Code2vect: An efficient heterogenous data classifier and nonlinear regression technique

Science Arts & Métiers (SAM)

is an open access repository that collects the work of Arts et Métiers Institute of

Technology researchers and makes it freely available over the web where possible.

This is an author-deposited version published in:

https://sam.ensam.eu

Handle ID: .

http://hdl.handle.net/10985/18405

To cite this version :

Clara ARGERICH MARTÍN, Rubén IBÁÑEZ PINILLO, Anaïs BARASINSKI, Francisco CHINESTA