Code2vect: An efficient heterogenous data classifier and nonlinear regression technique

(1)

HAL Id: hal-02390399

https://hal.archives-ouvertes.fr/hal-02390399

Submitted on 4 Dec 2019

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-lished or not. The documents may come from

teaching and research institutions in France or

abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diffusion de documents

scientifiques de niveau recherche, publiés ou non,

émanant des établissements d’enseignement et de

recherche français ou étrangers, des laboratoires

publics ou privés.

Distributed under a Creative Commons Attribution - NonCommercial - NoDerivatives| 4.0

International License

Code2vect: An eﬀicient heterogenous data classifier and

nonlinear regression technique

Clara Argerich Martin, Rubén Ibáñez Pinillo, Anaïs Barasinski, Francisco

Chinesta

To cite this version:

(2)

nonlinear

regression

technique

Clara

Argerich

Martín

a

,

Ruben Ibáñez Pinillo

a

,

Anais Barasinski

b

,

Francisco Chinesta

c

,

∗

a_PIMM,_Arts_et_Métiers_Institute_of_Technology,_CNRS,_CNAM,_HESAM_University,_151,_boulevard_de_{l’Hôpital,}₇₅₀₁₃_Paris,_France b_University_of_Pau_&_Pays_Adour,_E2S_UPPA,_IPREM_UMR5254,₆₄₀₀₀_Pau,_France

c_ESI_GROUP_Chair_@_PIMM,_Arts_et_Métiers_Institute_of_Technology,_151,_boulevard_de_{l’Hôpital,}₇₅₀₁₃_Paris,_France

a

r

t

i

c

l

e

i

n

f

o

a

b

s

t

r

a

c

t

Articlehistory:

Received20May2019 Accepted1August2019

Availableonline15November2019

Keywords: Machinelearning Datarepresentation Classiﬁcation Categorialdata Neuralnetwork High-dimensionaldata Regression

The aim of this paper is to present a new classification and regression algorithm based on Artificial Intelligence. The main feature of this algorithm, which will be called Code2Vect, is the nature of the data to treat: qualitative or quantitative and continuous or discrete. Contrary to other artificial intelligence techniques based on the “Big-Data,” this new approach will enable working with a reduced amount of data, within the so-called “Smart Data” paradigm. Moreover, the main purpose of this algorithm is to enable the representation of high-dimensional data and more specifically grouping and visualizing this data according to a given target. For that purpose, the data will be projected into a vectorial space equipped with an appropriate metric, able to group data according to their affinity (with respect to a given output of interest). Furthermore, another application of this algorithm lies on its prediction capability. As it occurs with most common data-mining techniques such as regression trees, by giving an input the output will be inferred, in this case considering the nature of the data formerly described. In order to illustrate its potentialities, two different applications will be addressed, one concerning the representation of high-dimensional and categorical data and another featuring the prediction capabilities of the algorithm.

1. Introduction

Theuseofartificialintelligencetoworkwithdatainordertoclassify,representoranalyzeitisafieldthatisattracting the moreandmoreinterestfromboththescientific andtechnologicalviewpoints.Manytechniquesforthispurposehave beendevelopedduring thelast years,suchastheLocallyLinearEmbedding,presentedin[1] andlargelyused fordiverse applications [2]. The tSNE (t-distributed Stochastic NeighborEmbedding) presented in [3], is widely used for visualizing high-dimensional dataanddiscoverthelatentmanifold inwhichdataisembedded,whosedimensionalityinformsonthe numberofexplicativeuncorrelatedparameters.Otheralgorithmsweredesignedforextractinghiddenregressionsfromdata, andthenbeingabletoinferaresponse(output)foragiveninput.Linearandnonlinearregressionswerewidelyconsidered in a variety of worksand diverse applications,decision trees and random forest were proposed foroperating in highly

*

Correspondingauthor.

E-mailaddresses:[email protected](C. ArgerichMartín),[email protected](R. Ibáñez Pinillo),

[email protected](A. Barasinski),[email protected](F. Chinesta).

https://doi.org/10.1016/j.crme.2019.11.002

(3)

Fig. 1. Input space (left) and target vector space (right).

dimensional spaces. In [4] authors perform a review of these techniques and some applications addressed in [5] using randomforests.Regressiontechniquesoperatequitewelleveninthelow-datalimit,inparticulartheonesbasedonsparse sensing combinedwiththe useofseparated representationformats. Whendata isabundant, the mostpowerfullearning techniquesaretheonesbasedontheuseofneuralnetworks,givenplacetotheso-calleddeep-learning(see[6] and[7] for generalknowledgeondeeplearning).

One ofthemostused applicationsofdeep Learningisvisual recognition andimage classiﬁcation[8,9]. Deep learning techniquesare alsousedforprobabilisticlanguage [10]. Interestedreaderscan referto [11] andthenumerousreferences therein.

Oneofthemostrecentalgorithmsdevelopedforheterogeneousdataclassificationistheso-calledWord2Vec,developed by Google inc. This algorithm presented in [12] is based on the use of neural networks[13] and the representation of eachwordasavectorrepresentingtheprobabilityofhavingagivenneighboringword.Thisapproachtriedtocircumvent the difficulty of defining distances between qualitative data, that is, one could consider that yellow and red are quite closebecausebotharerepresentingcolors,however,fromthespellingofbothwordssuchaproximitybecomesdifficultto quantify.

Infieldslikemanufacturingengineeringsuchanapproachisparticularlyrelevantbecausetheperformancesoffor exam-plea manufacturedplasticpartdependsontheresin used(definedby itscommercialappellation orthe chemicalnature also codified but differently), the oven temperature(in a particular unit system), the curing time and the name of the employe(e.g.,Peter,Paul...).

Now,aprocesscouldbecharacterizedfromthesefourparameters:(i)resin,(ii)temperature, (iii)processtimeand(iv) employe surname; andwitheach one we could associate a labelassociatedwith aparticular output ofinterest, a value quantifyingaparticularperformanceoftheformedpart.

Thus, each process could be described by a point in a four-dimensional space, each point having a color depending on thevalue associated withtheoutput ofinterest. The fouraxes representtheheterogeneous parameters appropriately codiﬁed.Forexample,anumericcodemustbeassociatedwitheach resinoroperatorname(e.g.,polyester

=

1,vinylester

=

2,...Peter

=

1,Paul

=

2,...)inordertorepresentqualitativedataontheresinoroperatoraxes...

However,thesechoicesaretotallyarbitrary,aswellasunitsinthedifferentquantitativedata(temperatureandprocess time in ourexample). Thus, it isevident that calculatingdistances betweenprocesses insuch a space hasnot sense. In otherwords,twopointsrepresentingtwoprocessescould beverycloseinthefour-dimensionalspacebuthaveoutputsof interestverydifferentand,oppositely,remotepointscouldimplyverysimilarperformances.

Thisissue isnot new,andmany authorswhen considering heterogeneousbutquantitative data,normalized themfor minimizing theimpactofunitsinthecomputation ofdistances(forexampleusual stressesandstraininsolid mechanics differinabout12ordersofmagnitude,theformerbeingoforder106 andthelast10−6).However,normalizationhidesthe issuewithoutreallysolvingit.

Inwhatfollows,weproposeatechnique,sketchedinFig.1,formappingpointsfromarepresentationspacetoatarget spaceequippedofanEuclideanmetricallowingthequantiﬁcationofdistancesandthenapplyingallthenumericalartillery usingdistances,asconsideredlater.

2. Classiﬁcationandregressionstrategies

In thissection, themethodology sketched inSection 1 is described. Weassume that the points in the original space (space of representation) consist of

P

arrays composed on N entries (four in the example considered above). They are assumedarraysbecausetheycannotbeconsideredvectors,andarenotedbyyi.Theirimagesinthevectorspacearenoted

by xi

∈ R

d, withd

N, thistime realvectors subjected to therules ofcoordinate transformation.That vector spaceis

equippedwiththe standardscalarproductandtheassociatedEuclideandistance.Themappingisdescribed bythed

×

N

(4)

whereboth thecomponentsofW andtheimagesxi

∈ R

d,i

=

1

,

· · · ,

P

mustbecalculated.Eachpoint xi keepsthelabel,

denotedby

O

i(valueoftheoutputofinterest,hereassumedscalar),associatedwithitsoriginpointyi.

We wouldlike placing points xi,such that theEuclidean distancewitheach other point xj scales withtheir outputs

difference,i.e.

(

W

(

yi

−

yj

))

· (

W

(

yi

−

yj

))

=

xi

−

xj

2

= |

O

i

−

O

j

|,

(2)

wherethecoordinatesofoneofthepoints canbearbitrarilychosen. Thus,thereare P2

2

− P

relationsfordeterminingthe d

×

N (thecomponentsofW becausethecoordinatesxi areobtainedfromW byapplyingrelation(1)).

Linear mappings are limited and do not allow operating in nonlinear settings. Thus, a better choice consists of the nonlinearmappingW

(

y

)

,expressiblefromthegeneralpolynomialform

W

(

y

)

=

K

k=1

W

k

P

k

(

y

),

(3)

wherematrices

W

khavethesamedimensionsas W andmustensuretheyj

→

xjmapping.

Ontheotherhand,theinterpolationfunctions

P

k

(

y

)

canbecontinuousordiscontinuous.Inthenumericalapplications

describedlater,standardpolynomialorradialfunctionsareconsidered.

The associated nonlinear problemcan be eﬃciently solved by usingan adequate linearization strategy, e.g., Newton’s methodaspresentedinAppendixA.Becauseoftheincrease inthenumberofunknownstobe determined,arichenough samplingisrequired.

Thechoiceofd deservessomecomments.Thelowerd is,thestrongerbecomes themappingnonlinearity,howeverthe visualizationbecomessimpler.

When thedimensionalityN increases approximationssuffertheso-calledcurseofdimensionality.Averyeﬃcientway ofalleviatingitconsistsinusingseparatedrepresentations[14] fromwhichthemappingreads

W

(

y

)

=

K

k=1

W

k N

i=1 Pi_k

(

yi

).

(4)

Inthepreviousexpression yi referstothei-componentofy.

Theuseoftheseseparatedrepresentationsconstitutesaworkinprogress.

3. Numericalexamples

Thissectionaddressestwocasestudies,thefirstdevotedtoclassificationofheterogenousdataandthesecondaimingat emphasizingtheclassificationandnonlinearregressioncapabilitiesofthetechniquejustintroduced.

3.1. Representationofhigh-dimensionalcategoricaldatafrommanufacturingprocesses

Inthissection anapplicationwithrealandhighlynon-lineardataispresented.Weconsideradatabaseprovidedbyan industry relatedtoarealmanufacturingprocess.The inputspaceconcernssixmanufacturingparameters collectedduring 53 realizationsoftheprocess.Parametersconcernfourcategoricalparameters:material,supplier,two processparameters whosequantificationisshowninTable1andtwootherquantitative processparameters.Theoutputisthestabilityofthe processitself,alsoquantitivelycodifiedasshowninTable1.Wordsarecodifiedbyusingnumbersintothe

[−

1

,

1

]

interval. Inthisparticularcasetheinput data,yi,

∀

i,isgroupedina 6

×

53 matrix(53realizationsoftheprocessdeﬁnedby 6

input parameters),withasinglescalartarget.Thevectorspacexi consistsofa2

×

53 matrix(53realizationsmappedinto

(5)

Fig. 2. Residue and tolerance related to the computation of mapping W.

Fig. 3. Vector space representations: output, material, and supplier.

Oncetheinputdataandtheoutputisquantitativelycodiﬁed,werunthealgorithmdescribedintheprevioussection,in particularitsnonlinearcounterpart,thatreachesconvergenceinfewiterations(indicatedbyindexi)asFig.2proves,where thetolerance,deﬁnedby

Tolerance

=

log

Wi

−

Wi−1

(5)

isdepictedateachiteration,withtheresidueoftheNewtoniteration(8).

Fig.3(a)depictsthemappedsamplesinthevectorspace,thatisxi,wherethe53 pointsrepresentingeachrealizationof

themanufacturingprocesshavebeencoloredaccordingtoitsoutput.Asthereadercanappreciate,twowelldifferentiated clustersareobtained, oneconcerningallthestableprocessesandtheothergroupingnon-stableconditions,withonlytwo points outoftheclusters.It isimportantto notethatbetter resultscanbeobtainedbyincreasing boththesamplingand theapproximationofthenonlinearmapping,thatis

K

intheapproximation(3).

Otherthanclusteringwithrespecttotheoutput,orusingthecomputingmappingforinferringstabilityfornewprocess conditions,onecouldalsotrytoextracttherelevanceofthedifferentparametersontheoutput.Forthatitsuﬃcescoloring the mapped points according to the value of the considered parameter. Figs. 3(b) and 3(c) reportthis analysis forboth materialandsuppliedrespectively.From theseimagesitcanbeconcludedthat materials F and D andsupplier D exhibit

stability.Onthecontrary,supplierC compromisesprocessstability.

ThesameanalysiswasperformedbyusingNN(neuralnetwork)withahiddenlayercomposedoftwoneuronswhose trainingwasensuredbydatayi andtheassociatedstabilityoutputs.TheNNisshowninFig.4,with53 dataconsistingof

(6)

Fig. 5. NN-based model.

Fig. 6. Acoustic absorbers.

superiorityofoneprocedurewithrespecttotheother (becausethelargenumberofNNvariants),wecanconcludeonthe simplicityandeﬃciencyoftheCode2Vect proposal.

3.2. Modelandregressionconstruction

Weconsideracousticsystemscomposedbyacousticresonators,whosesimpleststructure,theHelmholtzresonator–HR –isdepictedinFig.6a,andinvolvesthreemaingeometricparameters,length(L),section

(

A

)

andvolume(V ).HR canbe combinedtocreatemorecomplexacousticstructuresasillustratedinFig.6b,whosebehaviorisgovernedbythedynamic systemM

¨ξ +

C

˙ξ +

K

ξ

=

f (where

ξ

deﬁnestheairmotionamplitudeateverybottleneck)andwasdeeplyanalyzedin[15], fromwhichtheresonancefrequenciescanbeextracted.

Withoutlossofgenerality,theacousticsystemconsider:L1

=

L2

=

L3, A1

=

A2

=

A3 andV1

=

V2,reducingthenumber

(7)

Fig. 7. Vector space colored with respect to the parameter values (top) and output of interest (bottom).

wereconsideredandthetworesonancefrequenciescomputed,bothconstitutingtheoutputtobepredictedbythesearched data-basedmodel.

Results obtainedwhen applyingthe Code2Vect techniqueare depictedin Fig.7, wherethe two imagesat thebottom representthevectorspacewithsamplescoloredaccordingtotheassociatedoutputs.Inthepresentcase,thescalaroutput consideredinthemappingconstruction(Eq. (8))consistsofthenormofthevectorscontainingbothfrequencies,i.e.with t_i

= (

ω

1

i

,

ω

2i

)

,

|

O

i

−

O

j

|

=

ti

−

tj

.

From Fig. 7, we can concludethat section and volume (V and L) are inversely correlated with the output, because forexampleyellowcolorconcentratesonthe bottomwhenrepresentingtheoutputwhereasit localizesonthetopwhen referringtotheseparameters.Theoppositeoccurswhenconsideringthecross-sectionarea( A).Theseﬁndingsareinperfect agreementwiththeknownphysicsofaHelmholtzResonator,whoseresonancefrequency

ω

risgivenby

ω

r

=

c 2

π

A V L (6)

Now,withthemodelW

(

y

)

relatinggeometricalparametersandresonancefrequencies extracted,itspredictive capabili-tiesarebeingtested.Forthatpurposethemodelisextractedbyusingonly60% ofthe64geometricalvariantsandtherest ofthesamples(theremaining40% yj)usedforvalidatingthepredictions,thatis,forcomparingthepredictions,theoutput

tiatxj

=

W

(

yj

)

yjinterpolatedfromtheoutputtiattheneighboringsamplesxi.

Thepredictionsare depictedinFig.8; inyellowthesamplesusedastrainingandinbluethepredicted.Thedeviation withrespecttothediagonalrepresentsthepredictionerrorthatresults,asexpected,verylowforthesamplesusedinthe trainingandabitlarger,butlowerthan4% fortheothers.

From the discussion above, one expectsthat the best model isthe one relating thecombined parameter A

/

V L with

theresonancefrequencies, insteadoftheone that weconsidered abovethat considered thegeometricalparameters indi-vidually,thatis, L, A,andV .Inordertoprovethesuperiorityofmodelingbyconsidering thebestparametrization, Fig.9

compares the previous prediction withthe one obtained by using thecombined parameter A

/

V L that produced almost perfectpredictions.

However,thephysicsbehinddataisnotalwaysknowntoinformthebestchoiceofparameters.Aprocedurefor access-ing to thosemostrelevant combinedparameters constitutesone ofour mainresearch activities still in progress.Forthe moment,andinabsenceof“apriori”information,parameterswillbeselectedandincorporatedindividuallyinthemodel.

4. Conclusions

(8)

Fig. 8. Prediction of the target.

Fig. 9. Predicting the output.

spacethatiskeptlowdimensionalforfacilitatingvisualization.Itallowssupervisedclassiﬁcationbutalsomakingprediction from itsnonlinear regressionnature. Moreover, itenablesa very simplerepresentation emphasizingthe relevance ofthe parametersinvolvedinthemodel.Itsperformancesinthelow-datalimithavebeenproved.

Acknowledgements

TheauthorsgratefullyacknowledgetheANR(“Agencenationaledelarecherche”,France)foritssupportthroughproject AAPG2018DataBEST.AIRBUSGroupthroughthePhDCIFREprogramandESIGroupthroughtheESI-ENSAMChairprogram.

Appendix A. Nonlinearmodeling

BydenotingWi

=

W

(

yi

)

,theNewtonproceedsatiterationn

Wn+1,i

=

Wn,i

+

Wi

,

(7)

thatreplacedinthenonlinearcounterpartofEq. (2)

(

Wiyi

−

Wjyj

)

· (

Wiyi

−

Wjyj

)

=

xi

−

xj

2

= |

O

i

−

O

j

|,

(8)

thelefthandmemberreads

(

y_i

(

W_n_,_i

+

W_i

)

−

y_j

(

W_n_,_j

+

W_j

))((

Wn,i

+

Wi

)

yi

− (

Wn,j

+

Wj

)

yj

)

=

y_i

(

W_n_,_i

+

W_i

)(

Wn,i

+

Wi

)

yi

+

y_j

(

W_n_,_j

+

W_j

)(

Wn,j

+

Wj

)

yj

−

2y_j

(

W_n_,_j

+

W_j

)(

Wn,i

+

Wi

)

yi

.

(9)

SinceeachtermappearinginEq. (9) presentsthesamestructure,wedevelopthelinearizationforoneoftheseterms:

(9)

andsimilarlyfortheothertermsinEq. (9).

Thenonlinearmodelisapproximatedaccordingto

Wi

=

K

k=1

W

k

P

k

(

yi

).

(11)

Eq. (8) concernsdatayiandyj.Hence,byapplyingthesameprocedureforanydatapair,analgebraicsystemofequations

resultsinvolvingthe N

×

d

×

K unknowns. Thus,the safeconstructionofthe non-linear mapping requiresarich enough sampling,i.e.

N

×

d

×

K

<

P 2

2

−

P (12)

Thenonlinearapproximation (11) wasaccomplishedbyusingradial basisbasedoneitherthegaussian orthe inverse-quadraticradialkernels,bothmakinguseofthedistancer

rk

= ||

y

−

yk

||,

(13)

fromwhichbothinterpolantsread

P

k

(

r

)

=

e−(rk) 2

,

(14) and

P

k

(

r

)

=

1

+

e−(rk)2

,

(15)

with

affectingtheshapeoftheradialkernel. References

[1]S.Roweis,L.Saul,Nonlineardimensionalityreductionbylocallylinearembedding,Science290 (5500)(2000)2323–2326.

[2] E.Lopez,D.Gonzalez,J.V.Aguado,E.Abisset-Chavanne,E.Cueto,C.Binetruy,F.Chinesta,Archivesofcomputationalmethodsinengineering,Int.J. Numer.MethodsEng.25 (1)(January2018)59–68,https://doi.org/10.1007/s11831-016-9172-5.

[3]L.VansDerMaaten,G.Hinton,Visualizingdatausingt-SNE,J.Mach.Learn.Res.9(2008)2579–2605.

[4]A.Criminisi,J.Shotton,E.Konukoglu,DecisionForestsforClassiﬁcation,Regression,DensityEstimation,ManifoldLearningandSemi-Supervised Learn-ing.MicrosoftResearchtechnicalreport,TR-2011-114,2011.

[5]J.Greenhalgh,M.Miermehdi,TraﬃcsignrecognitionusingMserandRandomforests,in:20thEuropeanSignalProcessingConference,2012.

[6]J.Schmidhuber,Deeplearninginneuralnetworks:anoverview,NeuralNetw.61(2015)85–117.

[7]L.Yann,B.Yoshua,H.Geoffrey,Deeplearning,Nature521(2015)436–444.

[8]J.Donahue,L.A.Hendricks,M.Rohrbach,S.Venugopalan,S.Guadarrama,K.Saenko,T.Darrell,Long-termrecurrentconvolutionalnetworksforvisual recognitionanddescription,IEEETrans.PatternAnal.Mach.Intell.39 (4)(2017)677–691.

[9]C.Dong,C.C.Loy,K.He,X.Tang,Imagesuper-resolutionusingdeepconvolutionalnetworks,IEEETrans.PatternAnal.Mach.Intell.38 (2)(2016) 295–307.

[10]Y.Bengio,R.Ducharme,P.Vincent,C.Jauvin,Aneuralprobabilisticlanguagemodel,J.Mach.Learn.Res.3(2003)1137–1155.

[11]P.Norvig,S.Russell,ArtiﬁcialIntelligence,aModernApproach,Prentice-Hall,1994.

[12]T.Mikolov,I.Sutskever,K.Chen,G.Corrado,J.Dean,Distributedrepresentationofwordsandphrasestheir compositionality,in:ProceedingNIPS’13 Proceedingsofthe26thInternationalConferenceonNeuralInformationProcessingSystems,vol. 2,2013.

[13]M.Abadi,etal.,GoogleResearchLarge-ScaleMachineLearningonHeterogeneousDistributedSystems,TechnicalReport,2015.

[14] R.Ibanez,E.Abbisset-Chavanne,A.Ammar,D.Gonzalze,E.Cueto,J.-L.Duval,F.Chinesta,Amultidimensionaldata-drivensparseidentiﬁcation tech-nique:thesparsepropergeneralizeddecomposition,Complexity(2018)5608286,https://doi.org/10.1155/2018/5608286.

Code2vect: An efficient heterogenous data classifier and nonlinear regression technique

HAL Id: hal-02390399

https://hal.archives-ouvertes.fr/hal-02390399

Submitted on 4 Dec 2019

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-lished or not. The documents may come from

teaching and research institutions in France or

abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diffusion de documents

scientifiques de niveau recherche, publiés ou non,

émanant des établissements d’enseignement et de

recherche français ou étrangers, des laboratoires

publics ou privés.

Distributed under a Creative Commons Attribution - NonCommercial - NoDerivatives| 4.0

International License

Code2vect: An eﬀicient heterogenous data classifier and

nonlinear regression technique

Clara Argerich Martin, Rubén Ibáñez Pinillo, Anaïs Barasinski, Francisco

Chinesta

To cite this version:

nonlinear

regression

technique

Clara

Argerich

Martín

,

Ruben Ibáñez Pinillo

,

Anais Barasinski

,

Francisco Chinesta

,

∗

a

r

t

i

c

l

e

i

n

f

o

a

b

s

t

r

a

c

t

*

=

=

=

=

P

∈ R

×

∈ R

=

,

· · · ,

P

O

(

(

−

))

· (

(

−

))

= 

−

=

=