• Aucun résultat trouvé

Code2vect: An efficient heterogenous data classifier and nonlinear regression technique

N/A
N/A
Protected

Academic year: 2021

Partager "Code2vect: An efficient heterogenous data classifier and nonlinear regression technique"

Copied!
9
0
0

Texte intégral

(1)

HAL Id: hal-02390399

https://hal.archives-ouvertes.fr/hal-02390399

Submitted on 4 Dec 2019

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-lished or not. The documents may come from

teaching and research institutions in France or

abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diffusion de documents

scientifiques de niveau recherche, publiés ou non,

émanant des établissements d’enseignement et de

recherche français ou étrangers, des laboratoires

publics ou privés.

Distributed under a Creative Commons Attribution - NonCommercial - NoDerivatives| 4.0

International License

Code2vect: An efficient heterogenous data classifier and

nonlinear regression technique

Clara Argerich Martin, Rubén Ibáñez Pinillo, Anaïs Barasinski, Francisco

Chinesta

To cite this version:

(2)

nonlinear

regression

technique

Clara

Argerich

Martín

a

,

Ruben Ibáñez Pinillo

a

,

Anais Barasinski

b

,

Francisco Chinesta

c

,

aPIMM,ArtsetMétiersInstituteofTechnology,CNRS,CNAM,HESAMUniversity,151,boulevarddel’Hôpital,75013Paris,France bUniversityofPau&PaysAdour,E2SUPPA,IPREMUMR5254,64000Pau,France

cESIGROUPChair@PIMM,ArtsetMétiersInstituteofTechnology,151,boulevarddel’Hôpital,75013Paris,France

a

r

t

i

c

l

e

i

n

f

o

a

b

s

t

r

a

c

t

Articlehistory:

Received20May2019 Accepted1August2019

Availableonline15November2019

Keywords: Machinelearning Datarepresentation Classification Categorialdata Neuralnetwork High-dimensionaldata Regression

The aim of this paper is to present a new classification and regression algorithm based on Artificial Intelligence. The main feature of this algorithm, which will be called Code2Vect, is the nature of the data to treat: qualitative or quantitative and continuous or discrete. Contrary to other artificial intelligence techniques based on the “Big-Data,” this new approach will enable working with a reduced amount of data, within the so-called “Smart Data” paradigm. Moreover, the main purpose of this algorithm is to enable the representation of high-dimensional data and more specifically grouping and visualizing this data according to a given target. For that purpose, the data will be projected into a vectorial space equipped with an appropriate metric, able to group data according to their affinity (with respect to a given output of interest). Furthermore, another application of this algorithm lies on its prediction capability. As it occurs with most common data-mining techniques such as regression trees, by giving an input the output will be inferred, in this case considering the nature of the data formerly described. In order to illustrate its potentialities, two different applications will be addressed, one concerning the representation of high-dimensional and categorical data and another featuring the prediction capabilities of the algorithm.

©2019 Académie des sciences. Published by Elsevier Masson SAS. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

Theuseofartificialintelligencetoworkwithdatainordertoclassify,representoranalyzeitisafieldthatisattracting the moreandmoreinterestfromboththescientific andtechnologicalviewpoints.Manytechniquesforthispurposehave beendevelopedduring thelast years,suchastheLocallyLinearEmbedding,presentedin[1] andlargelyused fordiverse applications [2]. The tSNE (t-distributed Stochastic NeighborEmbedding) presented in [3], is widely used for visualizing high-dimensional dataanddiscoverthelatentmanifold inwhichdataisembedded,whosedimensionalityinformsonthe numberofexplicativeuncorrelatedparameters.Otheralgorithmsweredesignedforextractinghiddenregressionsfromdata, andthenbeingabletoinferaresponse(output)foragiveninput.Linearandnonlinearregressionswerewidelyconsidered in a variety of worksand diverse applications,decision trees and random forest were proposed foroperating in highly

*

Correspondingauthor.

E-mailaddresses:[email protected](C. ArgerichMartín),[email protected](R. Ibáñez Pinillo),

[email protected](A. Barasinski),[email protected](F. Chinesta).

https://doi.org/10.1016/j.crme.2019.11.002

(3)

Fig. 1. Input space (left) and target vector space (right).

dimensional spaces. In [4] authors perform a review of these techniques and some applications addressed in [5] using randomforests.Regressiontechniquesoperatequitewelleveninthelow-datalimit,inparticulartheonesbasedonsparse sensing combinedwiththe useofseparated representationformats. Whendata isabundant, the mostpowerfullearning techniquesaretheonesbasedontheuseofneuralnetworks,givenplacetotheso-calleddeep-learning(see[6] and[7] for generalknowledgeondeeplearning).

One ofthemostused applicationsofdeep Learningisvisual recognition andimage classification[8,9]. Deep learning techniquesare alsousedforprobabilisticlanguage [10]. Interestedreaderscan referto [11] andthenumerousreferences therein.

Oneofthemostrecentalgorithmsdevelopedforheterogeneousdataclassificationistheso-calledWord2Vec,developed by Google inc. This algorithm presented in [12] is based on the use of neural networks[13] and the representation of eachwordasavectorrepresentingtheprobabilityofhavingagivenneighboringword.Thisapproachtriedtocircumvent the difficulty of defining distances between qualitative data, that is, one could consider that yellow and red are quite closebecausebotharerepresentingcolors,however,fromthespellingofbothwordssuchaproximitybecomesdifficultto quantify.

Infieldslikemanufacturingengineeringsuchanapproachisparticularlyrelevantbecausetheperformancesoffor exam-plea manufacturedplasticpartdependsontheresin used(definedby itscommercialappellation orthe chemicalnature also codified but differently), the oven temperature(in a particular unit system), the curing time and the name of the employe(e.g.,Peter,Paul...).

Now,aprocesscouldbecharacterizedfromthesefourparameters:(i)resin,(ii)temperature, (iii)processtimeand(iv) employe surname; andwitheach one we could associate a labelassociatedwith aparticular output ofinterest, a value quantifyingaparticularperformanceoftheformedpart.

Thus, each process could be described by a point in a four-dimensional space, each point having a color depending on thevalue associated withtheoutput ofinterest. The fouraxes representtheheterogeneous parameters appropriately codified.Forexample,anumericcodemustbeassociatedwitheach resinoroperatorname(e.g.,polyester

=

1,vinylester

=

2,...Peter

=

1,Paul

=

2,...)inordertorepresentqualitativedataontheresinoroperatoraxes...

However,thesechoicesaretotallyarbitrary,aswellasunitsinthedifferentquantitativedata(temperatureandprocess time in ourexample). Thus, it isevident that calculatingdistances betweenprocesses insuch a space hasnot sense. In otherwords,twopointsrepresentingtwoprocessescould beverycloseinthefour-dimensionalspacebuthaveoutputsof interestverydifferentand,oppositely,remotepointscouldimplyverysimilarperformances.

Thisissue isnot new,andmany authorswhen considering heterogeneousbutquantitative data,normalized themfor minimizing theimpactofunitsinthecomputation ofdistances(forexampleusual stressesandstraininsolid mechanics differinabout12ordersofmagnitude,theformerbeingoforder106 andthelast10−6).However,normalizationhidesthe issuewithoutreallysolvingit.

Inwhatfollows,weproposeatechnique,sketchedinFig.1,formappingpointsfromarepresentationspacetoatarget spaceequippedofanEuclideanmetricallowingthequantificationofdistancesandthenapplyingallthenumericalartillery usingdistances,asconsideredlater.

2. Classificationandregressionstrategies

In thissection, themethodology sketched inSection 1 is described. Weassume that the points in the original space (space of representation) consist of

P

arrays composed on N entries (four in the example considered above). They are assumedarraysbecausetheycannotbeconsideredvectors,andarenotedbyyi.Theirimagesinthevectorspacearenoted

by xi

∈ R

d, withd



N, thistime realvectors subjected to therules ofcoordinate transformation.That vector spaceis

equippedwiththe standardscalarproductandtheassociatedEuclideandistance.Themappingisdescribed bythed

×

N

(4)

whereboth thecomponentsofW andtheimagesxi

∈ R

d,i

=

1

,

· · · ,

P

mustbecalculated.Eachpoint xi keepsthelabel,

denotedby

O

i(valueoftheoutputofinterest,hereassumedscalar),associatedwithitsoriginpointyi.

We wouldlike placing points xi,such that theEuclidean distancewitheach other point xj scales withtheir outputs

difference,i.e.

(

W

(

yi

yj

))

· (

W

(

yi

yj

))

= 

xi

xj



2

= |

O

i

O

j

|,

(2)

wherethecoordinatesofoneofthepoints canbearbitrarilychosen. Thus,thereare P2

2

− P

relationsfordeterminingthe d

×

N (thecomponentsofW becausethecoordinatesxi areobtainedfromW byapplyingrelation(1)).

Linear mappings are limited and do not allow operating in nonlinear settings. Thus, a better choice consists of the nonlinearmappingW

(

y

)

,expressiblefromthegeneralpolynomialform

W

(

y

)

=

K



k=1

W

k

P

k

(

y

),

(3)

wherematrices

W

khavethesamedimensionsas W andmustensuretheyj

xjmapping.

Ontheotherhand,theinterpolationfunctions

P

k

(

y

)

canbecontinuousordiscontinuous.Inthenumericalapplications

describedlater,standardpolynomialorradialfunctionsareconsidered.

The associated nonlinear problemcan be efficiently solved by usingan adequate linearization strategy, e.g., Newton’s methodaspresentedinAppendixA.Becauseoftheincrease inthenumberofunknownstobe determined,arichenough samplingisrequired.

Thechoiceofd deservessomecomments.Thelowerd is,thestrongerbecomes themappingnonlinearity,howeverthe visualizationbecomessimpler.

When thedimensionalityN increases approximationssuffertheso-calledcurseofdimensionality.Averyefficientway ofalleviatingitconsistsinusingseparatedrepresentations[14] fromwhichthemappingreads

W

(

y

)

=

K



k=1

W

k N



i=1 Pik

(

yi

).

(4)

Inthepreviousexpression yi referstothei-componentofy.

Theuseoftheseseparatedrepresentationsconstitutesaworkinprogress.

3. Numericalexamples

Thissectionaddressestwocasestudies,thefirstdevotedtoclassificationofheterogenousdataandthesecondaimingat emphasizingtheclassificationandnonlinearregressioncapabilitiesofthetechniquejustintroduced.

3.1. Representationofhigh-dimensionalcategoricaldatafrommanufacturingprocesses

Inthissection anapplicationwithrealandhighlynon-lineardataispresented.Weconsideradatabaseprovidedbyan industry relatedtoarealmanufacturingprocess.The inputspaceconcernssixmanufacturingparameters collectedduring 53 realizationsoftheprocess.Parametersconcernfourcategoricalparameters:material,supplier,two processparameters whosequantificationisshowninTable1andtwootherquantitative processparameters.Theoutputisthestabilityofthe processitself,alsoquantitivelycodifiedasshowninTable1.Wordsarecodifiedbyusingnumbersintothe

[−

1

,

1

]

interval. Inthisparticularcasetheinput data,yi,

i,isgroupedina 6

×

53 matrix(53realizationsoftheprocessdefinedby 6

input parameters),withasinglescalartarget.Thevectorspacexi consistsofa2

×

53 matrix(53realizationsmappedinto

(5)

Fig. 2. Residue and tolerance related to the computation of mapping W.

Fig. 3. Vector space representations: output, material, and supplier.

Oncetheinputdataandtheoutputisquantitativelycodified,werunthealgorithmdescribedintheprevioussection,in particularitsnonlinearcounterpart,thatreachesconvergenceinfewiterations(indicatedbyindexi)asFig.2proves,where thetolerance,definedby

Tolerance

=

log



Wi

Wi−1





Wi−1



(5)

isdepictedateachiteration,withtheresidueoftheNewtoniteration(8).

Fig.3(a)depictsthemappedsamplesinthevectorspace,thatisxi,wherethe53 pointsrepresentingeachrealizationof

themanufacturingprocesshavebeencoloredaccordingtoitsoutput.Asthereadercanappreciate,twowelldifferentiated clustersareobtained, oneconcerningallthestableprocessesandtheothergroupingnon-stableconditions,withonlytwo points outoftheclusters.It isimportantto notethatbetter resultscanbeobtainedbyincreasing boththesamplingand theapproximationofthenonlinearmapping,thatis

K

intheapproximation(3).

Otherthanclusteringwithrespecttotheoutput,orusingthecomputingmappingforinferringstabilityfornewprocess conditions,onecouldalsotrytoextracttherelevanceofthedifferentparametersontheoutput.Forthatitsufficescoloring the mapped points according to the value of the considered parameter. Figs. 3(b) and 3(c) reportthis analysis forboth materialandsuppliedrespectively.From theseimagesitcanbeconcludedthat materials F and D andsupplier D exhibit

stability.Onthecontrary,supplierC compromisesprocessstability.

ThesameanalysiswasperformedbyusingNN(neuralnetwork)withahiddenlayercomposedoftwoneuronswhose trainingwasensuredbydatayi andtheassociatedstabilityoutputs.TheNNisshowninFig.4,with53 dataconsistingof

(6)

Fig. 5. NN-based model.

Fig. 6. Acoustic absorbers.

superiorityofoneprocedurewithrespecttotheother (becausethelargenumberofNNvariants),wecanconcludeonthe simplicityandefficiencyoftheCode2Vect proposal.

3.2. Modelandregressionconstruction

Weconsideracousticsystemscomposedbyacousticresonators,whosesimpleststructure,theHelmholtzresonator–HR –isdepictedinFig.6a,andinvolvesthreemaingeometricparameters,length(L),section

(

A

)

andvolume(V ).HR canbe combinedtocreatemorecomplexacousticstructuresasillustratedinFig.6b,whosebehaviorisgovernedbythedynamic systemM

¨ξ +

C

˙ξ +

K

ξ

=

f (where

ξ

definestheairmotionamplitudeateverybottleneck)andwasdeeplyanalyzedin[15], fromwhichtheresonancefrequenciescanbeextracted.

Withoutlossofgenerality,theacousticsystemconsider:L1

=

L2

=

L3, A1

=

A2

=

A3 andV1

=

V2,reducingthenumber

(7)

Fig. 7. Vector space colored with respect to the parameter values (top) and output of interest (bottom).

wereconsideredandthetworesonancefrequenciescomputed,bothconstitutingtheoutputtobepredictedbythesearched data-basedmodel.

Results obtainedwhen applyingthe Code2Vect techniqueare depictedin Fig.7, wherethe two imagesat thebottom representthevectorspacewithsamplescoloredaccordingtotheassociatedoutputs.Inthepresentcase,thescalaroutput consideredinthemappingconstruction(Eq. (8))consistsofthenormofthevectorscontainingbothfrequencies,i.e.with ti

= (

ω

1

i

,

ω

2i

)

,

|

O

i

O

j

|

= 

ti

tj



.

From Fig. 7, we can concludethat section and volume (V and L) are inversely correlated with the output, because forexampleyellowcolorconcentratesonthe bottomwhenrepresentingtheoutputwhereasit localizesonthetopwhen referringtotheseparameters.Theoppositeoccurswhenconsideringthecross-sectionarea( A).Thesefindingsareinperfect agreementwiththeknownphysicsofaHelmholtzResonator,whoseresonancefrequency

ω

risgivenby

ω

r

=

c 2

π



A V L (6)

Now,withthemodelW

(

y

)

relatinggeometricalparametersandresonancefrequencies extracted,itspredictive capabili-tiesarebeingtested.Forthatpurposethemodelisextractedbyusingonly60% ofthe64geometricalvariantsandtherest ofthesamples(theremaining40% yj)usedforvalidatingthepredictions,thatis,forcomparingthepredictions,theoutput

tiatxj

=

W

(

yj

)

yjinterpolatedfromtheoutputtiattheneighboringsamplesxi.

Thepredictionsare depictedinFig.8; inyellowthesamplesusedastrainingandinbluethepredicted.Thedeviation withrespecttothediagonalrepresentsthepredictionerrorthatresults,asexpected,verylowforthesamplesusedinthe trainingandabitlarger,butlowerthan4% fortheothers.

From the discussion above, one expectsthat the best model isthe one relating thecombined parameter A

/

V L with

theresonancefrequencies, insteadoftheone that weconsidered abovethat considered thegeometricalparameters indi-vidually,thatis, L, A,andV .Inordertoprovethesuperiorityofmodelingbyconsidering thebestparametrization, Fig.9

compares the previous prediction withthe one obtained by using thecombined parameter A

/

V L that produced almost perfectpredictions.

However,thephysicsbehinddataisnotalwaysknowntoinformthebestchoiceofparameters.Aprocedurefor access-ing to thosemostrelevant combinedparameters constitutesone ofour mainresearch activities still in progress.Forthe moment,andinabsenceof“apriori”information,parameterswillbeselectedandincorporatedindividuallyinthemodel.

4. Conclusions

(8)

Fig. 8. Prediction of the target.

Fig. 9. Predicting the output.

spacethatiskeptlowdimensionalforfacilitatingvisualization.Itallowssupervisedclassificationbutalsomakingprediction from itsnonlinear regressionnature. Moreover, itenablesa very simplerepresentation emphasizingthe relevance ofthe parametersinvolvedinthemodel.Itsperformancesinthelow-datalimithavebeenproved.

Acknowledgements

TheauthorsgratefullyacknowledgetheANR(“Agencenationaledelarecherche”,France)foritssupportthroughproject AAPG2018DataBEST.AIRBUSGroupthroughthePhDCIFREprogramandESIGroupthroughtheESI-ENSAMChairprogram.

Appendix A. Nonlinearmodeling

BydenotingWi

=

W

(

yi

)

,theNewtonproceedsatiterationn

Wn+1,i

=

Wn,i

+ 

Wi

,

(7)

thatreplacedinthenonlinearcounterpartofEq. (2)

(

Wiyi

Wjyj

)

· (

Wiyi

Wjyj

)

= 

xi

xj



2

= |

O

i

O

j

|,

(8)

thelefthandmemberreads

(

yi

(

Wn,i

+ 

Wi

)

yj

(

Wn,j

+ 

Wj

))((

Wn,i

+ 

Wi

)

yi

− (

Wn,j

+ 

Wj

)

yj

)

=

yi

(

Wn,i

+ 

Wi

)(

Wn,i

+ 

Wi

)

yi

+

yj

(

Wn,j

+ 

Wj

)(

Wn,j

+ 

Wj

)

yj

2yj

(

Wn,j

+ 

Wj

)(

Wn,i

+ 

Wi

)

yi

.

(9)

SinceeachtermappearinginEq. (9) presentsthesamestructure,wedevelopthelinearizationforoneoftheseterms:

(9)

andsimilarlyfortheothertermsinEq. (9).

Thenonlinearmodelisapproximatedaccordingto



Wi

=

K



k=1



W

k

P

k

(

yi

).

(11)

Eq. (8) concernsdatayiandyj.Hence,byapplyingthesameprocedureforanydatapair,analgebraicsystemofequations

resultsinvolvingthe N

×

d

×

K unknowns. Thus,the safeconstructionofthe non-linear mapping requiresarich enough sampling,i.e.

N

×

d

×

K

<

P 2

2

P (12)

Thenonlinearapproximation (11) wasaccomplishedbyusingradial basisbasedoneitherthegaussian orthe inverse-quadraticradialkernels,bothmakinguseofthedistancer

rk

= ||

y

yk

||,

(13)

fromwhichbothinterpolantsread

P

k

(

r

)

=

e−(rk) 2

,

(14) and

P

k

(

r

)

=

1



1

+

e−(rk)2

,

(15)

with



affectingtheshapeoftheradialkernel. References

[1]S.Roweis,L.Saul,Nonlineardimensionalityreductionbylocallylinearembedding,Science290 (5500)(2000)2323–2326.

[2] E.Lopez,D.Gonzalez,J.V.Aguado,E.Abisset-Chavanne,E.Cueto,C.Binetruy,F.Chinesta,Archivesofcomputationalmethodsinengineering,Int.J. Numer.MethodsEng.25 (1)(January2018)59–68,https://doi.org/10.1007/s11831-016-9172-5.

[3]L.VansDerMaaten,G.Hinton,Visualizingdatausingt-SNE,J.Mach.Learn.Res.9(2008)2579–2605.

[4]A.Criminisi,J.Shotton,E.Konukoglu,DecisionForestsforClassification,Regression,DensityEstimation,ManifoldLearningandSemi-Supervised Learn-ing.MicrosoftResearchtechnicalreport,TR-2011-114,2011.

[5]J.Greenhalgh,M.Miermehdi,TrafficsignrecognitionusingMserandRandomforests,in:20thEuropeanSignalProcessingConference,2012.

[6]J.Schmidhuber,Deeplearninginneuralnetworks:anoverview,NeuralNetw.61(2015)85–117.

[7]L.Yann,B.Yoshua,H.Geoffrey,Deeplearning,Nature521(2015)436–444.

[8]J.Donahue,L.A.Hendricks,M.Rohrbach,S.Venugopalan,S.Guadarrama,K.Saenko,T.Darrell,Long-termrecurrentconvolutionalnetworksforvisual recognitionanddescription,IEEETrans.PatternAnal.Mach.Intell.39 (4)(2017)677–691.

[9]C.Dong,C.C.Loy,K.He,X.Tang,Imagesuper-resolutionusingdeepconvolutionalnetworks,IEEETrans.PatternAnal.Mach.Intell.38 (2)(2016) 295–307.

[10]Y.Bengio,R.Ducharme,P.Vincent,C.Jauvin,Aneuralprobabilisticlanguagemodel,J.Mach.Learn.Res.3(2003)1137–1155.

[11]P.Norvig,S.Russell,ArtificialIntelligence,aModernApproach,Prentice-Hall,1994.

[12]T.Mikolov,I.Sutskever,K.Chen,G.Corrado,J.Dean,Distributedrepresentationofwordsandphrasestheir compositionality,in:ProceedingNIPS’13 Proceedingsofthe26thInternationalConferenceonNeuralInformationProcessingSystems,vol. 2,2013.

[13]M.Abadi,etal.,GoogleResearchLarge-ScaleMachineLearningonHeterogeneousDistributedSystems,TechnicalReport,2015.

[14] R.Ibanez,E.Abbisset-Chavanne,A.Ammar,D.Gonzalze,E.Cueto,J.-L.Duval,F.Chinesta,Amultidimensionaldata-drivensparseidentification tech-nique:thesparsepropergeneralizeddecomposition,Complexity(2018)5608286,https://doi.org/10.1155/2018/5608286.

Références

Documents relatifs

Les pertes des grandes banques suisses en raison de leur banque d’investissement ont eu pour conséquence une dégradation de la réputation et de l’image de la banque mais cela

Again I ask you questions to each of which your answer is yes or no, again you are allowed to give at most one wrong answer, but now I want to be able to know which card you

prognosis of intracranial haemorrhage in patients on oral anticoagulant or antiplatelet drugs. Prior

analysis of cluster randomized controlled trials with a binary outcome: the Community 703. Hypertension Assessment

Such operations require prepared input data collections (tagged and indexed), clever data colocation across processing nodes and results storage (cf. Depending on

We apply the random neural networks (RNN) [15], [16] developed for deep learning recently [17]–[19] to detecting network attacks using network metrics extract- ed from the

In this study, we propose a new classifier based on Dempster-Shafer (DS) theory and deep convolutional neural networks (CNN) for set-valued classification, called the evidential

The different mentioned existing DL-based methods, which are dedicated for data completion and correction to improve the indoor localization accuracy, are based on unsupervised