HAL Id: hal-02982969
https://hal.archives-ouvertes.fr/hal-02982969
Submitted on 5 Nov 2020
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of
scientific research documents, whether they are
published or not. The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
Data-Mining for analysis of semi-structured messages
from application execution
Oihana Coustié
To cite this version:
Oihana Coustié. Data-Mining for analysis of semi-structured messages from application execution.
Research Summer School on Statistics for Data Science (S4D 2018), Jun 2018, Caen, France.
⟨hal-02982969⟩
Oihana Coustié
Pursuing a PhD in Computer Science
With the collaboration of Airbus Operations (Supervisor: Xavier Baril) and the Université Toulouse I (Supervisors: Professor Josiane Mothe, Professor Olivier Teste)

Context
During their execution, information system applications generate a large amount of messages in textual format, which can be considered as semi-structured data.

Industrial Challenge

Automated root cause analysis
When an error message is found, support teams are asked to provide a precise explanation, which means manually performing a root cause analysis. This is not trivial due to the high interference between applications.
Learning: database of nominal log features (frequency, volume, periodicity, …)
Anomaly detected (e.g. error message reported) → Request to evaluate current database state → Comparison to nominal state features → …

New version changes analysis
Each time the information system is updated, the new version is tested, which generates test logs. Integration teams are interested in knowing whether the new version reacts differently to the same test session.
Information system version upgrade → Test session unchanged → Comparison with previous test logs, generated from the same test session on the previous IS version

Thesis goal
Use data mining, especially on time series, to:
- automatically perform root cause analysis
- detect changes in the test session reactions of the system
Diagram labels: Extract numerical values from log messages; Time series; Representation [][][]; Rule discovery on time series; Temporal patterns/rules; Find statistically significant features

State of the art on time series
Data mining on time series: mining tasks []
Pattern discovery [], Rule discovery [][][], Clustering [][], Classification [], Segmentation [], Visualisation [][], Similarity measure [][], …

[] Aghabozorgi et al., Information Systems
[] Das et al., KDD
[] Ding et al., Proceedings of the VLDB Endowment
[] Esling and Agon, ACM Computing Surveys (CSUR)
[] Fu, Engineering Applications of Artificial Intelligence
[] Gaber et al., ACM SIGMOD Record
[] Han et al., IEEE
[] Han et al., Proceedings of the …th International Conference on Data Engineering
[] …
[] …
[] Keogh et al., Knowledge and Information Systems
[] Keogh et al., World Scientific
[] Last et al., IEEE
[] Liao, Pattern Recognition
[] Van Wijk and Van Selow, IEEE
[] Weber et al., InfoVis
[] Xing et al., ACM SIGKDD Explorations Newsletter

Data presentation: key figures
Two different platforms: … logs each day for each aircraft, … logs each hour when the system is on. Both receive messages either from the execution of their own services or from the applications they host.
Figure: Number of message types for each application (axes: Applications, Message types). Each application or service can send up to … different types of messages. Some of those types contain numerical values and can be converted to time series.
Figure: Number of messages received per hour and application. There is a high imbalance in terms of message quantity among the different applications hosted on the system: some send only … message per hour while others can send up to … messages, which makes it difficult to produce a general automated solution.

Research question
- Extract both numerical series and context information from semi-structured messages
- Apply rule discovery methods on time series or find new time-adapted algorithms …
- Use similarity measure studies to compare two consecutive versions of a semi-structured dataset …
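
The first step named on the poster, turning semi-structured messages into numerical time series and per-hour message volumes, could be sketched as follows. This is only an illustration under stated assumptions: the message layout (ISO timestamp, application name, free text with `key=value` pairs) is hypothetical and not the real log format, and the names `LINE_RE`, `NUM_RE`, `parse_line`, `extract_series` and `hourly_counts` are invented for the example.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical message layout (an assumption, not the real log format):
#   <ISO timestamp> <application> <free text with key=value pairs>
LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+(.*)$")
# Numeric key=value pairs only; "status=OK" or "id=12ab" are ignored.
NUM_RE = re.compile(r"(\w+)=(-?\d+(?:\.\d+)?)\b")


def parse_line(line):
    """Return (timestamp, application, text), or None for a malformed line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp, app, text = match.groups()
    try:
        return datetime.fromisoformat(stamp), app, text
    except ValueError:
        return None


def extract_series(lines):
    """Build one time series per (application, field) from numeric values."""
    series = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        when, app, text = parsed
        for key, value in NUM_RE.findall(text):
            series[(app, key)].append((when, float(value)))
    for points in series.values():
        points.sort()  # chronological order within each series
    return dict(series)


def hourly_counts(lines):
    """Count messages per (application, hour), a simple volume feature."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        when, app, _ = parsed
        hour = when.replace(minute=0, second=0, microsecond=0)
        counts[(app, hour)] += 1
    return counts
```

Calling `extract_series` on a list of raw messages yields a dictionary of chronologically sorted `(timestamp, value)` pairs, and `hourly_counts` gives the message volume per application and hour. Counts like these are one possible way to express the nominal log features (frequency, volume) mentioned under the industrial challenge.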