Acoustic scene classification: contributions to fundamental and applied research


2017 ENST 0069

EDITE - ED 130

ParisTech Doctorate THESIS

submitted to obtain the degree of Doctor awarded by

TELECOM ParisTech

Specialisation: Digital Security

publicly presented and defended by

Daniele Battaglino

on 13 December 2017

Acoustic scene classification:

contributions to fundamental and applied research

Thesis supervisor: Nicholas EVANS

Thesis co-supervisor: Ludovick LEPAULOUX

Jury

Mr Tuomas VIRTANEN, TUT, Tampere, Finland: Reviewer
Mr Emmanuel VINCENT, Inria, Nancy, France: Reviewer
Mr Bernard MERIALDO, EURECOM, Biot, France: Examiner
Ms Christelle YEMDJI, Renault Software Labs, Sophia Antipolis, France: Examiner

TELECOM ParisTech


EDITE DOCTORAL SCHOOL
DOCTORAL THESIS

ACOUSTIC SCENE CLASSIFICATION: CONTRIBUTIONS TO FUNDAMENTAL AND APPLIED RESEARCH

Author:

Daniele Battaglino

Supervisor EURECOM:

Prof. Nicholas Evans

Supervisor NXP semiconductors:

Dr. Ludovick Lepauloux

A thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy from

EURECOM – Telecom ParisTech, 13 December 2017


Daniele Battaglino: Acoustic scene classification: contributions to fundamental and applied research

ABSTRACT

Acoustic context information may be used by microphone-equipped devices in order to adapt their behaviour or configuration according to a particular scenario. Recognition of such scenarios according to the acoustic context is the goal of acoustic scene classification (ASC). The choice of audio sensors, instead of alternatives (e.g. motion or light sensors), is a natural one; almost all mobile and smart devices are equipped with at least one microphone.

Almost all previous solutions to ASC rely on feature extraction approaches designed specifically for speech and music genre recognition and are thus not necessarily optimal for ASC. Further limitations of existing solutions relate to the requirements for real-time and low footprint implementations. These requirements must be met in order that ASC algorithms can be developed for low power, always listening devices.

The work reported in this thesis aims to address these limitations and hence to reduce the gap between academic and industrial research in terms of methods, protocols and metrics.

Accordingly, this thesis presents the ASC problem from a dual perspective. It includes contributions to both fundamental research, which concern standard protocols and methods, and applied research, which concern the adaptation of current methods to ‘real-world’ applications.

The main contributions of the work include: (i) the design of ASC-tailored features which exploit spectro-temporal patterns from spectrograms using local binary pattern analysis; (ii) techniques for the automatic extraction of the most discriminative spectro-temporal patterns through the application of convolutional neural networks; (iii) the collection of a large database of realistic, low-quality audio recordings to support work in ASC; (iv) the implementation of an always-listening, low-complexity ASC system; and (v) the first investigation of ASC in an open-set scenario, a new classifier tailored to open-set classification and new protocols and metrics for the assessment of open-set ASC.

The work presented in this thesis demonstrates that greater synergy between fundamental and applied research must become the standard pathway to future work with a view to creating practical, usable ASC techniques.

RÉSUMÉ

Acoustic context information can be used by microphone-equipped devices to adapt their behaviour or configuration to the scene taking place. Recognising such scenarios from the acoustic context is the goal of acoustic scene classification (ASC). While audio sensors are a natural choice for this purpose, restricting the work to them is justified by the fact that almost all mobile devices are equipped with at least one microphone, which is not the case for other types of sensor (for example motion or light sensors).

Most ASC solutions rely on feature extraction algorithms designed specifically for speech and music recognition, and are therefore not necessarily optimal when applied to ASC. Moreover, few approaches take into account the requirements of a real-time implementation together with low-complexity constraints. Yet these requirements must be met if ASC algorithms are to be ported to battery-powered, always-listening devices.

The work presented in this thesis aims to fill these gaps and thus to reduce the distance between academic and industrial research in terms of methods, protocols and metrics. Accordingly, this thesis reformulates the ASC problem from two perspectives. From the standpoint of fundamental research, a first part reports contributions to standard protocols and methods. A second part deals with applied research and describes contributions to the adaptation of current methods to real-world applications.

The main contributions of this work include: (i) the design of ASC-tailored features which exploit spectro-temporal patterns, computed from spectrograms to which a local binary pattern (LBP) analysis is applied; (ii) techniques for the automatic extraction of the most discriminative spectro-temporal patterns through the application of convolutional neural networks; (iii) the collection of a large database of recordings of everyday sound scenes; (iv) the implementation of an always-listening, low-complexity ASC system; and (v) the first use of ASC algorithms in an open-set classification scenario, the description of a new classifier suited to open-set classification, and new protocols and metrics for the evaluation of ASC in open-set problems.

The work presented in this thesis demonstrates that greater synergy between fundamental and applied research must become the standard route for future work, with a view to creating practical and usable ASC solutions.


“The inferno of the living is not something that will be; if there is one, it is what is already here, the inferno where we live every day, that we form by being together. There are two ways to escape suffering it. The first is easy for many: accept the inferno and become such a part of it that you can no longer see it. The second is risky and demands constant vigilance and apprehension: seek and learn to recognize who and what, in the midst of inferno, are not inferno, then make them endure, give them space.”

-- Italo Calvino, Invisible Cities

ACKNOWLEDGMENTS

Firstly, I would like to express my gratitude to my supervisors, Nick and Ludovick, for their continuous support during my PhD, and for their valuable advice and patience. Their guidance helped me throughout the thesis and its final writing. Special thanks go to my managers at NXP, JC and Laurent, who believed in me and created the best conditions for the completion of my PhD.

Besides my supervisors, I would like to thank the rest of my thesis committee: Prof. Virtanen, Prof. Vincent, Prof. Merialdo and Dr. Yemdji, for their insightful comments and encouragement. I would also like to thank all my colleagues at EURECOM and NXP for making me feel part of a team; in particular, my NXP teammates Rafael, Adrien and Giacomo, who worked with me and gave me useful insights for my research. Thanks also to my other colleagues at NXP for their kindness: Aurelie, Sebastien, Jean-Marc, Thomas, Guilleme, Fabrice, Jean and Alex. I would also like to thank the guys from the audio and speech group at EURECOM: Pramod, Pepe, Hector, Massimiliano, Giovanni and Leela.

Special thanks go to the friends I met during these three years: Stefano, Alberto, Cedric, Danja, Julie and Marco. They encouraged me to do my best and were always there for me.

Words cannot express my gratitude to my family: my mother Lucilla, my father Gainbeppe and my sister Agnese. Special thanks go to Chiara, my girlfriend, and to her parents Gianna and Davide. They all supported me throughout the writing of this thesis and in my life.


List of Figures

Figure 1 Machine listening research areas
Figure 2 Thesis mind map
Figure 3 ASC main blocks
Figure 4 ASC contributions timeline
Figure 5 The 5-fold protocol
Figure 6 Linear vs mel-scale filter banks
Figure 7 Mel power spectrum and MFCC over stationary and speech content
Figure 8 DCASE 2013 accuracies and CIs
Figure 9 Similarity matrix of 40 consecutive MFCCs
Figure 10 Effect of σ in the Gaussian kernel
Figure 11 Different grid-search strategies
Figure 12 Grid-search accuracies
Figure 13 Examples of feature tuning
Figure 14 Confusion matrix of MFCC+RQA-900 and MFCC+RQA-8000
Figure 15 Accuracy as a function of segment lengths
Figure 16 t-SNE visualisation
Figure 18 t-SNE embeddings for MFCC+RQA-900Hz
Figure 19 Fisher score for each feature
Figure 20 Bhattacharyya distance as a function of class combinations
Figure 21 NXP visualization
Figure 22 RMS distributions
Figure 23 RMS-based features
Figure 24 BER features
Figure 25 An illustration of the entire system, as explained in Section 5.2.1: (1) LBP histogram generation for each sub-band; (2) codebook creation, through clustering; (3) histograms in (1) are mapped to the codebook, which is repeated for each histogram extracted from each block; (4) SVM training and testing using the histogram of acoustic patterns.
Figure 26 From spectrogram block to LBP histogram
Figure 27 The effect of interpolation
Figure 28 A toy problem example for LBP
Figure 29 LBP patterns from toy problem
Figure 30 Codebook words
Figure 31 The codebook histograms for a bus scene (a) and a restaurant scene (b) for the DCASE 2013 evaluation set. The codebook words are depicted in Fig. 30.
Figure 32 LBP results on DCASE 2013
Figure 33 LBP features robustness to different gains
Figure 34 An example of CNN architecture
Figure 35 CNN input data
Figure 36 Details of the convolutional layer
Figure 37 Details of the pooling layer
Figure 38 DCASE 2016 protocol
Figure 39 CNN implementation details
Figure 40 Insights over the first convolutional layer
Figure 41 t-SNE visualization of the intermediate outputs of the proposed CNN
Figure 42 DCASE 2016 main trends
Figure 43 DCASE 2016 results on evaluation set
Figure 44 The real-time MFCCs extractor
Figure 45 Tandem estimator mechanism
Figure 46 Tandem estimator and recursive estimator adaptation on a varying signal
Figure 47 Reduced complexity ASC system
Figure 48 Diagram of data reduction using K-means clustering
Figure 49 Silhouette values as a function of different K clusters
Figure 50 Data decimation techniques comparison
Figure 51 The universe of acoustic classes
Figure 52 Openness plot
Figure 53 Kernel width vs SV
Figure 54 False positive and false negative of a binary confusion matrix
Figure 55 Estimation of false negative
Figure 56 λ_radius and λ_AUC comparison
Figure 57 SVM-SVDD comparison
Figure 58 AUC for Rouen dataset
Figure 59 Individual class AUC for different features
Figure 1 Thesis mind map
Figure 2 DCASE 2013 accuracies and CIs
Figure 3 From spectrogram block to LBP histogram: starting from the top-left corner of the image, the spectrogram block is analysed using LBP8,2 with 8 neighbours and a radius of 2; the local binary code is then generated; finally, the corresponding bin of the histogram is updated with the binary code.
Figure 4 The curve shows the mean accuracy with 95% confidence intervals (CIs) over a 5-fold cross-validation for the DCASE 2013 dataset. Blue circles show the values for the evaluation set, whose baseline is also shown as a blue line; red stars show the values for the development set, with the baseline shown as a dashed red line. With the exception of the baseline and RNH, the other systems were proposed in this work.
Figure 5 An example of a CNN architecture studied in this work: the input is a 2-channel static and dynamic spectrogram. It is followed by two stacked convolution and pooling layers. The fully connected and output layers produce the probabilities of the input data belonging to each acoustic class.
Figure 6 Results on the DCASE 2016 evaluation set. The baseline system has an overall accuracy of 77.2% and is indicated by a solid blue line. System names follow the naming of the challenge submissions. In solid red, the CNN-based systems Battaglino_1 and Battaglino_2.
Figure 7 Plots of the area under the receiver operating characteristic curve (AUC) against openness for (a) the DCASE 2013 evaluation set and (b) the Rouen 2015 dataset, for the SVM (blue dashed profiles) and SVDD (red profiles) classifiers. The standard deviation is illustrated by vertical bars.

List of Tables

Table 1 DCASE 2013 submission list
Table 2 RNH re-implementation
Table 3 The effect of the window length on MFCC ∆ and ∆∆ derivatives.
Table 4 Fisher score F for different ASC systems
Table 5 Duration of recordings for each context in the NXP database, alongside the associated meta-tag options.
Table 6 Energy-based feature accuracies
Table 7 The accuracy and confidence intervals (±CI) for the DCASE 2013 development dataset as a function of codebook size, obtained with k-means clustering. Best results in bold.
Table 8 Accuracies computed over different datasets
Table 9 Hyperparameter selection, based on performance on the DCASE 2016 development set
Table 10 ASC performance for the DCASE 2016 development (dev) and evaluation (eval) sets
Table 11 Standard vs real-time estimation of mean and standard deviation over different segment lengths. Results refer to the DCASE 2013 evaluation set.
Table 12 Reduction for DCASE evaluation set
Table 13 Reduction for NXP dataset
Table 14 Examples of openness for two standard
Table 15 Influence of kernel width on samples distance
Table 2 DCASE 2013 submission list



MATHEMATICAL NOTATIONS

x            vector
X            matrix
X⁻¹          inverse of matrix
X^T          transpose of matrix
x_i          i-th element of single vector x
x_n          n-th vector of a set
x_n,i        i-th element of vector x_n
x^(t)_n,i    i-th element of vector x_n at time t
x̂            estimated vector x
x⁰           normalised or scaled vector x
x[n]         discrete signal
X[k]         Fourier transform of x[n]
Pr(x)        probability of random variable x
Pr(x,h)      joint probability
Pr(x|h)      conditional probability of random variable x given another random variable h
Pr(x;h)      conditional probability of random variable x given fixed parameters h
F{.}         Fourier transform
F⁻¹{.}       inverse Fourier transform
|.|          absolute value
||.||        Euclidean norm
||.||_p      p-norm
α^∗          optimal value


FIXED SYMBOLS

X       set of sample vectors
N_c     number of samples of class c
TP_c    number of correctly predicted samples of class c
µ       mean of all samples
Σ       covariance of all samples
µ_c     mean of samples of class c
Σ_c     covariance of samples of class c
θ       set of model parameters for all classes
θ_c     set of model parameters or function for class c
X_c     set of sample vectors of class c
X̃       decimated set of sample vectors of class c
l       loss function
J       cost function
F       Fisher score
D_B     Bhattacharyya distance

ACRONYMS

ASA auditory scene analysis

CASA computational auditory scene analysis

ASC acoustic scene classification

VAD voice activity detection

ASR automatic speech recognition

NLP natural language processing

MIR music information retrieval

AED audio event detection

DCASE detection and classification of acoustic scenes and events

MAP mean average precision

MFCC mel frequency cepstral coefficient

SVM support vector machine

RBF radial basis function

GMM Gaussian mixture model

EM expectation-maximization

HMM hidden Markov model

MP matching pursuit

UBM universal background model

KNN k-nearest neighbours

FFT fast Fourier transform

DCT discrete cosine transform

DWT discrete wavelet transform

RQA recurrence quantification analysis

PCA principal component analysis

HOG histogram of oriented gradients

BER band energy ratio

DTs decision trees

CI confidence interval

KKT Karush-Kuhn-Tucker

SVs support vectors

SNE stochastic neighbor embedding

t-SNE t-distributed stochastic neighbor embedding

KL Kullback-Leibler

DNN deep neural network

RMS root mean square

LDA linear discriminant analysis

LBP local binary patterns

BoF bag of features

CNNs convolutional neural networks

MLPs multi layer perceptrons

RNNs recurrent neural networks

GD gradient descent

NAG Nesterov accelerated gradient

ReLU rectified linear unit

NMF non-negative matrix factorization

VC Vapnik-Chervonenkis

SVDD support vector data description

BSVs boundary support vectors

ROC receiver operating characteristic

AUC area under the curve

To Chiara and Ada, treasure chests of the little things.

This work was entirely funded by NXP Semiconductors within the terms of the industrial PhD contract – convention CIFRE n. 2014/0356.


1

INTRODUCTION

Imagine closing your eyes for a moment and listening carefully to the sounds in your immediate surroundings. You may recognise specific sounds such as footsteps, air conditioning, passing cars or perhaps voices. Even in the absence of visual cues, humans can usually identify events and sound sources from acoustic cues alone. These acoustic cues provide information about objects which are not within the listener’s field of vision. The research presented in this thesis focuses on the recognition of acoustic scenes by machines.

The choice of acoustic cues to recognise the surrounding environment is driven by the omnipresence of microphones in smartphones, internet of things devices, wearables and hearing aids. While some devices are equipped with multiple, heterogeneous sensors (examples include light sensors, gyroscopes and accelerometers), acoustic sensors are the most widely used in practice. Furthermore, there is evidence [1] that context recognition using acoustic cues gives better performance than using accelerometer measurements alone. In any case, acoustic and other cues are complementary in a fusion framework.

Acoustic scene classification (ASC) aims to categorise the environment in which a device is used. The problem of recognising acoustic scenes is particularly pertinent in the case of mobile devices, given their use in multiple situations throughout the course of a typical day. For instance, the ringer volume of a smartphone might be adjusted according to whether the user is on a bus, in an office or at home.

The motivation for this work stems from the continuous demand for advanced functionality in which the device configuration adapts automatically to the situation or context. Moreover, the industrial nature of this PhD has shaped the directions of research. With ASC being a recent area of study, there still exists a gap between academia and industry in terms of problems, solutions, protocols and metrics; there are clear differences between lab evaluation and performance in the field. This dichotomy accounts for the structuring of this thesis in two parts: one linked to fundamental research, the other related to applied research. The final goal is to design a robust ASC system which analyses and classifies acoustic scenes in real time on low-power devices.

This introduction is structured as follows: a definition of ASC is presented in Sec. 1.1, together with a discussion of the relationship of ASC with other domains in Sec. 1.2; examples of practical use cases are listed in Sec. 1.3; Sec. 1.4 discusses the motivations and goals of this research; Sec. 1.5 details the research contributions and peer-reviewed publications and gives a detailed outline of the thesis.

1.1 Acoustic scene classification

ASC is the task of classifying a global scene according to ambient sounds. A scene refers to a high-level semantic concept such as car, park or office. ASC is a difficult task for both humans and machines without any other cues (e.g. visual). The labelling of a scene is not always clear and is open to interpretation in terms of taxonomy. For example, different people may describe the same scene with different high-level semantic concepts: on the one hand, some distinctions are impossible to make from sounds alone (e.g. some cars sound like buses when only engine noise is present); on the other, quiet and noisy streets may be labelled under a more general street concept even though they may not share common acoustic characteristics. One of the first definitions of ASC has its origins in the psycho-acoustical studies of soundscapes [2]. As with visual landscapes, soundscapes are composed of ambient background noises in addition to descriptive foreground sounds. The scene is therefore a composition of background noise and foreground sounds.

Even though many computational approaches are inspired by perceptual research, there exists a notable distinction between these studies which aim to understand the human cognitive process [3] and how a machine perceives and detects sounds. The question "Do machines hear as we do?" exemplifies the discrepancy between human and machine sound perception. As an example, differences in perception are introduced immediately through different microphone characteristics (directivity, sensitivity, etc.). These and other such differences may lead to a representation far from that of the human auditory system.

1.2 ASC in the realm of machine listening

Perceptual studies [3,4,5] influenced the definition of an acoustic scene, which can benefit from prior research in other related domains such as speech recognition or music genre identification. Each of these domains focuses on a specific audio-related problem, even though they share common audio processing and classification techniques. More generally, these domains are part of a broader area of research, called machine listening, which tries to mimic the human auditory system as a whole with machines.

As with the human auditory system, machines replicate a hierarchical process going from audio samples to a meaningful description: the audio is represented (e.g. as a spectrogram), organized (e.g. by source separation), detected and classified. The vast majority of current machine listening domains (e.g. speech recognition, music genre identification, acoustic scene classification) can be interpreted according to this scheme.

Even so, the relationship between ASC and other machine listening domains appears somewhat blurred. Inspired by original work [6], current machine listening domains can be split into simpler tasks, as illustrated in Fig. 1:

• detection, the segmentation of useful information within a longer sequence;

• classification, the association of a label with the segmented information;

• description, the creation of high-level semantic information from the classification (e.g. from genre classification to music recommendation systems).

Following this vision, for instance, the speech domain would be split into voice activity detection (VAD) [7] (detection) followed by automatic speech recognition (ASR) [8] (classification) and then natural language processing (NLP) [9] (description) to give sense to the resulting sequence of words. The music domain would be split into music/speech separation (detection), music information retrieval (MIR) [10,11] (classification), which may determine the genre, and on top of that music recommendation (description). The ASC task fits the same formulation: the context is segmented according to some criteria, classified and then labelled to describe, for instance, a log of the different acoustic scenes encountered during a day. Similarly, audio events [12] are detected and classified before the complex scene is described as a mixture of overlapping sounds.
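The detection-classification-description formulation can be sketched in a few lines of Python. This is only an illustrative toy, not the system developed in this thesis: the energy-threshold detector, the rule-based `classify_segment` stand-in for a trained classifier, the function names and all numeric values are assumptions introduced for the example.

```python
import numpy as np


def detect_segments(signal, frame_len, threshold):
    """Detection: return (start, end) sample indices of runs of frames
    whose RMS energy exceeds the threshold."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    active = np.sqrt(np.mean(frames ** 2, axis=1)) > threshold
    segments, start = [], None
    for i, flag in enumerate(active):
        if flag and start is None:
            start = i                      # a new active run begins
        elif not flag and start is not None:
            segments.append((start * frame_len, i * frame_len))
            start = None                   # the active run has ended
    if start is not None:                  # run still open at end of signal
        segments.append((start * frame_len, n_frames * frame_len))
    return segments


def classify_segment(segment, loud_threshold=0.5):
    """Classification: hypothetical rule standing in for a trained model."""
    rms = np.sqrt(np.mean(segment ** 2))
    return "street" if rms > loud_threshold else "office"


def describe(signal, segments, sample_rate):
    """Description: turn the labelled segments into a readable scene log."""
    log = []
    for start, end in segments:
        label = classify_segment(signal[start:end])
        log.append(f"{start / sample_rate:.1f}s-{end / sample_rate:.1f}s: {label}")
    return log


# Synthetic signal: silence, a quiet segment, silence, a loud segment.
sample_rate = 100
signal = np.concatenate(
    [np.zeros(100), 0.2 * np.ones(200), np.zeros(100), 0.9 * np.ones(100)]
)
segments = detect_segments(signal, frame_len=50, threshold=0.1)
scene_log = describe(signal, segments, sample_rate)
print(scene_log)
```

Each stage only consumes the output of the previous one, which is why a detector or classifier from one machine listening domain can in principle be swapped in without touching the other stages.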

[Figure 1: grid of machine listening tasks, by level of abstraction (rows) and research area (columns).
Description: emotion & NLP (speech); music recommendation (music); scene description (acoustic scene); event logging (audio events).
Classification: automatic speech recognition (speech); music information retrieval (music); acoustic scene classification (acoustic scene); audio event classification (audio events).
Detection: voice activity detection (speech); speech/music (music); acoustic scene segmentation (acoustic scene); audio event detection (audio events).]

Figure 1: The machine listening realm is composed of different research areas which vary in the abstraction level of the task (i.e. detect, classify and describe). The y-axis shows the level of the task, while the x-axis shows the main research area.

Each domain shares the detection-classification-description formulation, together with methods and solutions to common problems. This helps to exploit knowledge and solutions from one domain in another by using, for instance, similar features or processing. As an example, ASC may exploit an audio event detection algorithm to better describe a scene in terms of audio events. At the same time, ASC may provide prior information about the scene, thereby reducing the number of possible events [12] (e.g. keyboard tapping is more probable in an office than in a street).
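One simple way to picture how a recognised scene can act as prior information for event classification is Bayes' rule: per-event likelihoods from an event classifier are multiplied by scene-conditioned priors and renormalised. The sketch below is an assumption-laden illustration rather than a method taken from this thesis; the function name `rescore_events` and all probability values are hypothetical.

```python
def rescore_events(event_likelihoods, scene_prior):
    """Combine per-event likelihoods with a scene-conditioned prior
    (posterior proportional to likelihood times prior)."""
    unnormalised = {
        event: likelihood * scene_prior.get(event, 0.0)
        for event, likelihood in event_likelihoods.items()
    }
    total = sum(unnormalised.values())
    if total == 0.0:
        raise ValueError("the scene prior excludes every candidate event")
    return {event: value / total for event, value in unnormalised.items()}


# Hypothetical numbers: the audio alone slightly favours a car horn,
# but the scene is known to be an office.
likelihoods = {"keyboard tapping": 0.4, "car horn": 0.6}
office_prior = {"keyboard tapping": 0.9, "car horn": 0.1}

posterior = rescore_events(likelihoods, office_prior)
best_event = max(posterior, key=posterior.get)
print(best_event, posterior)
```

With these illustrative values the office prior overturns the audio-only decision in favour of keyboard tapping, which is the effect described in the paragraph above.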

ASC is a relatively new topic in machine listening. The presence of consolidated machine listening domains (e.g. speech, music) initially allowed researchers to adapt existing methods to ASC. At the same time, the availability of a huge number of different methods has partially limited a broader discussion of the specifics of the ASC task.

1.3 Applications of ASC

Applications which can directly benefit from ASC encompass existing technologies from smartphones to hearing aids:

Context-aware devices include always-listening capabilities to adapt their behaviour to the surrounding situation [13]. Examples include the adaptation of the ringer volume according to whether the user is on a bus, in an office or at the cinema [14]. Evidence [15] shows that the capability to associate a behaviour with a context is particularly convenient for users. Another practical application is reported in [16], where wearable devices adjust the rate (or intensity) of notifications depending on the context. The cost of being distracted by a device may be high: imagine receiving many notifications in the car while driving, at a restaurant with other people or while crossing the street. The decision whether to notify the user, and how, should be made with consideration for the current context.

Listening robots use information about "where I am" to switch behaviour. Especially in high-mobility conditions, prior information about where the robot is located helps in defining the most appropriate actions to be performed [17]. For example, ASC may be used to change the speed of a robot according to whether it is located indoors or outdoors [18].

Automatic data tagging exploits context similarities to label audiovisual data automatically. There exists a huge amount of multimedia content that is neither segmented nor labelled, and whose manual tagging would be practically impossible. Combining video, image and acoustic scene information would make it possible to tag a huge amount of material automatically. This material could then be used to re-train ASC systems with larger datasets [19].


Hearing aids adapt their configuration to the user’s environment, such as a quiet office, restaurant or music hall. Current hearing aid solutions are tuned according to general acoustic environments and do not adapt quickly to changes in context [20]. ASC solutions could be used to improve audio quality and to enable context-based configurations.

In all of the above applications, ASC is essentially a preprocessing step which provides prior information to other systems. It can inform speech recognition engines on the type of acoustic noise to improve performance [21]; it can help noise-monitoring [22] or source separation systems [23]. In addition, different applications may fuse audio cues with other sensor information such as acceleration, pressure or light [24] to obtain more accurate and confident predictions of a context.

1.4 Motivations and goals

The investigation of ASC is motivated by many factors linked to the practical scope of this PhD: the research was driven by the aim of bridging the gap between fundamental and applied research. The main goal of this work is to deploy context-aware systems which can help users in their daily lives. Considering that context-aware algorithms are to be implemented on low-power devices, computational efficiency and real-time processing assume a strategic role. Other practical concerns include dealing with channel variation and adapting metrics and evaluation protocols.

The choice of focusing on embedded devices rather than full-power or cloud solutions is strategic for context awareness: unreliable data connections and the power implications of continually streaming audio to a remote server make cloud solutions impractical. Moreover, always-listening devices may impact user privacy by sending sensitive, personal information contained within audio recordings. Cloud solutions require the sharing of context information such as speech, music and other sound events, which can be used to track individuals and their activities [25]. Under these conditions, ASC approaches that run locally on the device have clear advantages.

1.5 Contributions

The structure of the thesis reflects the nature of the contributions regarding both fundamental and applied research. The outline is illustrated graphically by a mind map in Fig. 2. Fundamental research is the focus of Part 1 (to the left of Fig. 2), which describes the contributions made between the first public challenge on ASC in 2013 [6] and the second in 2016 [26]. The sequence of the chapters follows these two milestones chronologically, relating to the public DCASE challenges in 2013 and 2016. Applied research is the focus of Part 2 (to the right of Fig. 2), which deals with the practical implications of ASC in real-world scenarios. Contributions of this part include the adaptation of ASC solutions to work in a streaming fashion with reduced complexity.

The work reported in this thesis resulted in several publications:

• publication 1 (conference paper): "Acoustic context recognition for mobile devices using a reduced complexity SVM", 2015 IEEE European Signal Processing Conference (EUSIPCO);

• publication 2 (conference paper): "Acoustic context recognition using local binary pattern codebooks", 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA);

• publication 3 (workshop paper): "Acoustic scene classification using convolutional neural networks", 2016 IEEE Detection and Classification of Acoustic Scenes and Events challenge (DCASE);

• publication 4 (conference paper): "The open-set problem in acoustic scene classification", 2016 IEEE Workshop on Acoustic Signal Enhancement (IWAENC);

• publication 5 (conference paper): "Baby cry sound detection: a comparison of hand crafted features and deep learning approach", 2017 Springer Engineering Applications of Neural Networks conference (EANN);

• publication 6 (patent): "Acoustic Context Recognition using Local Binary Pattern Method and Apparatus", US Patent App. 15/141,942;

• publication 7 (patent): "Embedded car detector based on acoustic sensor", EU patent App. under approval.

Part 1 starts with Chapter 2, which describes the state of the art of ASC in 2013, at the time of the first public challenge in ASC. Together with the public challenge, a dataset was also released. Although this was a huge step towards the standardisation of the ASC task (data, protocols, evaluation metrics), standard methods were still based on features mainly designed for speech or music (e.g. mel frequency cepstral coefficients (MFCC)). The winning system of this challenge, in fact, estimates and models recurrent patterns in MFCCs. This system and its main limitations are discussed in Chapter 3, where a first baseline is also presented. Possible ways to evaluate and visualise audio features are presented in Chapter 4, which leads to the design of new features. To date, almost all existing approaches to ASC are based on traditional features designed for other domains, yet experiments show that these features may not be sufficiently discriminative for the ASC task.

Given the focus on ASC-tailored features, the complex acoustic structure of a scene is found to be represented by local spectro-temporal patterns, extracted directly from the spectrogram (publications 2 and 6). The idea of extracting spectro-temporal patterns is then exploited using a particular topology of deep neural networks, as reported in Chapter 6. This contribution (publication 3) was submitted and publicly evaluated within the context of the DCASE 2016 evaluation, whose main results and trends are presented in Chapter 7.

The outline of Part 2 is summarised as follows. Chapter 8 describes practical issues of ASC. The NXP dataset, while proprietary, is considered a contribution in the context of an industrial PhD. The data contained in this dataset can be used not only for ASC, but also for other related tasks (event detection, mixing speech with acoustic scene recordings to learn more robust models, etc.). Computational constraints in terms of complexity and memory are addressed in Chapter 8 with an additional contribution: a reduced complexity ASC system (publication 1).

One of the biggest limitations of current ASC systems is their restriction to closed-set problems. In practice, ASC applications are open-set in nature, where the number of classes encountered during evaluation is unbounded. Contributions include the proposal of a new, open-set approach to the evaluation of ASC solutions, as reported in Chapter 9. This contribution (publication 4) presents the ASC problem as one of acoustic scene detection, where a small number of known scenes are detected in a larger universe of unknown classes.

Conclusions in the final Chapter 10 collect thoughts and findings from fundamental (Part 1) and applied (Part 2) research, and describe ideas for future research.


[Figure: mind-map of the thesis. Common chapters: Introduction (what ASC is and why it is important); Conclusions (learnings and further research for ASC). Part 1, fundamental research: Literature (prior works and DCASE 2013 submissions); Top performing method (limitations of the state-of-the-art system); Features analysis (visualisation and measure of separability); Features investigation (energy-related features such as band energy ratio (BER)); Time-frequency patterns (applying the local binary pattern technique to ASC); Convolutional nets (automatic feature learning from time-frequency data); DCASE 2016 (ASC trends and DCASE 2016 submissions analysis). Part 2, applied research: Product constraints (implementing ASC in an embedded device); NXP dataset (collecting audio data from mobile phones); Low-complexity (reducing feature and model complexity); Real-time (feature extraction from a signal in a continuous way); Open-set scenario (testing robustness to unknown scenes). Legend: common chapters; state of the art and motivations; contributions to fundamental research; contributions to applied research; chapter number.]

Figure 2: Mind-map of the main blocks composing the thesis. The legend in the bottom-right corner helps to read the entire picture. Numbers in the upper-right corner of each block represent the chapter index.


Part I

FUNDAMENTAL RESEARCH


2

LITERATURE REVIEW

The ASC literature spans a period from the first work in 1997 to more recent work in 2014, the start of this thesis. This chapter offers a view of the historical background and the most influential methods in the ASC literature. As for the majority of machine learning systems, ASC solutions are composed of successive blocks. These blocks treat the audio input (preprocessing), extract a compact representation from it (feature extraction and feature post-processing), learn an inference model (classifier) and test performance on unseen samples (testing). This process is defined in Sec. 2.1. A timeline of ASC works between 1997 and 2013 is illustrated in Sec. 2.2. Sec. 2.3 gives a detailed description of the detection and classification of acoustic scenes and events (DCASE) 2013 database and the associated challenge. Works submitted to the DCASE 2013 challenge are then grouped according to these blocks: feature extraction in Sec. 2.4; feature post-processing in Sec. 2.5; classifier and testing in Sec. 2.6 and 2.7. A comparison of the performance of ASC systems is then presented and discussed in Sec. 2.8.

2.1 Main blocks of ASC

The task of recognising and classifying an acoustic environment is generally to assign a semantic label to a certain portion of an audio signal. The labelling of a generic acoustic scene is open to interpretation; a taxonomy shared by all researchers in this domain is, therefore, difficult to establish. Current approaches treat ASC as a supervised classification problem, where the taxonomy of possible categories is bounded and known in advance depending on the application (e.g. a hypothetical transport scene classifier may use a subset of categories such as bus, car, train and plane). Even though the majority of current methods use supervised classification, unsupervised alternatives have also been reported [27, 28, 29], where the scenes are deduced during processing (i.e. by clustering audio samples). These unsupervised approaches require a huge amount of data to extract the underlying data structure. Therefore, the lack of a common dataset initially blocked investigation in this direction.

In its supervised formulation, ASC does not differ from other standard machine learning problems [30], which combine specific domain knowledge (acoustic properties of a scene) with statistical inference over the data (model training and testing). Hence ASC can be split into simpler blocks, such as those illustrated in Fig. 3, whose details are listed below:

1. preprocessing transforms and prepares the acoustic wave for further processing (feature extraction and classification). Each acoustic wave is described as a variation of sound pressure across time. This sound pressure is measured by a digital microphone at a certain sampling frequency. The resulting digital signal is discrete in amplitude and time [31]. Examples of preprocessing operations include filtering, segmenting a long recording into equal-size clips or averaging a two-channel stereo signal;


[Figure: block diagram with a training phase and a testing phase. Blocks: preprocessing, feature extraction, feature post-processing, model learning (producing the model parameters θ), testing, performance metric.]

Figure 3: The main blocks composing an ASC solution: first the waveform is preprocessed, then a feature vector is extracted and post-processed. After that, a model is learned from all the feature vector samples and then tested on unseen data.

2. feature extraction has the role of describing an audio wave in a more compact way by reducing the number of dimensions needed to represent it. Standard approaches consist of splitting an audio input into frames of 20 ms over which features are calculated. Extracting features over these short-term frames ensures that each feature vector represents a statistically stationary segment of the original audio signal;

3. feature post-processing further enhances particular aspects of the original features. As an example, time derivatives of consecutive frame-based features can be added as additional information on the time evolution of an acoustic scene;

4. model learning recognises patterns in the feature space. Let $x$ be a continuous random variable whose value is the feature vector and $\theta_c$ the model of the $c$-th class. The goal of model learning is to "learn" the relation between $x$ and $\theta_c$. As we will see in the next sections, there exist different methods to estimate this relation: some of them aim to learn the underlying distribution of the training data; others aim to maximise the separability between class samples;

5. testing assigns a feature vector $z$ to the most likely class. We define the posterior probability $\Pr(\theta_c \mid z)$ as the probability of a class model $\theta_c$ given the feature vector $z$. All previous processing steps (preprocessing, feature extraction and feature post-processing) are applied to the test sample $z$. The predicted class $\hat{c}$ is the one that maximises the posterior, $\hat{c} = \arg\max_c \Pr(\theta_c \mid z)$, where the model $\theta_c$ is obtained from model learning. Consequently, each sample in the test set is assigned to one of $C$ classes;

6. performance metric estimates classification accuracy, defined as the ratio between the number of correctly predicted samples and the total number of predictions over all $C$ classes. The confusion matrix, instead, directly displays the misclassifications between classes. In the $C \times C$ confusion matrix, the element in the $c$-th row and $\hat{c}$-th column counts the samples of true class $c$ which have been predicted as $\hat{c}$. From the resulting confusion matrix, the global accuracy is found by summing the elements on the diagonal and dividing by the sum of all elements.
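The testing and performance-metric blocks can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the toy posterior values are hypothetical and do not come from any system described in this thesis.

```python
def predict(posteriors):
    """Return the class index c that maximises Pr(theta_c | z)."""
    return max(range(len(posteriors)), key=lambda c: posteriors[c])


def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Rows index the true class c, columns the predicted class c_hat."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for c, c_hat in zip(true_labels, predicted_labels):
        matrix[c][c_hat] += 1
    return matrix


def accuracy(matrix):
    """Sum of the diagonal divided by the total number of test samples."""
    correct = sum(matrix[c][c] for c in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Made-up posteriors for four test samples and C = 3 classes.
toy_posteriors = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.5, 0.3],
    [0.1, 0.1, 0.8],
]
toy_truth = [0, 1, 2, 2]
toy_predictions = [predict(p) for p in toy_posteriors]
toy_matrix = confusion_matrix(toy_truth, toy_predictions, num_classes=3)
```

In this toy case the third sample (true class 2) is predicted as class 1, which shows up as an off-diagonal count in row 2, column 1 of the matrix.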


[Figure: timeline of ASC works. Ellis '96; Couvreur et al. '96; Sawhney and Maes '97; El-Maleh '99; Peltonen et al. '01; Peltonen et al. '02; Eronen et al. '03; Eronen et al. '06; Chu et al. '08; Räsänen et al. '11.]

Figure 4: The main contributions in ASC before DCASE 2013.

ASC problems usually involve a large number of possible scenes ($C > 10$). Accuracy is the standard metric because it summarises the performance of a multi-class system with a single value. Nevertheless, it is heavily influenced by the balance between class samples. This is called the "accuracy paradox" [32] and affects unbalanced datasets.

In order to deal with this paradox, the mean average precision (MAP) [33] metric is preferred to standard accuracy. This variant of the metric calculates the global score as the mean of the single-class precisions:

$$ \mathrm{MAP}\,(\%) = \frac{100}{C} \sum_{c=1}^{C} \frac{TP_c}{N_c}, \tag{1} $$

where $C$ is the number of classes involved, $TP_c$ is the number of correctly predicted samples of the $c$-th class and $N_c$ is the total number of samples of that class. Consider an example with Class 1 ($N_1 = 10$, $TP_1 = 0$) and Class 2 ($N_2 = 1000$, $TP_2 = 1000$) and calculate the MAP and accuracy metrics: $\mathrm{MAP} = \frac{100}{2}\left(\frac{0}{10} + \frac{1000}{1000}\right) = 50\%$; $\mathrm{accuracy} = 100 \cdot \frac{0 + 1000}{10 + 1000} \approx 99\%$. MAP reports a less biased metric, while the standard accuracy gives an unrealistic measure of performance (Class 1 has no correct predictions).

MAP is the reference metric of system performance in the remainder of this thesis: it coincides with standard accuracy in the case of balanced datasets, while being less biased in the presence of unbalanced datasets.
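The two-class example can be checked numerically with a short Python sketch. The function names are illustrative assumptions; the counts for the unbalanced case are those of the example, and the balanced counts are made up to show that the two metrics then agree.

```python
def map_score(true_positives, class_sizes):
    """Eq. 1: mean over classes of TP_c / N_c, in percent."""
    ratios = [tp / n for tp, n in zip(true_positives, class_sizes)]
    return 100.0 * sum(ratios) / len(ratios)


def accuracy_score(true_positives, class_sizes):
    """Global accuracy: all correct predictions over all samples, in percent."""
    return 100.0 * sum(true_positives) / sum(class_sizes)


# Unbalanced case: Class 1 has N = 10, TP = 0; Class 2 has N = 1000, TP = 1000.
tp = [0, 1000]
n = [10, 1000]
unbalanced_map = map_score(tp, n)
unbalanced_accuracy = accuracy_score(tp, n)

# Balanced case with made-up counts: both metrics give the same value.
balanced_map = map_score([8, 6], [10, 10])
balanced_accuracy = accuracy_score([8, 6], [10, 10])
```

With equal class sizes the factor 1/N_c can be taken out of the sum in Eq. 1, which is why the two scores are identical in the balanced case.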

In conclusion, the vast majority of ASC methods follow the aforementioned structure, differing in the choice of preprocessing, features and classification methods. Even systems which may appear very different on the surface still fit this common interpretation.

2.2 Historical background of ASC

Contributions from the first work in 1997 until the DCASE challenge in 2013 are depicted on a timeline in Fig. 4. Several approaches have been proposed in the past to classify sounds and acoustic scenes, supported by psycho-acoustical studies [5]. One of the most relevant conclusions of these studies is that our auditory system relies on a sound memory capable of associating sounds with a meaningful environment. In light of this, Ellis [34] in ’96 proposed to describe an acoustic scene as a mixture of simpler building elements. In the same year, Couvreur et al. [34] investigated the automatic recognition of environmental noise sources (such as car, truck, plane) based on their global acoustic properties. This approach was further developed by El-Maleh et al. [35] in ’99 using spectral features and a Gaussian classifier.

The first method specifically addressing the ASC problem is described in a technical report by Sawhney and Maes in ’97 [36]. The authors recorded a small dataset composed of people's voices, subway, traffic and other classes. From these recordings they extracted features based on psycho-acoustical filters and employed a recurrent neural network classifier. They report a classification accuracy of 68% over 5 classes.

A few years later, in ’01, Peltonen et al. [37] showed that humans identify a scene through typical sound events, such as a click, a door slam or a car engine. Tests performed on 19 individuals showed an overall 70% classification accuracy over 25 classes. The large variation in accuracy between classes (from 32% to 100%) depends on the acoustic cues present in the scene: when the sounds in the scene are decisive in distinguishing one class from another, accuracy is higher. As expected, silent environments without prominent sounds do not carry sufficient information for classification. This leads to the conclusion that an ASC task needs longer excerpts on which to base its prediction, since the probability of finding prominent sounds increases over time.

Influenced by these cognitive studies, Peltonen et al. [38] in ’02 experimented with the recognition of 6 meta-classes built over 17 starting ones. Vehicle, for instance, is a meta-class comprising car, plane and bus. A second contribution correlates the classification accuracy with the duration of a scene. As expected, a classification integrated over a longer time contains more prominent information, as previously mentioned in [37]. An ideal length for stable classification results is therefore suggested to be 30-40 seconds of signal. Beyond these observations, the most relevant aspect of Peltonen's research was to apply MFCC and Gaussian mixture models (GMM) to the ASC problem for the first time, achieving 68% accuracy over 17 classes. The adoption of MFCC-GMM provided a baseline system for future research. Continuing Peltonen's experiments, Eronen et al. [39] in ’03 exploited the temporal evolution of the acoustic scene to improve the MFCC-GMM baseline system by using a 2-state fully connected hidden Markov model (HMM). This system was compared to the human ability to recognise 18 classes and 6 meta-classes (e.g. outdoor, vehicles, indoor, etc.). The recognition accuracy of the HMM system is 61% over 18 classes against 69% for human listening tests.

Another research axis questioned the scene taxonomy: what are the connections between everyday personal experience and collective assessment through a high-level linguistic concept? Dubois et al. [40] in ’06 investigated this association between high-level concepts and acoustic scenes. The research showed that individuals classify acoustic scenes on the basis of prior experience. To reinforce this perspective, a further study was conducted by Tardieu et al. [41] in ’08 on the human organisation of acoustic cues in increasing levels of abstraction. In the context of a rail station acoustic scene, they demonstrated that people use local acoustic cues (human activity) and global information (reverberation, intensity) to hierarchically construct an acoustic scene. The same idea was more recently proposed by Torija [42] in ’13: using 15 acoustic descriptors, an acoustic scene is composed from these building elements.

The definition of a suitable set of features for ASC became the subject of research for Chu et al. [43] in ’08. In their work, a new way of extracting features, called matching pursuit (MP), was applied: the audio signal is decomposed by selecting the closest basis from a dictionary previously created. Then each audio signal is represented as a linear combination of these dictionary atoms.
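The greedy decomposition behind matching pursuit can be illustrated with a minimal Python sketch. It is a generic textbook version, not the implementation of [43]: the dictionary is a hypothetical toy set of unit-norm atoms, and the stopping rule (a fixed number of atoms) is an assumption made for brevity.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matching_pursuit(signal, dictionary, num_atoms):
    """Greedy MP: repeatedly pick the unit-norm atom most correlated
    with the current residual and subtract its contribution.

    Returns a list of (atom_index, coefficient) pairs and the final residual.
    """
    residual = list(signal)
    decomposition = []
    for _ in range(num_atoms):
        correlations = [dot(residual, atom) for atom in dictionary]
        best = max(range(len(dictionary)), key=lambda j: abs(correlations[j]))
        coefficient = correlations[best]
        decomposition.append((best, coefficient))
        residual = [r - coefficient * a
                    for r, a in zip(residual, dictionary[best])]
    return decomposition, residual


# Hypothetical toy dictionary: the canonical basis of R^3 (unit-norm atoms).
toy_dictionary = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_signal = [0.5, -2.0, 0.0]
toy_decomposition, toy_residual = matching_pursuit(
    toy_signal, toy_dictionary, num_atoms=2
)
```

The selected atom indices and coefficients then serve as the compact description of the signal, i.e. the linear combination of dictionary atoms mentioned above.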

According to Räsänen et al. [1] in ’11, combining an audio classifier with acceleration data led to better context classification performance. Instead of fusing low-level sensory information (i.e. directly combining features coming from acoustic and acceleration sensors), only classification predictions are combined: the final prediction is a weighted sum of the single predictions coming from the audio and acceleration classifiers. A similar intuition was adopted for fusing visual and acoustic cues by Lee et al. [44] in ’12.
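A weighted-sum fusion of classifier predictions can be sketched as follows. The scores, the weight and the function name are illustrative assumptions; the weighting actually used in [1] is not reproduced here.

```python
def fuse_predictions(audio_scores, accel_scores, audio_weight=0.5):
    """Weighted sum of per-class scores from two independent classifiers."""
    accel_weight = 1.0 - audio_weight
    return [
        audio_weight * a + accel_weight * b
        for a, b in zip(audio_scores, accel_scores)
    ]


# Made-up per-class scores for three contexts.
audio = [0.6, 0.3, 0.1]
accel = [0.2, 0.7, 0.1]

# Trusting the audio classifier more (hypothetical weight of 0.75).
fused = fuse_predictions(audio, accel, audio_weight=0.75)
fused_class = max(range(len(fused)), key=lambda c: fused[c])

# With equal weights the acceleration classifier tips the decision.
equal = fuse_predictions(audio, accel, audio_weight=0.5)
equal_class = max(range(len(equal)), key=lambda c: equal[c])
```

The example shows why the choice of weights matters: the same pair of predictions yields a different fused class depending on how much each sensor is trusted.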

A fully hierarchical approach was proposed by Feki et al. [45] in ’11. In this top-down approach, each audio stream was classified into speech, music or environmental sounds. If the audio stream contained neither speech nor music, it was further classified according to the most probable acoustic scene. This approach decomposes a global classification problem into simpler sub-classification tasks, from high-level concepts down to single sound events.

In terms of reproducibility and comparability of results, the ASC domain was lacking a common dataset. Before 2013, each work mentioned above used a different dataset (with a different number of classes and recording conditions). The first DCASE dataset was released in 2013, associated with a public evaluation of ASC methods. Sec. 2.3 details the protocols and rules of this challenge. To summarise, the problems raised in this section anticipate those of the following chapters, in particular: i) the bottom-up or top-down strategy to solve an ASC task, the former initially expressed by Ellis [34] and the latter by Couvreur [22]; ii) the capacity of human listeners to distinguish different scenes (Peltonen [37], Eronen [39]); iii) the class taxonomy from Dubois [40]; iv) the temporal recurrence of acoustic scenes in Eronen [46]; v) ASC-tailored features in Chu [43].

2.3 DCASE 2013

Recent trends in the signal processing community have promoted reproducibility as a fundamental aspect of scientific research. This attitude relates to sharing code, datasets and tools in order to reproduce exactly the experiments described in papers: examples include music retrieval [47], speech recognition [48], source separation [49], speaker authentication [50] and anti-spoofing for speaker authentication [51].

Works prior to DCASE 2013 were typically performed with heterogeneous data (the quality of the microphone and the types and number of classes are some examples). As a result, most works were assessed using different databases of recordings. The DCASE challenge dataset, whose main objective was to support reproducibility and comparison with other solutions, addressed exactly this issue. The DCASE 2013 database contains recordings of the following acoustic scenes: bus, busy street, office, open-air market, park, quiet street, restaurant, supermarket, tube (underground train) and tube station. The database is split into two separate datasets of the same size, one publicly released for development and a second which is reserved for evaluation. Each of those datasets contains 100 recordings of 30-second audio files (WAV, 2-channel stereo, 44.1 kHz, 16-bit) with 10 samples per class. The development dataset was provided to participants with ground truth labels identifying each scene. Training, validation and testing of the system parameters are performed on a split of the development set. The split is obtained with a 5-fold cross-validation, which creates 5 non-overlapping test portions of 20 recordings, each paired with the remaining 80 recordings for training. The cross-validation covers the full dataset (meaning that each recording appears in exactly one of the testing splits). The result of the stratified 5-fold split is illustrated in Fig. 5.
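A stratified 5-fold split at file level can be sketched in Python as below. This is a minimal illustration assuming integer class labels, a round-robin assignment and an arbitrary random seed; the fold lists actually used in the challenge are not reproduced here.

```python
import random


def stratified_folds(labels, num_folds=5, seed=0):
    """Split file indices into folds with the same number of files per class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(num_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled files of this class to the folds in turn.
        for position, index in enumerate(indices):
            folds[position % num_folds].append(index)
    return folds


# 10 classes with 10 recordings each, as in the DCASE 2013 development set.
labels = [c for c in range(10) for _ in range(10)]
folds = stratified_folds(labels)

# Each fold is used once for testing; the other four form the training set.
splits = []
for test_fold in folds:
    test_set = set(test_fold)
    train = [i for i in range(len(labels)) if i not in test_set]
    splits.append((train, sorted(test_fold)))
```

Each resulting split has 80 training and 20 testing files, with two test files per class, and every recording is tested exactly once across the five folds.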

Once validated on the development set, the algorithm is submitted to be tested by the organisers on the withheld evaluation set. The evaluation protocol employs the same cross-validation used for the development set, so that each of the 5 folds contains 80 training files and 20 testing files. This is done for two main reasons: first, the possibility of selecting a good subset by chance is reduced; second, the composition of recordings from each acoustic class is balanced. Before doing any quantitative analysis of the data, it is necessary to listen to all of the 100 wave files which are part of the DCASE 2013 development set. This gives a general overview of the data and of which characteristics of the scene can be represented by the features. The following provides a qualitative description of each of the acoustic classes in the DCASE 2013 development dataset.

bus characterised by the engine noise, mainly concentrated below 300 Hz. In some recordings, there are door beeps (above 2000 Hz) which are repeated for a few seconds.


Figure 5: The 5-fold protocol in the DCASE development and evaluation sets. The split is always done at file level in order to obtain two completely disjoint training and testing sets.

Prominent sounds include gear changes or acceleration. There are also voices (from both passengers and artificial voices announcing the stops). Overall energy level is concentrated in lower frequencies.

busy street similar to the bus scene, but energy is distributed more equally across all frequencies. There are also sounds of traffic (passing cars, engines, brakes) with relatively less energy.

office low energy due to a predominance of silence. There are sparse events, such as keyboard typing, coughs, whispers and mouse clicks. The only repetitive sounds in the scene are a printer or an air conditioning fan.

park similar to office. This acoustic scene is characterised by quietness and silence interrupted by wind noise, footsteps, bells and bird tweets. Sounds of nature are the most prominent, while other sounds (e.g. traffic) can be heard in the far-field.

quiet street acoustically close to park. The two are difficult to distinguish even for human listeners. Even so, some sounds are prominent, such as walking on asphalt, which produces a specific sound. For some recordings, the presence of traffic closer to the microphone suggests street noise rather than park noise.

restaurant characterised by highly energetic impulsive sounds coming from forks, knives, plates. The background noise comprises a mix of overlapping voices.

supermarket similar to restaurant noise. Prominent sounds include the sounds of a cash register and radio music.

tube identified by doors, cyclic sounds of the carriages, door-opening beeps and artificial or recorded voices announcing the stops.

tube station similar to tube. The most noticeable difference is a stereo effect of trains passing from one channel to the other.

The DCASE 2013 database was publicly released together with a baseline ASC system. This baseline is based on MFCC feature extraction and a GMM classifier.



Figure 6: The difference between linear (a) and Mel-scale (b) filter banks. The Mel-scale filter bank has a higher resolution at low frequencies.

2.4 Features

The general success of machine learning techniques in solving this kind of classification problem relates to the form of the data. Feature extraction transforms raw data into a new representation space where underlying structure and patterns are easier to detect. In the following subsections, DCASE methods sharing similar characteristics are grouped under a specific feature family. Examples of these families include low-level audio features, cepstral features and spatial features.

2.4.1 MFCC: a baseline feature extraction

The use of MFCCs as audio features led to advances in speech and speaker recognition and music genre classification, among others. MFCCs have also been used in ASC as a reference feature extraction method. This section explains the reasons for their adoption.

Let $x[n]$ be the signal after being framed with a window of $N$ samples and $|X[k]|$ the absolute value of its fast Fourier transform (FFT). Frequency bins corresponding to a certain frequency range are mapped onto the Mel frequency bands, which approximate human pitch perception. The Mel filter bank has a higher resolution at low frequencies than at high frequencies. The difference between a linear and a Mel filter bank is shown in Fig. 6: frequencies in the [0, 5] kHz range are mapped onto the first 26 linearly spaced bands (left inset of Fig. 6); with Mel-spaced bands, 26 bands represent a [0, 3] kHz frequency range, resulting in a higher resolution at low frequencies (right inset of Fig. 6). The squared magnitude coefficients of the FFT are then multiplied by the corresponding Mel-filter weights and the results accumulated.

The $m$-th filter applied to a specific frequency bin $k$ is denoted $H_m[k]$, $M$ stands for the total number of Mel-scale filters and $K$ for the total number of frequency bins.

Hence, the log-power at each of the Mel frequencies is calculated according to:

$$ S[m] = \ln\left( \sum_{k=0}^{K-1} |X[k]|^2 H_m[k] \right), \qquad 0 \le m \le M, \tag{2} $$

where $M$ typically varies from 20 to 40 depending on the implementation and task.
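The Mel filtering of Eq. 2 can be sketched in plain Python. Several choices below are assumptions not stated in the text: the Hz-to-Mel conversion uses the common 2595 log10(1 + f/700) formula, the filters are triangular with peak value 1, the bank spans 0 Hz to half the sampling frequency, and a small floor is applied before the logarithm. The flat input spectrum is a made-up test signal.

```python
import math


def hz_to_mel(f):
    """Common Hz-to-Mel conversion (an assumed convention)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filter_bank(num_filters, num_bins, sample_rate):
    """Triangular filters H_m[k] over num_bins bins spanning 0..sample_rate/2."""
    nyquist = sample_rate / 2.0
    max_mel = hz_to_mel(nyquist)
    # num_filters + 2 edge frequencies, equally spaced on the Mel scale.
    edges_hz = [mel_to_hz(max_mel * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    bin_hz = [k * nyquist / (num_bins - 1) for k in range(num_bins)]
    bank = []
    for m in range(num_filters):
        low, centre, high = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        weights = []
        for f in bin_hz:
            if low < f <= centre:
                weights.append((f - low) / (centre - low))    # rising slope
            elif centre < f < high:
                weights.append((high - f) / (high - centre))  # falling slope
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def log_mel_energies(power_spectrum, bank, floor=1e-10):
    """Eq. 2: S[m] = ln(sum_k |X[k]|^2 H_m[k]); the floor avoids ln(0)."""
    return [
        math.log(max(sum(p * h for p, h in zip(power_spectrum, weights)), floor))
        for weights in bank
    ]


bank = mel_filter_bank(num_filters=26, num_bins=257, sample_rate=44100)
flat_spectrum = [1.0] * 257  # hypothetical flat power spectrum
S = log_mel_energies(flat_spectrum, bank)
```

For a flat spectrum the log-energies grow with the filter index, because filters that are equally spaced on the Mel scale become wider in Hz at high frequencies and therefore accumulate more bins.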

Discrete cosine transform (DCT) is the last step in the MFCC calculation. It encodes the rate of change in different spectrum bands as a sum of cosines at different frequencies and amplitudes:

$$ x_i = 2 \sum_{m=0}^{M-1} S[m] \cos\left( \frac{\pi i}{2M} (2m+1) \right), \qquad 0 \le i \le D, \tag{3} $$

where $M$ is the number of Mel-scale filters, $m$ the index of the current filter and $D$ the dimensionality of the feature vector $x$, whose $i$-th dimension is $x_i$. Any periodicities or repeated patterns in the log-Mel spectrum are represented by the corresponding MFCC coefficients. Thus, one reason for the success of MFCCs for ASC stems from representing general properties of the spectrum with a relatively small number of coefficients. There exist eight different ways to express the DCT, differing in particular in the period of the cosine. The DCT of type II extends a signal sequence to match a symmetric cosine of period $2M$. It has been shown to have a higher energy compaction [52]: the energy of the MFCCs is concentrated at lower indices than with other DCT variants. From a machine learning point of view, DCT-II energy compaction is preferable because it gives a higher fidelity representation of the original signal with fewer coefficients.
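Eq. 3 translates directly into a few lines of Python. The sketch below is unnormalised, exactly as in Eq. 3, and as an assumption it returns coefficients with indices 0 to num_coefficients - 1. The two input spectra are synthetic and chosen only to make the energy compaction visible.

```python
import math


def dct_ii(log_mel, num_coefficients):
    """Eq. 3: x_i = 2 * sum_m S[m] * cos(pi * i * (2m + 1) / (2M))."""
    M = len(log_mel)
    return [
        2.0 * sum(
            log_mel[m] * math.cos(math.pi * i * (2 * m + 1) / (2 * M))
            for m in range(M)
        )
        for i in range(num_coefficients)
    ]


# A constant log-Mel spectrum has all of its energy in coefficient 0.
constant = dct_ii([1.0] * 26, num_coefficients=13)

# A spectrum shaped like the first cosine basis function is captured
# entirely by coefficient 1; all other coefficients vanish.
smooth = [math.cos(math.pi * (2 * m + 1) / (2 * 26)) for m in range(26)]
smooth_mfcc = dct_ii(smooth, num_coefficients=13)
```

Both toy spectra over 26 bands are described by a single non-zero coefficient, which is the compaction property that motivates keeping only the first few MFCCs.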

MFCC is an approximation of a homomorphic operation [31], since MFCCs are obtained with the order of summation and logarithm reversed. It would have been homomorphic if Eq. 2 had been written as $S[m] = \sum_{k=0}^{K-1} \ln\left(|X[k]|^2 H_m[k]\right)$. The advantage of taking the logarithm of the output of the filtered energies $|X[k]|^2 H_m[k]$, i.e. after the summation, is greater robustness to noise. Conversely, taking the logarithm within the sum would amplify the small variations produced by noise before the Mel filtering.

MFCCs have proven to be particularly pertinent in the speech domain, because they approximate well the separation of the glottal excitation (source) from the vocal tract (filter). This separation is obtained by selecting only the first MFCC coefficients, because the logarithm allows the source and the filter to be separated with a simple subtraction. This operation is equivalent to taking the first 13 out of $M$ coefficients.

Since MFCCs were originally designed for speech, they attract criticism when applied to different domains such as ASC. The first criticism relates to the Mel filter bank resolution at low frequencies. While being beneficial for some acoustic scenes, a better resolution at low frequencies may penalise other classes. The second criticism concerns the choice of selecting the first 13 coefficients. In fact, 13 MFCCs encode information about the vocal tract, which is essential for speech or speaker recognition. Using the same number of MFCCs for a non-speech task such as ASC may not be optimal. The third criticism deals with the robustness of MFCCs in the presence of overlapping sounds. The DCT represents the rate of change in different spectrum bands, so changing the value of a single band can change several DCT coefficients.

MFCCs for bus noise and speech signals are illustrated in Fig. 7. Differences are visible in the lower coefficients for bus noise, where the noise of the engine creates harmonics captured by coefficients 1-2; speech, on the other hand, has a more complex structure reflected by a larger number of coefficients, including higher ones.

MFCCs were widely employed in the DCASE 2013 challenge [53, 54, 55, 56, 57, 58]. Nevertheless, some works extract cepstral coefficients from different time-frequency representations: in [58], for instance, the discrete wavelet transform (DWT) is used as an alternative to standard FFT spectrograms.
