
HAL Id: hal-01481934

https://hal.archives-ouvertes.fr/hal-01481934

Submitted on 3 Mar 2017


Real-time Audio Classification based on Mixture Models

Maxime Baelde, Christophe Biernacki, Raphaël Greff

To cite this version:

Maxime Baelde, Christophe Biernacki, Raphaël Greff. Real-time Audio Classification based on Mixture Models. The 42nd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017), Mar 2017, New Orleans, United States. hal-01481934


Real-time Audio Classification based on Mixture Models

Maxime Baelde¹·², Christophe Biernacki¹, Raphaël Greff²

¹ University Lille 1, CNRS, Inria    ² A-Volute

1. INTRODUCTION

Standard Machine Learning (ML) approach for audio classification:

Input sound x = (x[1], ..., x[P]) → Preprocessing (STFT, harmonic decomposition, ...) → Feature extraction (energy, MFCC, ...) → ML classifier (GMM, neural networks, ...) → ML model.

Our approach to real-time audio classification:

Input sound x = (x[1], ..., x[P]) → Preprocessing (STFT: S = (S[1], ..., S[N])) → Fit mixture model p(f | θ) using S → Dictionary D = {θ̂}.
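As a concrete illustration of the preprocessing block, a minimal Python sketch (numpy assumed) that turns one time-domain buffer into a magnitude spectrum S = (S[1], ..., S[N]); the Hann window is an illustrative choice, not specified on the poster.

    import numpy as np

    def magnitude_spectrum(buffer):
        """Magnitude spectrum S = (S[1], ..., S[N]) of one time-domain buffer."""
        windowed = buffer * np.hanning(len(buffer))  # taper to limit spectral leakage
        return np.abs(np.fft.rfft(windowed))         # one magnitude per frequency bin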

2. CREATE A DICTIONARY OF MODELS

How the sounds are grouped and split: each group of sounds G_i (i = 1, ..., I) contains sounds C_{i1}, ..., C_{iJ_i}, and each sound C_{ij} is split into sound buffers c_{ij1}, ..., c_{ijK_{ij}}.

Split a sound into buffers with a window size T and an overlap D between consecutive buffers, as sketched below.
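A minimal splitting sketch in Python (numpy assumed); T and D mirror the poster's notation, and the hop T − D between buffer starts follows from the stated overlap.

    import numpy as np

    def split_into_buffers(x, T, D):
        """Split signal x into buffers of length T, consecutive buffers overlapping by D samples."""
        hop = T - D                                  # step between buffer starts
        n_buffers = 1 + max(0, (len(x) - T) // hop)  # number of full buffers that fit
        return np.stack([x[i * hop : i * hop + T] for i in range(n_buffers)])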

Modeling of each buffer with a mixture model [2]: the fitted model approximates the buffer's magnitude spectrum as a function of the frequency f.

3. SOUND MODELS

Normalized spectrum:

$$ S_{ijk}[n] = N \, \frac{s_{ijk}[n]^2}{\sum_{p=1}^{N} s_{ijk}[p]^2}. $$
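In code, this normalization rescales the squared magnitudes so that the N bins sum to N; a one-function sketch (numpy assumed):

    import numpy as np

    def normalize_spectrum(s):
        """S[n] = N * s[n]^2 / sum_p s[p]^2, so that the N bins sum to N."""
        s2 = np.asarray(s, dtype=float) ** 2
        return len(s2) * s2 / s2.sum()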

Mixture model:

$$ p(f \mid \theta_{ijk}) = \sum_{m=1}^{M_{ijk}} \pi_{ijk}^{(m)} \, \mathcal{N}\!\left( f \mid \mu_{ijk}^{(m)}, \big(\sigma_{ijk}^{(m)}\big)^2 \right). $$

Model likelihood for binned data:

$$ L(\theta_{ijk}) = p(S \mid \theta_{ijk}) = \prod_{n=1}^{N} \left( \int_{f[n]}^{f[n+1]} p(f \mid \theta_{ijk}) \, df \right)^{S[n]}. $$
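A minimal sketch of evaluating this likelihood in log form (numerically safer than the raw product): each bin integral is a difference of Gaussian CDFs. Fitting θ itself would be done by EM as in [2] and is not shown; all names here are illustrative.

    import numpy as np
    from math import erf, sqrt

    def gauss_cdf(x, mu, sigma):
        """CDF of N(mu, sigma^2) at x."""
        return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

    def binned_log_likelihood(S, f_edges, pis, mus, sigmas):
        """log L(theta) = sum_n S[n] * log(integral of the mixture pdf over bin n)."""
        ll = 0.0
        for n, count in enumerate(S):
            p_bin = sum(pi * (gauss_cdf(f_edges[n + 1], mu, sg) - gauss_cdf(f_edges[n], mu, sg))
                        for pi, mu, sg in zip(pis, mus, sigmas))
            ll += count * np.log(p_bin + 1e-300)     # guard against log(0)
        return ll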

4. IDENTIFY NEW SOUNDS

Test sounds S: split with window size T and consider groups of R buffers S_1, ..., S_R.

Aggregate the likelihoods bottom-up through the hierarchy: from the buffer models p(S_r | c_{ijkr}), to the sounds p(S_r | C_{ijr}), to the groups p(S_r | G_{ir}).

Conditional probabilities of the groups G_{ir}:

$$ p(G_{ir} \mid S_r) = \frac{p(S_r \mid G_{ir}) \, p(G_{ir})}{\sum_{h} p(S_r \mid G_{hr}) \, p(G_{hr})}. $$

Aggregate the probabilities over the R buffers:

$$ p(G_i \mid S) = \prod_{r=1}^{R} p(G_{ir} \mid S_r). $$

Final decision (for every group of R buffers):

$$ \widehat{G}_i = \operatorname*{argmax}_{G_i} \, p(G_i \mid S). $$
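A compact sketch of this decision chain in Python: apply Bayes' rule per buffer, combine the R buffers in the log domain (the log of the product above), and take the argmax. The array shapes and the names log_lik and log_prior are assumptions for illustration.

    import numpy as np

    def classify(log_lik, log_prior):
        """
        log_lik:   shape (R, I), log p(S_r | G_ir) for R buffers and I groups.
        log_prior: shape (I,),   log p(G_i).
        Returns the index of the group maximizing p(G_i | S).
        """
        log_post = log_lik + log_prior                                    # Bayes numerator, per buffer
        log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)  # normalize over groups
        return int(np.argmax(log_post.sum(axis=0)))                       # product over buffers = sum of logs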

5. RESULTS & DISCUSSION

Cross-validation good classification rate (%), compared with state-of-the-art methods:

Dataset                  A-Volute   ESC-50   ESC-10
Our algorithm              96.5      94.0     96.0
Parametric method          73.6      45.5     73.5
Non-parametric method      46.6      53.2     76.0
Human                      91.8      81.3     95.7

Parametric method: standard GMM with standard features [1].
Non-parametric method: deep ConvNet with spectrogram features [3].

Complexity (example on the A-Volute database), O(number of operations):

Our algorithm            28 × 10⁶
Parametric method        22 × 10³
Non-parametric method    14 × 10⁶

6. RESOURCES

A website with a free demonstrator of the method is available.

7. REFERENCES

[1] C. Clavel, T. Ehrette, and G. Richard. “Events Detection for an Audio-Based Surveillance System”. In: 2005 IEEE International Conference on Multimedia and Expo. July 2005, pp. 1306–1309.

[2] G. McLachlan and D. Peel. Finite Mixture Models. Wiley, 2000.

[3] K. J. Piczak. “Environmental sound classification with convolutional neural networks”. In: 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP). Sept. 2015, pp. 1–6.

modal.lille.inria.fr · March 5-9, 2017 · maxime.baelde@a-volute.com
