Sound classification using summary statistics and N-path filtering

(1)

Sound Classification Using Summary Statistics and N-Path Filtering

Daniel Villamizar", Daniele Battaglinol , Dante G. Muratore", Reza Hoshyarl andBori s Murm ann*

*Departmentof Electrical Engineering,St anford University, California, USA t NXP Semiconductors, Mougins, FR & EURECOM,Biot , FR

t Texas Instruments, SantaClar a, California, USA

Abstract-Always-ons oundclas sification is a desirable but power-intensive function fora variety of emerging Internet of Everything applications. This work explores the accuracy- complexity tradeoff byusin g summary statisticsforclas sifying semi-stationary sounds. Compared to contemporary solutions including deep learning, this approach requires oneto three orders of magnitude fewer parameters and can therefore be trained overten times faster. We propose a mixed-signal design using N-path filters for feature extraction to further improve energy efficiency without incurring a large accuracy penalty for a binary classification task (less than 2.5 % area reduction under receiver operating characteristic curve).

Index Terms-acoustic environment recognition, acoustic tex- tures, baby cry detection, speaker identification, Internet of Everything, N-path filter, passive mixer

I. I NTROD UCTIO N

Thet ask of soundcl assification isa topic ofactivere search duet o its broad applications pace.Re cent advances indeep learnin g have shown super-human performance forth ese tasks buttheir accuracycome s attheco st ofhi gh compl exity, high computational loads,a ndl arge training datasets. For Internet of Everything(IoE )s ystems,iti s import ant to perform thecla ssification inre al time, using lowpower , andin an always-on fashion [I], [2],[3].To staywithinth eir power budgets , systemsth at leverage large deep learnin g model s must therefore resort tocloud processing attheco st ofhighlatency andpriv acy issues. These issues motivatethede velopment of systemsthatc an produce the inference results locally.

To implement local inference atl ow powercon sumption, iti s necessary tore strict model parameter count as much as possible. Further, foremer ging applications whereonlylimited trainin g datai sa vailable,iti s imper ative toh ave amodel thatcanquickl y converge ontr ained parameters. Finally, it is desirable tob enefit fromapplic ation-specifics implifications whenthereis a pri oriknowledgea boutth es ignalsofintere st.

Tomeetth ese conditions, we studyacla ssifier that employs summarystati stics asefficient andfl exible featuresa ndpre sent a comparison toothercla ssifiers designed forthe sameta sks.

We showthatin some casesthecla ssifier can train lOx faster while being IOOOx smallerth an adeepl earning network and only exhibiting moderate accuracy loss. Wethenpropo se a CMOSchip implementation designed to perform the required signal processing atultra-lowpowerwhile maintaining flexi- bilityforav ariety of applications.

There st ofthi s paper is organized as follows. Section II introduc es the summary statisticsfeature set.S ection III presentsa learning rate comparison toa convolutional neural networkde signed forthe sameta sk. SectionIV compares the size and computational complexity ofthepropo sed classifier with published workontwoaddition al soundcl assification datasets. Section Vintrodu ces amixed -signal approa ch to furtherincrea se the computationale fficiencyofthecla ssifier andpre sents initials imulatedre sults.

II. CLASSIFICATIONW ITHS UMMARY STATISTICS

Distinguishing soundsthatarenot strictly stationary but exhibit semi-stationary properties constitutes ausefulta sk for manyIoE systems.Th ese types of signalslendthem selves to proce ssing techniques that exploit their structure. Examples of soundsthatfallinthi s category includemotor s, rain, a crowdedroom , insects , babycry , sirens, applause,andeven voicetimbre .Ex tractingu seful information fromthe se types ofsi gnals ina computationally-efficient manner requires a feature setde signed withthi s purpo se inmind.Forthiswork , we choseto evaluatea bio-inspired analysi s basedonthe workof McDermott and Simoncelli [4],wheretheauthor s propose a setof statisticsth at repre sent human auditory cortex functionality. Subsequent workte sted theses tatistical sound properties onaclassific ation taskin[5]and showed experimental results , butdidnot compare performance and computation cost toother techniques on common datasets.

Wefurther showthatan important benefito f this technique is thatitmapsintoanefficient circuit implementation inCMOS technology(s eeSe ction V).

The summary statistics(SS)fe atures etis composed of fourmain categories: (I ) audio sub-bandcentral moments, (2)sub -band correlations, (3) modulation-band power, and(4 ) modul ation-band correlations. The mathematical definitionof these feature s is described in[4] . Here , weonly illustrate the resultin gs ignalchain ass ummarizedinFig.I .

Wee valuated thefeature setwith multiple modelsandcon- cludedthatthep erformance ofthe classification is dominated bythemappin ge xpressivenessofthefeature s andnotthe classifier modelu sed. We therefore chose a Support Vector Machine (SVM ) model employing alinearkern el to ensure

(2)

TABLE I

CON vN ETE MPLOY ED FORB ABYC RYD ETE CTIO N

Fig. 2. ROCcurveforbot h classifiers evaluated ontheBabyCrydata set.

Accuracyperformance is essentially equivalent.

layer nameoutputsizelayerparameters

input250 x40

5 x 5, 8,s trideI , batchnorm convl 84 xl4 3 x 3 max pool, stride 3

dropoutp

=

0.1 5 x 5, 16,stride I, batchnorm

conv229 x6 3 x 3maxpoo l, stride3

dropoutp

=

0. 2

output IxI 1024-d, fc, softmax

dropoutp

=

0. 5

1.0 2.89 x

10

12.5x

10:

0.20 .4 0.60.8

False positive rate FLOPs

Parameters

0.0

I ,

~_,-...

--- --- I I

SS-SVM(AUC= 0.9 3)

I

1=

ConvNet(AUC= 0.92) ,

0.0

oS^{0. 8}

~

~ 0.6

""

·iii 8.0.4

Q)2

I-0.2 1.0

Forthe comparison we usedthe dataset described in [ I I ].

The observed ROCcurveisshowninFig.2.The learning curves forboth systems were reported inFig.3tounder- standwhytheSS -SYM classifier slightly outperforms the ConvNet. For this analysis, the classifiers weretrainedon progressivelyl arger subsets of the training dataset andthe errorratewascalcu lated foreach.The SS-SYM classifier showsfaster learningwhichis consistent withthefactthat it possesses muchfewer parameters thanthe ConvNet (2.9M vs.2k)andis lesslikelytosufferfromoverfit.Forthetask

evaluated , the ConvNet requires lOxmoretrain ing datathan theSS -SYM classifier toachieve10 % error rate(seeFig.3) . Thisisan important tradeoff in applications thathave limited training dataor require fast online learning (e.g., reinforcement learning) and should be considered when choosing inference systems.

IV. COMPUTATION COST AND COMPLEXITY COMPARISON

To understand the tradeoffs in computational costandcom- plexity, weeva luate theSS -SYM classifier ontwo additional datasets. Foralleval uated tasks,the number of featuresareof the order of the number of training samp les, thusa lso being likelytosufferfromoverfit.

A. Environmental SoundClassifi cation

Forthistaskwe evaluated onthe datasets published in [12],[13], [14],[8], [I S], [10], namely DCASE 2016and

SVM classifier (one vs. all, linear kernel)

III. LEARNING RATE COMPARISON

pred icted class audio

low inference computation cost.I We therefore refer tothe class ifier as SS-SYM. Forthismode l choice, the number of learned parameters aresimp ly givenby M x D whereM is the number of classes andDisthe number of features used. Performance tradeoff comparisons between this classifier and recent published workare summarized inthefo llowing sect ions.

ITheSYMmodelw as implemented using thelibl ine arpackageusing one-vs.-all learners formulti -class tasks. Regularization hyper-parameterwa s optimized foreachtrainingrun.Experimentsu sing RBForqu adratic kernels didnoty ield appreciable gains. Otherclassifier s testedincludeGauss ian Discriminant Analysis andMultinomia l Logistic Regressionwhichalsodid notpe rform significantly differentthantheSYMmodelused .

Fig. I. SS-SYMcla ssifier diagram. Features aredefinedin [4] and are flattenedandconcatenatedasinput s toanSYMclassifier .

Deep learning techniques are increasingly beingusedfor thetasks of environmental sound detection. For example, out of the top-10 submitted entries tothe2016and2017 DCASE challenges [6],[ 7], fifteen use ConvNets either exclusively or in combination with other feature extracting techniques.

A similar task of practical application (butlower complexity)isthat of detecting babycryevents.In order to compare the tradeoffs between deep learningandthe summary statistics classifier we designed a ConvNet forbabycryde tection. The inputtothe network isa5 second mel-spaced spectrogram of 40 frequency binsand250timeframes of 20msd uration com - putedfromtherawaudiosigna l. The network iss ummarized inTable I. A ReLU activa tion function wasusedaftermax pooling fora ll convo lutionall ayers.Simi lar networks have been reported in[8],[9], [10], [II].

(3)

Reference Accurac y Params( x 106 ) Method Reynolds [18] 99.5

0.95 MFCC+GMM

Stadelman [19] 100

Lukic[17] 97 275.97 ConvNet

This work 97.2 1.08 SS+SVM

TABLE II

ACCURACY (IN% )OFC LASS IFIERS ON THE T1MIT DATASET FOR SPEAKER IDENTI FICATIO N.

hand-crafted featuresets[18],[19].TableII summarizes this comparison.

100

I ...

^{SS-SVM I}

-0- Cc nvblet ]

... --'"'

Q.

, , , ,

'a.

20 4060 80

Percent age of Trainin g Set Used (%)

O '--- --~--~---~--~-- ---'

o

2 5 1 -~--~-~- --;::::; ~ ::::;::: :;l

20

Fig. 3. En-or vs. trainin gs amplesu sed forb oth classifiers evaluatedonthe Baby Crydataset.Thelearnin g rateo f theSS -SVM classifieris muchhi gher whilethe ConvNet is likelytooutpe rform theSS-SVM withm ore training data .

Fig. 4. Comparison of classifierso ndifferent environmental soundd atasets . Size ofthecirclesrepre sents themodelp arameter count.Number s inb rackets arethecitation referencesa ndthe parameter countislabeledbel ow. Operation counti s defined as a multiply -add operation and was calculated fora single inference performed onth e DCAS E 2016d ataset for3 0sclip ssa mpled at 16kH z. Thed ata points fortheUS8Kdataset[13]wereadded as an indirect comp arison through the work in[8] . Comparison is drawn betweens ystems withoutre sorting todataau gmentation .

V. HARDWAR E IMPLEME NTATIO N

Tofurtherincreasetheenergyefficiencyofaudio classifier systems, someofthefeature extraction processing function- alitycanbe implemented intheanalogdomain , closetothe sourceoftheaudio signal.Thisgeneralideahasalreadybeen investigated inpriorre search. Theworkin[2] presents a 6nWfront-endbutlimit s the signalsofinteresttofixedtones under500Hzandtolimited dynamic range.Otherworkhas shownthatanalogfilteringisefficientinpe rforming frequency analysisforthe purpose ofaudio classification. Theworkin[3]

presents a710nWfront-endand[20]demon strates a380nW front-end , butinbothcasesthefilterbanksareba sed ongm-C topologies, which sutler fromlimited configurability because thefilter bandwidths and center frequencies dependonboththe absoluteandrelativeaccur acy ofc apacitors andbias currents.

Additionally, in[3]thefeaturesmustbele arned on a chip-to- chipba sis byatrainin gs tep, whichi s undesirable formass- produced devices.

Wepropo se an approach that overcomes theabove- described limitations whilebein g particularly suitableforthe extraction of summary statistics. Theenvi sioned systemis described inFig . 5.Itfollowsamixed- signal approach , where the different sub-blocks are partitioned between analogand digital implementations. Because the feature-set extraction can be approximated using passive switched capacitor circuits, this approachpromi ses tobeenergyefficientwhileal so offeringa simplemean s for frequency tuningvi a thesystemclockand on-chip clock dividers.

The subband filterbanki s composed ofanalogN -path filters wheretheir respective center frequencies canbesetbytheir switched capacitor clock rateandtheir bandwidths canbeset byan adjustable baseband resistor valueas described in[21].

Following each filter, we employ adirect -conversion mixer withalow-pas s filteratba seband to extract the sub-band envelopesi gnals depicted intheblocklabeled "envelope"in FigI.The resulting analogsignalha s a bandwidth of200Hz andcanthereforebe sampled byananalog -to-digital converter sampling atonly400 Sis. The remaining feature s canbe extracted digitally at significantlylowerenergie s thanifthey were extracted fromtherawaudiosignal , whichwouldbe traditionally digiti zed at16 kS/s(mediumquality audio sample rate).

ESC10 ESC50 US8K DCASE

104 Operati on c ount (in millions) this work

30k 90

85

~;:: 80 oell

~ 75 oo

~ 70

.Q1§⁶⁵

;;:::

'00gj60

<:5

55

50L -_ _~ _ ~ _ ~ ~ ~ ~ - - . J . . . ~ _ ~ ~ _

103

B. Speaker Identification

Forthetaskof speaker identification weevalu ated the SS-SYM onthe TIMIT dataset[16].The SS-SYM slightly outperforms the ConvNet approach in[17]atafractionof themodel parametersa ndlieswithin3 % ofthehighe st- performing systemsthat employ GMMmodel s withother ESC.Theseare classification problems with10 -50 classes and therefore require systems withmore information capacity than thebin ary classification taskofbabycry . Theperform ance vs.

costtradeoff " ofthe SS-SYM and severalrecently published deep learning classifiers is summarized inFig . 4. Theplot also illustrates therelativesizesofthe parameter count for the different classifiers. Whilethedeep learning methods for complextasksarecle arly superior in accuracy, this analysisis usefulfor determining thecostofusingsuch approaches over smaller and lighter classifiers.

(4)

V,

Fig. 6. Comparisonbet ween perfectrecon struction half-cosine filterbankand N-path implementation .

i':IIMMMI\M]

^o ^{100 0} ²⁰⁰⁰ ³⁰⁰⁰⁴⁰ ⁰⁰ ⁵⁰⁰⁰ ^6000700 ⁰⁸⁰ ⁰⁰

j,: ~

o 100020 00 3000 40 00 5000 60007 00080 00

!req (Hz)

v=r----

^... I

-

,-

-/

"

- _I

-1-

^---- ^Ideal(AUCImplementation(AUC⁼ ^0.93) = 0.9 1)

^I

0.0

1.0 2 0.8

~

~ 0.6

:;::;

·iii 8..0 .4

ell2 I-0. 2 t/J,

~c I

^L

t/J,

~ c I

^L

t/J3

~ c I

^L

~ c

N-p ath Bandpa ss Filter

I

^L

Passive Low-p ass Mixer

Fig. 5. Classifier system showingan alog anddigitalsignal chainp artition forener gy-efficients ignalproces sing. Simplifiedsin gle-ended circuit implementations oftheanalo g processing is shownaswell .

TABLEIII

SIMULATE D PO W ER CO NSUM PTI ON OF A NAL OG F RONT - E ND Fig. 7. ROC curves forbab y cry classification using ideal vs. implemented feature extraction.

logic . Weexpecttheactualpowercon sumption of a fullchip toincre ase between20-30 % to accountforlong -distance clock routingandother nonidealities. Ourfutureworkwillevaluate the performance andpowercon sumption through measurement ofthefabricatedchip.

VI. CONCL USION

Wehavepre sented a quantitative analysi s ona compact audiocl assification modelwithtest error that converges about tentime s fasterthanadeeple arning model , whilerequir- ingonetothreeordersof magnitude fewer parameters for training. Insomecases , the classifier accuracy is competitive withdeeplearning techniques andother engineered feature extraction, while consi stent ly maintaining lower computation count.Wehavefurther demonstrated potentialbenefitsof a CMOSmixed- signal circuit implementation to extract the samefeaturesand observed that theimpo seds implifications donot significant ly degrade the classification accuracy.

0. 2 0 .4 0.6 0.8

False positive rate

1.0

361 689

Analog( nW)Tota l (nW) 27652

Digital (nW) 0.0

ClockingLogic

These implementation choices change the original signal processing reported in McDermott etal.[4].Themain dif- ferencesaretwofold.First , the bandpass filterbanks , which were originally proposed as half-cosine orthogonal filterbanks , are implemented asN-pathfilterswith equivalent 3-dBband - widths. The differences infiltertran sfer functionm agnitudes are illustrated inFig . 6. Second, the envelope extraction step, whichwasorigin ally computed asthe magnitude ofthe analyticsignal(i.e ., Hilberttran sform), is implemented asa passivedirect demodulation combined withalow-passfilterat baseband. Forsimplicity, single-endedversionsofthese circuit implementation choices areal so illustrated inFig5.Inactual implementations, their differential counterparts wouldbeused.

Using harmonic transfer matrix models [22]to approximate theN -path filter transfer functionandasimplified time-domain modelforthe demodulation step,weevaluatethe effectiveness ofthe implementation toachieve comparable classification performance. The resulting feature sa rethenpa ssed tothe sameSVM classifier usedintheori ginals ystemde scribed in Section II.Fromthi s simulation used onthetaskofbaby cry detection, therewasasmall degradation observed inthe ROC characteristics ass howninFi g 7.

Theanalogfront-endshowninFig5was designed ina CMOS 130nmproces s inorderto simulate the models used fortheprevious classification results atthe transistor level.

Theeffi ciency ofour proposed architecture isillustr ated by the simulatedpowercon sumption summarizedinTab le III.

Theanalogpoweris spentonlevel shifters andondriving theirre spective switches, whilethedigitalpoweris separated tosh ow howmuchisusedfor clocking versus supporting

(5)

R EFERENCES

[I] N. D. Lanea nd P. Georgiev, "Can Deep Learning Revolutionize Mobile Sensing?" inInternational WorkshoponMob ile Computing Systems and Applications (HotMobile), 2015.

[2]S . Jeong , Y. Chen,T . Jang, J. M.-L. Tsai, D. Blaauw, H.-S. Kim, andD. Sylvester, "Always-On1 2-nWAc oustic Sensing and Object Recognition Microsystem forU nattended Ground Sensor Nodes,"IEEE J. Solid-State Circuits,vol.5 3, no. I, pp. 261-274, Jan2 018.

[3] K. M. H. Badami, S. Lauwereins ,W. Meert, andM. Verhelst, "A 90 nm CMOS, 6 uW power-proportionalaco ustic sensing frontendfor voice activity detection ,"IEEE J. Solid-State Circuits,vol. 51, no.I ,pp.29 1- 302,J an2 016.

[4]J . H. McDermotta nd E. P. Simoncelli , "Sound texture perception via statistics ofth ea uditory periphery: evidencef romso und synthesis."

Neuron,vol. 7 1, no. 5, pp. 926-40, Sep2 011.

[5] D. P.W. Ellis, X.Ze ng, andJ.H . McDermott, "Classifyingso undtracks with audio texture features," in2011I EEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ,20II .

[6] T.Virtanen, A. Mesaros ,T.Heittola , M.Plumb ley, P. Foster, E.Bene tos, andM . Lagrange, Detectiona nd Classification of Acoustic Scenes and Events Workshop (DCASE), 2016.

[7] "Acousticsc enecl assification dcase2017,"

http://www.cs.tut.fi/sgn/argldcase20 17/challenge /task-acoustic -scene- classification-results , [Online ; accessed 24-0ctober-2017].

[8] K. J. Piczak, "Environmental Sound Classification with Convolutional Neural Networks;'IEEE International Workshopon Machine Learning forSig nalP rocessing,2 015 .

[9]J . Salamon andJ . P. Bello, "Deep Convolutional Neural Networks andD ataA ugmentation for Environmental Sound Classification,"IEEE SignalP rocessingLette rs,vol. 24, no. 3, pp. 279-283, Mar2 017.

[10]M . Valenti, S. Squartini, A. Diment,G . Parascandolo,a ndT . Virtanen ,

"A convolutional neural network approach forac oustic sceneclassi - fication," in20/7 InternationalJo int Conference onN eural Networks (!JCNN),2017.

[II] R. Torres , D.B attaglino , andL.L epauloux , "Baby cryso und detection : Aco mparison ofhand craftedf eaturesa ndd eep learninga pproach ,"

in International Conference onEngi neering Applica tions of Neural Networks,2017.

[12] K. J. Piczak, "ESC: Dataset for Environmental Sound Classification ," in ACM International Conference on Multimedia (MM), 2015.

[13] J.Sa lamon andJ . P. Bello, "Unsupervised FeatureL earning for Urban Sound Classification," inIEEE International Conference on Acous tics, Speech and SignalP rocessing (ICASSP) , 2015.

[14]D . Stowell , D. Giannoulis , E.B enetos, M. Lagrange , andM.D . Plumbley, "Detectionand Classification of Acoustic Scenesand Events,"

IEEET rans. on Multimedia,vol. 17, no. 10, pp. 1733-1746, Oct2 015.

[15] Y. Aytar, C. Vondrick , and A. Torralba ," SoundNet:L earningSo und Representations from UnlabeledVid eo," inConference on NeuralI nfor- mation ProcessingSy stems (NIPS), 2016.

[16] W. M. Fisher, G. R. Doddington , andK . M. Goudie-Marshall , "The DARPAS peechR ecognition Research Database: Specifications and Status," in Proceedings of DARPAWo rkshop onS peechR ecognition , 1986, pp. 93-99.

[17] Y. Lukic, C.Vogt , O. Durr, andT.Stadelmann, "Speake rId entification and Clustering Using Convolutional Neural Networks ," inIEEE Inter- nationalW orkshopo n Machine Learning forSig nalP rocessing(M LSP), 2016.

[18] D. A. Reynolds ," SpeakerId entificationa ndV erificationUs ing Gaussian Mixture Speaker Models,"Speech Communication ,vol. 17, no. 1-2, pp.

9 1-108 , Aug1 995.

[19] T.Stadelm ann andB . Freisleben ," UnfoldingS peaker Clustering Poten- tial," inACM InternationalCo nference on Multimedia (MM), 2009.

[20] M.Y ang, C.-H.Y eh,Y. Zhou, J.P . Cerqueira , A.A .L azar, andM. Seok,

"A Ip;W voicea ctivity detector usinga nalogf eatureex tractionand digital deep neural network ; ' inIEEE InternationalSo lid-State Circuits Conference (ISSCC) , 2018.

[21] M. Darvishi, R. vand erZee,a nd B. Nauta," Design ofA ctive N-Path Filters," IEEE J. Solid-State Circuits,vol. 48, no.12 , pp. 2962-2976, Dec2 013.

[22]S . Hameed , M. Rachid , B. Daneshrad , and S. Pamarti, "Frequency- DomainA nalysis of N-Path Filters Using ConversionM atrices ,"IEEE Trans. onCirc uits andSys tems II: Express Briefs,vol. 63, no. I ,pp.

74-78, Jan 2016 .