Audio Inpainting


HAL Id: inria-00577079

https://hal.inria.fr/inria-00577079

Submitted on 16 Mar 2011


Amir Adler, Valentin Emiya, Maria Jafari, Michael Elad, Rémi Gribonval, Mark D. Plumbley

To cite this version:

Amir Adler, Valentin Emiya, Maria Jafari, Michael Elad, Rémi Gribonval, et al. Audio Inpainting. IEEE Transactions on Audio, Speech and Language Processing, Institute of Electrical and Electronics Engineers, 2012, 20 (3), pp. 922-932. ⟨10.1109/TASL.2011.2168211⟩. ⟨inria-00577079⟩


Research Report (Rapport de recherche)

ISSN 0249-6399, ISRN INRIA/RR--7571--FR+ENG

Domain: Audio, Speech, and Language Processing

Audio Inpainting

Amir Adler — Valentin Emiya — Maria G. Jafari — Michael Elad — Rémi Gribonval — Mark D. Plumbley

N° 7571

March 16, 2011


Centre de recherche INRIA Rennes – Bretagne Atlantique

Amir Adler, Valentin Emiya, Maria G. Jafari, Michael Elad, Rémi Gribonval, Mark D. Plumbley

Project-Team: Metiss

Research Report No. 7571, March 16, 2011, 24 pages

Abstract:

We propose the Audio Inpainting framework that recovers audio intervals distorted due to impairments such as impulsive noise, clipping, and packet loss. In this framework, the distorted samples are treated as missing, and the signal is decomposed into overlapping time-domain frames. The restoration problem is then formulated as an inverse problem per audio frame. Sparse representation modeling is employed per frame, and each inverse problem is solved using the Orthogonal Matching Pursuit algorithm together with a discrete cosine or a Gabor dictionary. The performance of this algorithm is shown to be comparable to or better than state-of-the-art methods when blocks of samples of variable durations are missing. We also demonstrate that the size of the block of missing samples, rather than the overall number of missing samples, is a crucial parameter for high quality signal restoration. We further introduce a constrained Matching Pursuit approach for the special case of audio declipping that exploits the sign pattern of clipped audio samples and their maximal absolute value, as well as allowing the user to specify the maximum amplitude of the signal. This approach is shown to outperform state-of-the-art and commercially available methods for audio declipping.

Key-words: Inpainting, clipping, sparse representation, matching pursuit.

A. Adler and M. Elad are with the Computer Science Department, The Technion, Haifa 32000, Israel. V. Emiya and R. Gribonval are with INRIA, Centre Inria Rennes - Bretagne Atlantique, 35042 Rennes Cedex, France. M. G. Jafari and M. D. Plumbley are with Queen Mary University of London, Centre for Digital Music, Department of Electronic Engineering, London E1 4NS, U.K. (e-mail: maria.jafari@elec.qmul.ac.uk).

This work has been submitted to IEEE Transactions on Audio, Speech and Language Processing. Part of this work has been presented at the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) in 2011 [1].

This work was supported by the EU Framework 7 FET-Open project FP7-ICT-225913-SMALL: Sparse Models, Algorithms and Learning for Large-Scale data.


Résumé: We introduce the concept of Audio Inpainting for restoring portions of audio data distorted by degradations such as clicks, clipping, or packet loss. In this context, the distorted data are treated as missing and the signal is decomposed into frames in the time domain. The restoration problem is formulated as an inverse problem in each frame. Each frame is modeled by a sparse representation, and the inverse problem is solved with the Orthogonal Matching Pursuit algorithm using a discrete cosine or Gabor dictionary. The performance obtained is comparable to the state of the art for blocks of missing samples of variable duration. We also show that the restoration quality depends more on the size of the blocks of missing samples than on the total number of missing samples. Finally, we introduce a constrained Matching Pursuit-type algorithm for the particular case of audio declipping, which exploits the amplitude properties of the saturated samples: sign, minimum amplitude and maximum amplitude. The performance obtained is superior to that of the state of the art and of commercial declipping software.

Mots-clés (keywords): Inpainting, clipping, sparse representations, matching pursuit.


Figure 1: Examples of restoration problems related to inpainting. (a) Speech signal corrupted by clicks (circles). (b) Clipped version (black) of a speech signal (gray). (c) The image inpainting problem: recovery of locally-hidden pixels. Panels (a) and (b) plot amplitude (from −1 to 1) against time (from 0 to 0.03 s).

1 Introduction

Speech and music signals are often subject to localized distortions, where the intervals of distorted samples are surrounded by undistorted samples. Examples include impulsive noise or clicks (see Fig. 1a), clipping (see Fig. 1b), CD scratches, packet loss in cordless phones or Voice over IP (VoIP), and more. In such situations, the distorted samples can be treated as missing. A restoration algorithm is employed to reconstruct the missing samples, in a similar way as for image inpainting (see Fig. 1c). However, in the audio field, such problems have been treated separately and, depending on the context, they have been referred to as audio interpolation [2–6], extrapolation [3,7,8], imputation [9,10], induction [11], (bandwidth) extension [12–15] or concealment [16,17].

Substantial effort has been focused on the restoration of audio signals corrupted by clicks due to old recordings or scratched CDs (see Fig. 1a). In this problem, intervals of corrupted samples from 20 µs to 4 ms [4] occur at random locations. Typical approaches employ autoregressive (AR) modeling [2,3], or Bayesian estimation to recover the corrupted samples [4]. Other methods utilize neural networks [18] or sinusoidal modeling [5,8]. A related problem is automatic speech recognition in the presence of isolated noisy samples. This problem is treated in [10] with a compressive sensing approach in the spectrogram image domain, and by solving an $\ell_1$-regularized least squares problem.


Another important though less often addressed problem is audio clipping [6,7,19], which refers to the truncation of the waveform beyond a threshold when the maximum range in an acquisition system is exceeded (see Fig. 1b). The clipped samples are arranged in groups and their locations are determined by the amplitude of the signal (rather than being randomly spread). The declipping problem is particularly challenging for this reason and because the information carried by the highest-amplitude samples is completely absent.

Long intervals of samples may be lost during transmission over cordless phones or in VoIP systems, where the problem is addressed using packet loss concealment algorithms [16,17]. Missing interval lengths are in the range of 5 ms to 60 ms, which is close to the typical duration of the pseudo-stationarity of audio signals. The low latency requirement in the VoIP case results in relatively simple algorithms; however, estimating missing packets in peer-to-peer repositories is a new application where higher quality reconstruction can be expected (as the latency requirement is less stringent).

Finally, the unreliable or missing audio data can be time-frequency regions [5,9,11,14,20], in classification applications like automatic speech recognition [9,20] or source separation with time-frequency localized interference; the phrase "audio inpainting" has been used once in this specific case [11]. Bandwidth extension [12–15] is another important time-frequency-domain application, where high frequency content is estimated from the low frequency content in order to provide high quality audio.

In this paper, we present a unified framework for the restoration of distorted audio data, leveraging the concept of Image Inpainting [21–23]. In the proposed framework, termed Audio Inpainting, the distorted data is assumed missing and its location is assumed to be known a priori. We further employ Sparse Representations (SR), which have been demonstrated to faithfully model audio signals [24,25] and to address the image inpainting framework [22,26,27]. The proposed approach is directly based upon those prior works.

The contributions of this paper are four-fold:

a) Audio inpainting is defined as an inverse problem, based upon the concept of image inpainting.

b) A framework for audio inpainting in the time domain is proposed, based on sparse representations. It exploits two possible dictionaries (discrete cosine and Gabor) known to provide accurate sparse models for audio signals.

c) The Orthogonal Matching Pursuit (OMP) algorithm for audio inpainting is adapted, in particular to deal with the properties of the Gabor dictionary.

d) A constrained matching pursuit approach is applied to significantly enhance the performance for audio declipping problems.

This paper is organized as follows. In Section 2, audio inpainting is formalized as an inverse problem. The proposed framework is introduced in Section 3, including the sparse models used for time-domain audio inpainting. The adaptation of the OMP algorithm for audio inpainting in the time domain and for audio declipping is presented in Section 4. Several experiments are proposed in Section 5, while we discuss our findings and draw conclusions in Section 6.


2 Audio Inpainting Problem Statement

We define audio inpainting as a general problem encountered in many applications: one observes a partial set of reliable audio data while the remaining unreliable data is either totally missing or highly degraded; the unreliable data is considered missing and it is estimated from the reliable data portion.

The general formulation of audio inpainting is given in Section 2.1 while several particular time-domain cases are detailed in Sections 2.2 and 2.3.

2.1 Formulation of audio inpainting

We consider a vector $s \in \mathbb{R}^L$ of audio data and an a priori known partition $\{I^m, I^r\}$ of the support $I \triangleq \{1, 2, \dots, L\}$ of $s$: $I^m \subset I$ and $I^r \triangleq I \setminus I^m$. We assume that the coefficients $s(I^m)$ are either missing or masked by a severe distortion. Thus, the observed data $y \in \mathbb{R}^L$ coincides with $s$ on $I^r$ only. The audio inpainting problem is defined as the recovery of the coefficients $s(I^m)$ based on the knowledge of:

1. the reliable data $y^r \triangleq y(I^r) = s(I^r)$,

2. the partition $\{I^m, I^r\}$,

3. additional information about the observed signal,

4. and, optionally, information about the missing data (see e.g. the case of clipping below).

In matrix form, the reliable data $y^r$ result from the linear model

$$y^r = M^r s, \tag{1}$$

where $M^r$ is the so-called measurement matrix obtained from the $L \times L$ identity matrix $I_L$ by selecting the rows $I^r$ associated with the reliable coefficients in $s$. In a similar way, the missing data to be recovered are $s(I^m) = M^m s$, where $M^m$ consists of the rows $I^m$ in $I_L$.
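As a small illustration of the linear model (1), the sketch below builds $M^r$ and $M^m$ by selecting rows of the identity matrix. The toy signal, the chosen missing indices and all variable names are assumptions made for the example, and indices are 0-based here whereas the text uses a 1-based support.

```python
import numpy as np

# Toy signal of length L (illustrative values, not from the paper).
L = 8
s = np.arange(1.0, L + 1)

# Assumed partition of the support: I^m (missing) and I^r (reliable), 0-based.
missing = np.array([2, 3, 4])
reliable = np.setdiff1d(np.arange(L), missing)

# Measurement matrices: rows of the L x L identity selected by each index set.
I_L = np.eye(L)
M_r = I_L[reliable]   # shape (|I^r|, L)
M_m = I_L[missing]    # shape (|I^m|, L)

# Eq. (1): the reliable data are y^r = M^r s, i.e. simply s restricted to I^r.
y_r = M_r @ s
s_missing = M_m @ s   # the samples an inpainting algorithm has to recover
```

In practice one would index the vector directly (`s[reliable]`) rather than form the matrices; the explicit matrices are only shown to mirror the notation.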

In the general audio inpainting framework, audio data can be either samples in waveforms or coefficients in transforms like time-frequency representations. The problem formulation above can be used for multi-dimensional signals like multichannel waveforms or time-frequency coefficients, by simply reshaping the signal matrix as a vector $s$.

In the rest of this paper, we only consider the inpainting of missing samples in a single-channel waveform. The multi-dimensional case is discussed in the conclusion (see Sec. 6).

2.2 Inpainting samples distorted by impulsive noise

In the particular case of a signal corrupted by impulsive noise such as clicks (see Fig. 1a), $I^m$ is a set of integers between 1 and $L$ and must be estimated in a preliminary stage. One often considers that the distorted samples are corrupted by a Gaussian noise $n$ with high variance. Hence, the complete observed signal includes both the reliable samples $y^r$ and distorted ones $y^m$:

$$\begin{cases} y^r = M^r s \\ y^m = M^m s + n, \end{cases} \tag{2}$$

where the samples $M^m s$ in $y^m$ are masked by $n$ so that they are considered as unknown.

2.3 Inpainting intervals of missing samples

In the case where intervals of samples are missing, due to packet loss during transmission or to masking by audible interferences, $I^m$ is composed of groups of consecutive integers: the samples $s(I^m)$ are totally missing and one only observes $y^r = M^r s$.

In the case of clipped signals, the samples to be estimated are also arranged in intervals of consecutive samples, as depicted in Fig. 1b. Their locations depend on the amplitude of the signal, such that

$$I^m \triangleq \{\, n \mid 1 \le n \le L,\ |s(n)| \ge \theta^{\mathrm{clip}} \,\}, \tag{3}$$

where $\theta^{\mathrm{clip}}$ is the clipping level. One observes both the un-clipped, reliable samples $y^r$ and the clipped, masked samples $y^m$:

$$\begin{cases} y^r = M^r y = M^r s \\ y^m = M^m y = M^m \operatorname{sign}(s)\, \theta^{\mathrm{clip}}, \end{cases} \tag{4}$$

where $\operatorname{sign}(\cdot)$ is the element-wise sign function. As presented in the next sections, the information provided by $y^m$, even though very crude (a sign per sample, and the clipping level), still substantially enhances the estimation performance.
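The clipping model of eqs. (3) and (4) can be sketched numerically as follows. The sinusoidal test signal, the clipping level and the helper name `clip_signal` are assumptions chosen for illustration; the point is only that the clipped samples form runs of consecutive indices and carry nothing but a sign and the level $\theta^{\mathrm{clip}}$.

```python
import numpy as np

def clip_signal(s, theta):
    """Hard-clip s at level theta and return (y, I^m, I^r) with 0-based indices."""
    y = np.clip(s, -theta, theta)
    missing = np.flatnonzero(np.abs(s) >= theta)   # eq. (3)
    reliable = np.flatnonzero(np.abs(s) < theta)
    return y, missing, reliable

# Illustrative signal: four periods of a sinusoid, clipped at 0.6.
t = np.arange(200)
s = np.sin(2 * np.pi * t / 50)
theta = 0.6
y, missing, reliable = clip_signal(s, theta)

# Eq. (4): reliable samples are untouched, clipped ones equal sign(s) * theta.
y_r = y[reliable]
y_m = y[missing]

# In practice s is unknown, so I^m is read off the observation y instead.
missing_from_y = np.flatnonzero(np.abs(y) >= theta)

# Clipped samples come in groups of consecutive indices.
groups = np.split(missing, np.flatnonzero(np.diff(missing) > 1) + 1)
```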

3 Time-domain framework and models

The proposed framework focuses on time-domain audio inpainting. It relies on a frame-based processing, as detailed in Section 3.1, and on the sparse representations modeling of audio signals, as presented in Section 3.2. Two dictionaries used in this modeling are introduced in Section 3.3.

3.1 Frame-based processing and reconstruction

As in many audio processing tasks, the signal is locally processed:

- by segmenting it into frames,

- by independently inpainting each frame,

- and by synthesizing the full restored signal using the overlap-add (OLA) method [28].

We decompose the signal into overlapping frames indexed by $i$, starting at time $t_i$ and weighted by an analysis window $w_a$ with length $N$. By straightforwardly adapting to the local frames the problem statement defined for the full signal in Section 2, the reliable samples in frame $i$ can be written as

$$y^r_i = M^r_i s_i \tag{5}$$

where $M^r_i$ is the measurement matrix of the $i$-th frame obtained from $M^r$ and $s_i(t) \triangleq s(t + t_i)\, w_a(t)$ is the windowed frame defined for $0 \le t \le N - 1$. We also define the supports $I^r_i$ and $I^m_i$ of the reliable samples and of the missing or masked samples, respectively. Once the estimation $\hat{s}_i$ of $s_i$ by some inpainting algorithm is achieved, the reconstruction of the full signal is obtained as

$$\hat{s}(t) \triangleq \sum_i w_s(t - t_i)\, \hat{s}_i(t - t_i) \tag{6}$$

where $w_s$ is the synthesis window such that $\sum_i w_s(t - t_i)\, w_a(t - t_i) = 1,\ \forall t$. In the proposed approaches, we utilized 64 ms frames with 75% overlap, a rectangular window for $w_a$ and a sine window for $w_s$.
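The framing and overlap-add steps of eqs. (5) and (6) can be sketched as below, with a rectangular analysis window and a sine synthesis window at 75% overlap. The function names, the short frame length and the random test signal are assumptions for the example. The text only states the constraint $\sum_i w_s w_a = 1$; dividing by the accumulated window product is one simple way to enforce it, not necessarily the authors' implementation.

```python
import numpy as np

def split_frames(s, N, hop, w_a):
    """Return (t_i, s_i) pairs with s_i(t) = s(t + t_i) * w_a(t)."""
    return [(t_i, s[t_i:t_i + N] * w_a)
            for t_i in range(0, len(s) - N + 1, hop)]

def overlap_add(frames, L, w_a, w_s):
    """Eq. (6), normalised so that sum_i w_s(t - t_i) w_a(t - t_i) = 1."""
    N = len(w_s)
    s_hat = np.zeros(L)
    weight = np.zeros(L)
    for t_i, frame in frames:
        s_hat[t_i:t_i + N] += w_s * frame
        weight[t_i:t_i + N] += w_s * w_a
    covered = weight > 0          # samples reached by at least one frame
    s_hat[covered] /= weight[covered]
    return s_hat

# Illustrative setup: N = 64 samples, hop = N / 4 (75% overlap).
N, hop = 64, 16
w_a = np.ones(N)                                   # rectangular analysis window
w_s = np.sin(np.pi * (np.arange(N) + 0.5) / N)     # sine synthesis window

rng = np.random.default_rng(0)
s = rng.standard_normal(N + 10 * hop)

frames = split_frames(s, N, hop, w_a)
# A real system would inpaint each frame here; frames are passed through as-is.
s_rec = overlap_add(frames, len(s), w_a, w_s)
```

With untouched frames the output reproduces the input, which is a quick check that the window constraint holds.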

3.2 Sparse Representations modeling of audio frames

In the Sparse Representations (SR) modeling framework [23], it is assumed that each frame is well approximated by a sparse linear combination of the columns of a (possibly overcomplete) dictionary:

$$s_i \approx D x_i, \tag{7}$$

where $D \in \mathbb{R}^{N \times K_D}$ is the dictionary, $N \le K_D$ and $x_i \in \mathbb{R}^{K_D \times 1}$ is the representation vector of the $i$-th frame. $x_i$ is assumed to be sparse, i.e. to have few non-zero coefficients compared to $N$. As a consequence, we can also utilize the SR model for the observed reliable samples in each frame

$$y^r_i \triangleq M^r_i s_i \approx M^r_i D x_i. \tag{8}$$

We propose to recover the unknown samples $s_i(I^m_i)$ by estimating as $\hat{x}_i$ the (sparse) representation vector of each frame, given only the clean observed samples (8) and limited side information (for the clipping case)

$$\hat{s}_i(I^m_i) = M^m_i D \hat{x}_i. \tag{9}$$
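To make eqs. (8) and (9) concrete, here is a minimal sketch of frame inpainting with a generic textbook Orthogonal Matching Pursuit run on the masked dictionary $M^r_i D$. It is an illustration under stated assumptions, not the adapted algorithm of Section 4: the function name `omp_inpaint_frame`, the fixed number of atoms, the column normalisation, and the synthetic two-atom test frame are all choices made for this example, and every masked atom is assumed to have non-zero norm.

```python
import numpy as np

def omp_inpaint_frame(y_r, reliable, D, n_atoms):
    """Estimate a full frame from its reliable samples with a plain OMP."""
    D_r = D[reliable, :]                     # M^r_i D: dictionary rows on I^r_i
    norms = np.linalg.norm(D_r, axis=0)      # assumed non-zero for every atom
    D_rn = D_r / norms                       # unit-norm masked atoms
    residual = y_r.astype(float)
    support, coef = [], np.zeros(0)
    for _ in range(n_atoms):
        k = int(np.argmax(np.abs(D_rn.T @ residual)))
        if k in support:                     # no new atom to add
            break
        support.append(k)
        # Least-squares fit on the selected atoms (the "orthogonal" step).
        coef, *_ = np.linalg.lstsq(D_rn[:, support], y_r, rcond=None)
        residual = y_r - D_rn[:, support] @ coef
    x_hat = np.zeros(D.shape[1])
    x_hat[support] = coef / norms[support]   # undo the normalisation
    return D @ x_hat                         # rows I^m_i give eq. (9)

# Illustrative dictionary: cosine atoms with a rectangular window and K = N.
N = 64
t = np.arange(N)[:, None]
j = np.arange(N)[None, :]
D = np.cos(np.pi / N * (t + 0.5) * (j + 0.5))

# Synthetic frame that is exactly 2-sparse in D, with a gap of 8 samples.
frame = 1.0 * D[:, 5] + 0.5 * D[:, 12]
missing = np.arange(20, 28)
reliable = np.setdiff1d(np.arange(N), missing)

frame_hat = omp_inpaint_frame(frame[reliable], reliable, D, n_atoms=2)
max_error = np.max(np.abs(frame_hat[missing] - frame[missing]))
```

The test frame is exactly sparse, so the gap is filled almost perfectly; real audio frames are only approximately sparse and the estimate degrades as the gap grows.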

This formulation, including the notion of sparsity, was first introduced for image inpainting [22] with a global treatment using global transforms. Then, efforts were dedicated to working on local patches (similar to audio frames) and to introducing a learned dictionary to improve the inpainting results [26]; these were further improved in [27] by modeling the problem better and by learning the dictionary directly from the corrupted image.

3.3 Dictionaries

We propose two options to choose a dictionary $D$ in which audio signals are sparse: the Discrete Cosine Transform dictionary, and a Gabor dictionary. Both are widely used for sparse models of audio signals [24,25,29]. Other fixed dictionaries such as multiscale DCT [30], or learned dictionaries [26] specific to particular inpainting tasks may also be interesting options.


3.3.1 Discrete Cosine Transform (DCT) dictionary

The first option consists in a windowed Discrete Cosine Transform (DCT) overcomplete dictionary $D^c = [d^c_0, \dots, d^c_{K_c - 1}]$, atom $j$ being defined for $0 \le j \le K_c - 1$ and $0 \le t \le N - 1$ as

$$d^c_j(t) \triangleq w_d(t) \cos\left(\frac{\pi}{K_c}\left(t + \frac{1}{2}\right)\left(j + \frac{1}{2}\right)\right) \tag{10}$$

where $K_c$ is the size of the DCT dictionary, i.e. the number of discrete frequencies, and $w_d$ is a weighting window set by the user. This choice is motivated by the wide use of windowed DCT atoms for sparse representation of audio signals [25]. However, the zero phase of $D^c$ atoms is not adapted to audio signals that are made up of sinusoidal components with initial phase distributed between 0 and $2\pi$. As a consequence, the DCT model acts as a basis rather than as a synthesis model and the signals are not really sparse in $D^c$.
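A possible construction of the dictionary of eq. (10) is sketched below. The function name `dct_dictionary` and the sizes are assumptions for the example, and the window defaults to a rectangular one. With a rectangular window, $K_c = N$ yields an orthogonal (DCT-IV type) basis, while $K_c = 2N$ yields an overcomplete dictionary whose rows remain orthogonal.

```python
import numpy as np

def dct_dictionary(N, K, window=None):
    """N x K matrix whose column j is the atom d^c_j of eq. (10)."""
    w_d = np.ones(N) if window is None else np.asarray(window, dtype=float)
    t = np.arange(N)[:, None]      # time index, 0 <= t <= N - 1
    j = np.arange(K)[None, :]      # frequency index, 0 <= j <= K - 1
    return w_d[:, None] * np.cos(np.pi / K * (t + 0.5) * (j + 0.5))

N = 64

# Square case K_c = N: columns are mutually orthogonal with squared norm N / 2.
D_square = dct_dictionary(N, N)
gram = D_square.T @ D_square

# Overcomplete case K_c = 2N: twice as many atoms as samples per frame.
D_over = dct_dictionary(N, 2 * N)
row_gram = D_over @ D_over.T
```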

3.3.2 Gabor dictionary

The second option aims at sparsely modeling arbitrary-phase sinusoidal components by using a Gabor dictionary $D^g = \{d^g_{j,\phi}\}_{(j,\phi) \in \Gamma}$ in which the atoms are indexed by a continuous set $\Gamma \triangleq \{0, \dots, K_g - 1\} \times [0, 2\pi)$ and are defined as

$$d^g_{j,\phi}(t) \triangleq w_d(t) \cos\left(\frac{\pi}{K_g}\left(t + \frac{1}{2}\right)\left(j + \frac{1}{2}\right) + \phi\right), \tag{11}$$

where $K_g$ is the size of the Gabor dictionary.

Note that in the current case of a continuously-indexed dictionary, eqs. (7), (8) and (9) are still valid if we define

$$D^g x_i = \sum_{\substack{(j,\phi) \in \Gamma \\ x_i(j,\phi) \neq 0}} d^g_{j,\phi}\, x_i(j,\phi) \tag{12}$$

where $x_i = \{x_i(j,\phi)\}_{(j,\phi) \in \Gamma}$. Indeed, eq. (12) is a finite sum since only a few coefficients in the sparse representation vector $x_i$ are non-zero. The algorithmic aspects of this decomposition will be addressed in Sections 4.2 and 4.3.
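One reason a continuously-indexed phase remains tractable is the identity $\cos(a + \phi) = \cos\phi \cos a - \sin\phi \sin a$: any atom of eq. (11) lies in the plane spanned by the cosine and sine atoms at the same frequency. The sketch below checks this numerically. It is offered only as an illustration of the identity, under assumed sizes and names (`gabor_atom`, a rectangular $w_d$), and is not a statement of how Sections 4.2 and 4.3 proceed.

```python
import numpy as np

N, K_g = 64, 128
t = np.arange(N)
w_d = np.ones(N)    # rectangular weighting window (assumption)

def gabor_atom(j, phi):
    """Atom d^g_{j,phi} of eq. (11) sampled on 0 <= t <= N - 1."""
    return w_d * np.cos(np.pi / K_g * (t + 0.5) * (j + 0.5) + phi)

# An arbitrary frequency index and phase, chosen for illustration.
j, phi = 10, 0.7
atom = gabor_atom(j, phi)

# Zero-phase (cosine) atom and quadrature (sine) atom at the same frequency.
cos_atom = gabor_atom(j, 0.0)
sin_atom = gabor_atom(j, -np.pi / 2)    # cos(a - pi/2) = sin(a)

# Any phase is a linear combination of the two.
recombined = np.cos(phi) * cos_atom - np.sin(phi) * sin_atom
max_diff = np.max(np.abs(atom - recombined))
```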

4 Audio inpainting algorithms based on Orthogonal Matching Pursuit

For a given dictionary $D$, we use the Orthogonal Matching Pursuit algorithm to perform the inpainting of an audio frame, as presented in Section 4.1. Some dictionary-dependent algorithmic stages are then detailed in Sections 4.2 and 4.3. An extension of the algorithm specific to declipping is finally detailed in Section 4.4.