
Audio Inpainting


HAL Id: inria-00577079

https://hal.inria.fr/inria-00577079

Submitted on 16 Mar 2011


Amir Adler, Valentin Emiya, Maria Jafari, Michael Elad, Rémi Gribonval, Mark D. Plumbley

To cite this version:

Amir Adler, Valentin Emiya, Maria Jafari, Michael Elad, Rémi Gribonval, et al.. Audio Inpainting.

IEEE Transactions on Audio, Speech and Language Processing, Institute of Electrical and Electronics Engineers, 2012, 20 (3), pp. 922-932. ⟨10.1109/TASL.2011.2168211⟩. ⟨inria-00577079⟩

Research report (Rapport de recherche)

ISSN 0249-6399, ISRN INRIA/RR--7571--FR+ENG

Domain: Audio, Speech, and Language Processing

Audio Inpainting

Amir Adler — Valentin Emiya — Maria G. Jafari — Michael Elad — Rémi Gribonval — Mark D. Plumbley

N° 7571

March 16, 2011


Centre de recherche INRIA Rennes – Bretagne Atlantique

Amir Adler, Valentin Emiya, Maria G. Jafari, Michael Elad, Rémi Gribonval, Mark D. Plumbley

Project-Team Metiss

Research Report No. 7571, March 16, 2011, 24 pages

Abstract: We propose the Audio Inpainting framework that recovers audio intervals distorted due to impairments such as impulsive noise, clipping, and packet loss. In this framework, the distorted samples are treated as missing, and the signal is decomposed into overlapping time-domain frames. The restoration problem is then formulated as an inverse problem per audio frame. Sparse representation modeling is employed per frame, and each inverse problem is solved using the Orthogonal Matching Pursuit algorithm together with a discrete cosine or a Gabor dictionary. The performance of this algorithm is shown to be comparable to or better than state-of-the-art methods when blocks of samples of variable durations are missing. We also demonstrate that the size of the block of missing samples, rather than the overall number of missing samples, is a crucial parameter for high quality signal restoration. We further introduce a constrained Matching Pursuit approach for the special case of audio declipping that exploits the sign pattern of clipped audio samples and their maximal absolute value, as well as allowing the user to specify the maximum amplitude of the signal. This approach is shown to outperform state-of-the-art and commercially available methods for audio declipping.

Key-words: Inpainting, clipping, sparse representation, matching pursuit.

A. Adler and M. Elad are with the Computer Science Department, The Technion, Haifa 32000, Israel. V. Emiya and R. Gribonval are with INRIA, Centre Inria Rennes - Bretagne Atlantique, 35042 Rennes Cedex, France. M. G. Jafari and M. D. Plumbley are with Queen Mary University of London, Centre for Digital Music, Department of Electronic Engineering, London E1 4NS, U.K. (e-mail: maria.jafari@elec.qmul.ac.uk).

This work has been submitted to IEEE Transactions on Audio, Speech and Language Processing. Part of this work has been presented at the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) in 2011 [1].

This work was supported by the EU Framework 7 FET-Open project FP7-ICT-225913-SMALL: Sparse Models, Algorithms and Learning for Large-Scale data.


Summary (Résumé): We introduce the concept of Audio Inpainting for restoring portions of audio data distorted by degradations such as clicks, clipping, or packet loss. In this context, the distorted data are considered missing and the signal is decomposed into frames in the time domain. The restoration problem is formulated as an inverse problem in each frame. Each frame is modeled by a sparse representation, and the inverse problem is solved with the Orthogonal Matching Pursuit algorithm using a discrete cosine or Gabor dictionary. The resulting performance is comparable to the state of the art for blocks of missing samples of variable duration. We also show that restoration quality depends more on the size of the blocks of missing samples than on the total number of missing samples. Finally, we introduce a constrained Matching Pursuit algorithm for the particular case of audio declipping, which exploits the amplitude properties of the saturated samples: sign, and minimum and maximum amplitude. The resulting performance exceeds that of the state of the art and of commercial software for declipping.

Keywords (Mots-clés): Inpainting, clipping, sparse representations, matching pursuit.

Figure 1: Examples of restoration problems related to inpainting. (a) Speech signal corrupted by clicks (circles). (b) Clipped version (black) of a speech signal (gray). (c) The image inpainting problem: recovery of locally-hidden pixels. Panels (a) and (b) plot amplitude (from −1 to 1) against time (0 to 0.03 s).

1 Introduction

Speech and music signals are often subject to localized distortions, where the intervals of distorted samples are surrounded by undistorted samples. Examples include impulsive noise or clicks (see Fig. 1a), clipping (see Fig. 1b), CD scratches, packet loss in cordless phones or Voice over IP (VoIP), and more. In such situations, the distorted samples can be treated as missing. A restoration algorithm is employed to reconstruct the missing samples, in a similar way as for image inpainting (see Fig. 1c). However, in the audio field, such problems have been treated separately and, depending on the context, they have been referred to as audio interpolation [2–6], extrapolation [3,7,8], imputation [9,10], induction [11], (bandwidth) extension [12–15] or concealment [16,17].

Substantial effort has been focused on the restoration of audio signals corrupted by clicks due to old recordings or scratched CDs (see Fig. 1a). In this problem, intervals of corrupted samples, lasting from 20 µs to 4 ms [4], occur at random locations. Typical approaches employ autoregressive (AR) modeling [2,3] or Bayesian estimation to recover the corrupted samples [4]. Other methods utilize neural networks [18] or sinusoidal modeling [5,8]. A related problem is automatic speech recognition in the presence of isolated noisy samples. This problem is treated in [10] with a compressive sensing approach in the spectrogram image domain, and by solving an $\ell_1$-regularized least squares problem.


Another important, though less often addressed, problem is audio clipping [6,7,19], which refers to the truncation of the waveform beyond a threshold when the maximum range in an acquisition system is exceeded (see Fig. 1b). The clipped samples are arranged in groups and their locations are determined by the amplitude of the signal (rather than being randomly spread). The declipping problem is particularly challenging for this reason, and because the information carried by the highest-amplitude samples is completely absent.

Long intervals of samples may be lost during transmission over cordless phones or in VoIP systems, where the problem is addressed using packet loss concealment algorithms [16,17]. Missing interval lengths are in the range of 5 ms to 60 ms, which are close to the typical duration for the pseudo-stationarity of audio signals. The low latency requirement in the VoIP case results in relatively simple algorithms; however, estimating missing packets in peer-to-peer repositories is a new application where higher quality reconstruction can be expected (as the latency requirement is less stringent).

Finally, the unreliable or missing audio data can be time-frequency regions [5,9,11,14,20], in classification applications like automatic speech recognition [9,20] or source separation with time-frequency localized interference; the phrase "audio inpainting" has been used once in this specific case [11]. Bandwidth extension [12–15] is another important time-frequency-domain application, where high frequency content is estimated from the low frequency content in order to provide high quality audio.

In this paper, we present a unified framework for the restoration of distorted audio data, leveraging the concept of Image Inpainting [21–23]. In the proposed framework, termed Audio Inpainting, the distorted data is assumed missing and its location is assumed to be known a priori. We further employ Sparse Representations (SR), which have been demonstrated to faithfully model audio signals [24,25] and to address the image inpainting framework [22,26,27]. The proposed approach is directly based upon those prior works.

The contributions of this paper are four-fold:

a) Audio inpainting is defined as an inverse problem, based upon the concept of image inpainting.

b) A framework for audio inpainting in the time domain is proposed, based on sparse representations. It exploits two possible dictionaries (discrete cosine and Gabor) known to provide accurate sparse models for audio signals.

c) The Orthogonal Matching Pursuit (OMP) algorithm is adapted for audio inpainting, in particular to deal with the properties of the Gabor dictionary.

d) A constrained matching pursuit approach is applied to significantly enhance the performance for audio declipping problems.

This paper is organized as follows. In Section 2, audio inpainting is formalized as an inverse problem. The proposed framework is introduced in Section 3, including the sparse models used for time-domain audio inpainting. The adaptation of the OMP algorithm for audio inpainting in the time domain and for audio declipping is presented in Section 4. Several experiments are presented in Section 5, while we discuss our findings and draw conclusions in Section 6.


2 Audio Inpainting Problem Statement

We define audio inpainting as a general problem encountered in many applications: one observes a partial set of reliable audio data while the remaining unreliable data is either totally missing or highly degraded; the unreliable data is considered missing and it is estimated from the reliable data portion.

The general formulation of audio inpainting is given in Section 2.1, while several particular time-domain cases are detailed in Sections 2.2 and 2.3.

2.1 Formulation of audio inpainting

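We consider a vector $\mathbf{s} \in \mathbb{R}^L$ of audio data and an a priori known partition $\{I^m, I^r\}$ of the support $I \triangleq \{1, 2, \dots, L\}$ of $\mathbf{s}$: $I^m \subset I$ and $I^r \triangleq I \setminus I^m$. We assume that the coefficients $\mathbf{s}(I^m)$ are either missing or masked by a severe distortion. Thus, the observed data $\mathbf{y} \in \mathbb{R}^L$ coincides with $\mathbf{s}$ on $I^r$ only. The audio inpainting problem is defined as the recovery of the coefficients $\mathbf{s}(I^m)$ based on the knowledge of:

1. the reliable data $\mathbf{y}^r \triangleq \mathbf{y}(I^r) = \mathbf{s}(I^r)$,

2. the partition $\{I^m, I^r\}$,

3. additional information about the observed signal,

4. and, optionally, information about the missing data (see e.g. the case of clipping below).

In matrix form, the reliable data $\mathbf{y}^r$ result from the linear model

$$\mathbf{y}^r = \mathbf{M}^r \mathbf{s}, \tag{1}$$

where $\mathbf{M}^r$ is the so-called measurement matrix obtained from the $L \times L$ identity matrix $\mathbf{I}_L$ by selecting the rows $I^r$ associated with the reliable coefficients in $\mathbf{s}$. In a similar way, the missing data to be recovered are $\mathbf{s}(I^m) = \mathbf{M}^m \mathbf{s}$, where $\mathbf{M}^m$ consists of the rows $I^m$ in $\mathbf{I}_L$.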
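As a minimal sketch of eq. (1), the matrices $\mathbf{M}^r$ and $\mathbf{M}^m$ can be formed by keeping rows of the identity matrix. The helper name `selection_matrices`, the toy signal and the 0-based indexing are assumptions made for illustration only.

```python
import numpy as np

def selection_matrices(L, missing_idx):
    """Build M^r and M^m by selecting rows of the L x L identity matrix.

    missing_idx holds 0-based indices of the missing samples (the set I^m);
    the text indexes samples from 1, so this shift is an assumption of the sketch.
    """
    missing = np.zeros(L, dtype=bool)
    missing[np.asarray(missing_idx, dtype=int)] = True
    identity = np.eye(L)
    M_r = identity[~missing]   # rows I^r: reliable samples
    M_m = identity[missing]    # rows I^m: missing samples
    return M_r, M_m

s = np.array([0.1, -0.4, 0.9, 0.3, -0.2])    # toy signal, L = 5
M_r, M_m = selection_matrices(len(s), [2, 3])
y_r = M_r @ s    # reliable data, eq. (1)
s_m = M_m @ s    # samples that inpainting must recover
```

In practice one would more likely index the signal with the boolean mask directly (`s[~missing]`) than store the matrices, since both are just row selections.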

In the general audio inpainting framework, audio data can be either samples in waveforms or coefficients in transforms like time-frequency representations. The problem formulation above can be used for multi-dimensional signals like multichannel waveforms or time-frequency coefficients, by simply reshaping the signal matrix as a vector $\mathbf{s}$.

In the rest of this paper, we only consider the inpainting of missing samples in a single-channel waveform. The multi-dimensional case is discussed in the conclusion (see Sec. 6).

2.2 Inpainting samples distorted by impulsive noise

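In the particular case of a signal corrupted by impulsive noise such as clicks (see Fig. 1a), $I^m$ is a set of integers between 1 and $L$ and must be estimated in a preliminary stage. One often considers that the distorted samples are corrupted by a Gaussian noise $\mathbf{n}$ with high variance. Hence, the complete observed signal includes both the reliable samples $\mathbf{y}^r$ and distorted ones $\mathbf{y}^m$:

$$\begin{cases} \mathbf{y}^r = \mathbf{M}^r \mathbf{s} \\ \mathbf{y}^m = \mathbf{M}^m \mathbf{s} + \mathbf{n}, \end{cases} \tag{2}$$

where the samples $\mathbf{M}^m \mathbf{s}$ in $\mathbf{y}^m$ are masked by $\mathbf{n}$ so that they are considered as unknown.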

2.3 Inpainting intervals of missing samples

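In the case where intervals of samples are missing, due to packet loss during transmission or to masking by audible interferences, $I^m$ is composed of groups of consecutive integers: the samples $\mathbf{s}(I^m)$ are totally missing and one only observes $\mathbf{y}^r = \mathbf{M}^r \mathbf{s}$.

In the case of clipped signals, the samples to be estimated are also arranged in intervals of consecutive samples, as depicted in Fig. 1b. Their locations depend on the amplitude of the signal, such that

$$I^m \triangleq \{ n \mid 1 \le n \le L,\ |s(n)| \ge \theta^{\mathrm{clip}} \}, \tag{3}$$

where $\theta^{\mathrm{clip}}$ is the clipping level. One observes both the unclipped, reliable samples $\mathbf{y}^r$ and the clipped, masked samples $\mathbf{y}^m$:

$$\begin{cases} \mathbf{y}^r = \mathbf{M}^r \mathbf{y} = \mathbf{M}^r \mathbf{s} \\ \mathbf{y}^m = \mathbf{M}^m \mathbf{y} = \mathbf{M}^m \operatorname{sign}(\mathbf{s})\, \theta^{\mathrm{clip}}, \end{cases} \tag{4}$$

where $\operatorname{sign}(\cdot)$ is the element-wise sign function. As presented in the next sections, the information provided by $\mathbf{y}^m$, even though very crude (a sign per sample and the clipping level), still substantially enhances the estimation performance.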
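The clipping model of eqs. (3) and (4) can be illustrated with a short sketch. The function name `clip_and_partition`, the sinusoidal toy signal and the clipping level of 0.6 are assumptions chosen for illustration, not values from the paper.

```python
import numpy as np

def clip_and_partition(s, theta_clip):
    """Hard-clip s at +/- theta_clip and return (y, missing_mask)."""
    y = np.clip(s, -theta_clip, theta_clip)
    missing = np.abs(s) >= theta_clip    # the set I^m of eq. (3), as a mask
    return y, missing

t = np.arange(64)
s = np.sin(2 * np.pi * t / 32)           # toy signal
theta_clip = 0.6
y, missing = clip_and_partition(s, theta_clip)
y_r = y[~missing]                        # reliable samples: equal to s on I^r
y_m = y[missing]                         # clipped samples: sign(s) * theta_clip
```

Since the clean signal is unknown in a real declipping task, the same mask would have to be read off the observation, for example as `np.abs(y) >= theta_clip`, which gives the identical partition in this toy case.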

3 Time-domain framework and models

The proposed framework focuses on time-domain audio inpainting. It relies on frame-based processing, as detailed in Section 3.1, and on the sparse representation modeling of audio signals, as presented in Section 3.2. Two dictionaries used in this modeling are introduced in Section 3.3.

3.1 Frame-based processing and reconstruction

As in many audio processing tasks, the signal is locally processed:

• by segmenting it into frames,

• by independently inpainting each frame,

• and by synthesizing the full restored signal using the overlap-add (OLA) method [28].

We decompose the signal into overlapping frames indexed by $i$, starting at time $t_i$ and weighted by an analysis window $w_a$ with length $N$. By straightforwardly adapting to the local frames the problem statement defined for the full signal in Section 2, the reliable samples in frame $i$ can be written as

$$\mathbf{y}_i^r = \mathbf{M}_i^r \mathbf{s}_i, \tag{5}$$

where $\mathbf{M}_i^r$ is the measurement matrix of the $i$-th frame obtained from $\mathbf{M}^r$ and $s_i(t) \triangleq s(t + t_i)\, w_a(t)$ is the windowed frame defined for $0 \le t \le N-1$. We also define the supports $I_i^r$ and $I_i^m$ of the reliable samples and of the missing or masked samples, respectively. Once the estimation $\hat{\mathbf{s}}_i$ of $\mathbf{s}_i$ by some inpainting algorithm is achieved, the reconstruction of the full signal is obtained as

$$\hat{s}(t) \triangleq \sum_i w_s(t - t_i)\, \hat{s}_i(t - t_i), \tag{6}$$

where $w_s$ is the synthesis window such that $\sum_i w_s(t - t_i)\, w_a(t - t_i) = 1$ for all $t$. In the proposed approaches, we utilized 64 ms frames with 75% overlap, a rectangular window for $w_a$ and a sine window for $w_s$.
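The framing and overlap-add steps of eqs. (5) and (6) can be sketched as follows. The function name `overlap_add`, the toy frame length of 16 samples and the explicit division by the accumulated window product are assumptions of this sketch; the division is one simple way to enforce the normalization condition on $w_s$ and $w_a$ numerically, and is not claimed to be how the authors scale their windows.

```python
import numpy as np

def overlap_add(frames, starts, w_s, w_a, length):
    """Rebuild a signal from processed frames, following eq. (6).

    Dividing by the accumulated product w_s * w_a makes the windows sum to
    one at every covered sample (an assumption of this sketch).
    """
    out = np.zeros(length)
    norm = np.zeros(length)
    for frame, t_i in zip(frames, starts):
        n = len(frame)
        out[t_i:t_i + n] += w_s * frame
        norm[t_i:t_i + n] += w_s * w_a
    covered = norm > 1e-12
    out[covered] /= norm[covered]
    return out

N, hop = 16, 4                                    # 75% overlap
w_a = np.ones(N)                                  # rectangular analysis window
w_s = np.sin(np.pi * (np.arange(N) + 0.5) / N)    # sine synthesis window
s = np.random.default_rng(0).standard_normal(64)
starts = range(0, len(s) - N + 1, hop)
frames = [s[t_i:t_i + N] * w_a for t_i in starts]  # s_i(t) = s(t + t_i) w_a(t)
s_hat = overlap_add(frames, starts, w_s, w_a, len(s))
```

Here the frames are passed through unchanged, so `s_hat` should match `s`; in the inpainting setting each frame would first be replaced by its estimate $\hat{\mathbf{s}}_i$.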

3.2 Sparse Representations modeling of audio frames

In the Sparse Representations (SR) modeling framework [23], it is assumed that each frame is well approximated by a sparse linear combination of the columns of a (possibly overcomplete) dictionary:

$$\mathbf{s}_i \approx \mathbf{D}\mathbf{x}_i, \tag{7}$$

where $\mathbf{D} \in \mathbb{R}^{N \times K_D}$ is the dictionary, $N \le K_D$, and $\mathbf{x}_i \in \mathbb{R}^{K_D \times 1}$ is the representation vector of the $i$-th frame. $\mathbf{x}_i$ is assumed to be sparse, i.e. to have few non-zero coefficients compared to $N$. As a consequence, we can also utilize the SR model for the observed reliable samples in each frame:

$$\mathbf{y}_i^r \triangleq \mathbf{M}_i^r \mathbf{s}_i \approx \mathbf{M}_i^r \mathbf{D}\mathbf{x}_i. \tag{8}$$

We propose to recover the unknown samples $\mathbf{s}_i(I_i^m)$ by estimating as $\hat{\mathbf{x}}_i$ the (sparse) representation vector of each frame, given only the clean observed samples (8) and limited side information (for the clipping case):

$$\hat{\mathbf{s}}_i(I_i^m) = \mathbf{M}_i^m \mathbf{D} \hat{\mathbf{x}}_i. \tag{9}$$

This formulation, including the notion of sparsity, was first introduced for image inpainting [22] with a global treatment using global transforms. Then, efforts were dedicated to working on local patches (similar to audio frames) and to introducing a learned dictionary to improve the inpainting results [26]; these results were further improved [27] by modeling the problem better and by learning the dictionary directly from the corrupted image.

3.3 Dictionaries

We propose two options for choosing a dictionary $\mathbf{D}$ in which audio signals are sparse: the Discrete Cosine Transform dictionary, and a Gabor dictionary. Both are widely used for sparse models of audio signals [24,25,29]. Other fixed dictionaries such as multiscale DCT [30], or learned dictionaries [26] specific to particular inpainting tasks, may also be interesting options.


3.3.1 Discrete Cosine Transform (DCT) dictionary

The first option consists of a windowed Discrete Cosine Transform (DCT) overcomplete dictionary $\mathbf{D}^c = [\mathbf{d}^c_0, \dots, \mathbf{d}^c_{K_c - 1}]$, atom $j$ being defined for $0 \le j \le K_c - 1$ and $0 \le t \le N - 1$ as

$$d^c_j(t) \triangleq w_d(t) \cos\left( \frac{\pi}{K_c} \left(t + \frac{1}{2}\right) \left(j + \frac{1}{2}\right) \right), \tag{10}$$

where $K_c$ is the size of the DCT dictionary (i.e. the number of discrete frequencies) and $w_d$ is a weighting window set by the user. This choice is motivated by the wide use of windowed DCT atoms for sparse representation of audio signals [25]. However, the zero phase of the $\mathbf{D}^c$ atoms is not adapted to audio signals that are made up of sinusoidal components with initial phase distributed between 0 and $2\pi$. As a consequence, the DCT model acts as a basis rather than as a synthesis model and the signals are not really sparse in $\mathbf{D}^c$.
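Eq. (10) translates directly into a matrix whose columns are the atoms. The sketch below is illustrative: the function name `dct_dictionary`, the rectangular default for $w_d$ and the toy sizes are assumptions, since the window is left to the user in the text.

```python
import numpy as np

def dct_dictionary(N, K_c, w_d=None):
    """Return the N x K_c matrix whose column j is the atom d^c_j of eq. (10)."""
    if w_d is None:
        w_d = np.ones(N)                 # rectangular weighting (assumed default)
    t = np.arange(N)[:, None]            # time index, one row per sample
    j = np.arange(K_c)[None, :]          # frequency index, one column per atom
    return w_d[:, None] * np.cos(np.pi / K_c * (t + 0.5) * (j + 0.5))

D = dct_dictionary(N=8, K_c=16)          # overcomplete: twice as many atoms as samples
D_square = dct_dictionary(N=8, K_c=8)    # K_c = N gives the (unnormalized) DCT-IV
```

With $K_c = N$ and a rectangular window the atoms are those of the DCT-IV, so the columns are mutually orthogonal; choosing $K_c > N$ gives the overcomplete case used in the text.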

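3.3.2 Gabor dictionary

The second option aims at sparsely modeling arbitrary-phase sinusoidal components by using a Gabor dictionary $\mathbf{D}^g = \{\mathbf{d}^g_{j,\varphi}\}_{(j,\varphi)\in\Gamma}$ in which the atoms are indexed by a continuous set $\Gamma \triangleq \{0, \dots, K_g - 1\} \times [0, 2\pi)$ and are defined as

$$d^g_{j,\varphi}(t) \triangleq w_d(t) \cos\left( \frac{\pi}{K_g}\left(t+\frac{1}{2}\right)\left(j+\frac{1}{2}\right) + \varphi \right), \tag{11}$$

where $K_g$ is the size of the Gabor dictionary.

Note that in the current case of a continuously-indexed dictionary, eqs. (7), (8) and (9) are still valid if we define

$$\mathbf{D}^g \mathbf{x}_i = \sum_{\substack{(j,\varphi)\in\Gamma \\ x_i(j,\varphi) \neq 0}} \mathbf{d}^g_{j,\varphi}\, x_i(j,\varphi), \tag{12}$$

where $\mathbf{x}_i = \{x_i(j,\varphi)\}_{(j,\varphi)\in\Gamma}$. Indeed, eq. (12) is a finite sum since only a few coefficients in the sparse representation vector $\mathbf{x}_i$ are non-zero. The algorithmic aspects of this decomposition will be addressed in Sections 4.2 and 4.3.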
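One way to see why a continuous phase index need not make the dictionary unmanageable is the identity $\cos(a + \varphi) = \cos\varphi \cos a - \sin\varphi \sin a$: every atom of eq. (11) at frequency index $j$ lies in the span of a cosine atom and a sine atom at that frequency. The sketch below checks this numerically. It illustrates a standard trigonometric fact and is not claimed to be the handling used in Sections 4.2 and 4.3; the function name `gabor_atom` and all numeric values are assumptions.

```python
import numpy as np

def gabor_atom(N, K_g, j, phi, w_d=None):
    """Atom d^g_{j,phi} of eq. (11), sampled at t = 0, ..., N-1."""
    if w_d is None:
        w_d = np.ones(N)                 # rectangular weighting (assumed default)
    t = np.arange(N)
    return w_d * np.cos(np.pi / K_g * (t + 0.5) * (j + 0.5) + phi)

N, K_g, j, phi = 32, 64, 5, 0.7
atom = gabor_atom(N, K_g, j, phi)
cos_atom = gabor_atom(N, K_g, j, 0.0)            # phase 0
sin_atom = gabor_atom(N, K_g, j, -np.pi / 2)     # cos(a - pi/2) = sin(a)
combo = np.cos(phi) * cos_atom - np.sin(phi) * sin_atom
```

The arrays `atom` and `combo` agree up to rounding, so choosing an amplitude and a phase at frequency $j$ amounts to choosing two real coefficients on a cosine/sine pair.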

4 Audio inpainting algorithms based on Orthogonal Matching Pursuit

For a given dictionary $\mathbf{D}$, we use the Orthogonal Matching Pursuit algorithm to perform the inpainting of an audio frame, as presented in Section 4.1. Some dictionary-dependent algorithmic stages are then detailed in Sections 4.2 and 4.3. An extension of the algorithm specific to declipping is finally detailed in Section 4.4.
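To make the overall idea concrete, here is a sketch of a generic, textbook OMP applied to the reliable rows of a dictionary, followed by the synthesis step of eq. (9). It is not the adapted algorithm of Sections 4.1 to 4.4: the function name `omp_inpaint`, the stopping rule (`n_atoms`, `tol`), the column normalization and the toy single-atom frame are all assumptions made for illustration.

```python
import numpy as np

def omp_inpaint(y_r, reliable, D, n_atoms, tol=1e-6):
    """Generic OMP on the reliable rows of D, then synthesis of the full frame."""
    D_r = D[reliable]                             # M^r D: reliable rows only
    norms = np.linalg.norm(D_r, axis=0)
    norms[norms == 0] = 1.0                       # guard against all-zero columns
    y_r = np.asarray(y_r, dtype=float)
    residual = y_r.copy()
    support, coef = [], np.zeros(0)
    for _ in range(n_atoms):
        if np.linalg.norm(residual) <= tol:
            break
        corr = np.abs(D_r.T @ residual) / norms   # normalized correlations
        if support:
            corr[support] = 0.0                   # never reselect an atom
        support.append(int(np.argmax(corr)))
        # Least-squares fit of the reliable samples on the selected atoms
        coef, *_ = np.linalg.lstsq(D_r[:, support], y_r, rcond=None)
        residual = y_r - D_r[:, support] @ coef
    x = np.zeros(D.shape[1])
    if support:
        x[support] = coef
    return D @ x, x                               # estimated frame, sparse code

N, K = 32, 64
t = np.arange(N)[:, None]
j = np.arange(K)[None, :]
D = np.cos(np.pi / K * (t + 0.5) * (j + 0.5))     # DCT dictionary, rectangular w_d
s = 2.0 * D[:, 6]                                 # toy frame made of a single atom
reliable = np.ones(N, dtype=bool)
reliable[12:18] = False                           # six consecutive missing samples
s_hat, x_hat = omp_inpaint(s[reliable], reliable, D, n_atoms=4)
missing_estimate = s_hat[~reliable]               # eq. (9): M^m D x_hat
```

The toy frame is exactly one atom, so the first iteration selects it and the missing samples are filled in from the synthesized frame; real audio frames are only approximately sparse, and the number of atoms or the residual threshold then controls the trade-off between fit and sparsity.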
