2.3 AUDIO EXTRACTION

Just as information extraction from text remains important, so too are there vast audio sources, from radio and broadcast news to lectures and meetings, that require audio information extraction. There has been extensive exploration into information extraction from audio. Investigations have considered not only speech transcription but also non-speech audio and music detection and classification.

Automated speech recognition (ASR) is the recognition of spoken words from an acoustic signal. Figure 2.4 illustrates the best systems each year in competitions administered by NIST to objectively benchmark the performance of speech recognition systems; the graph reports the reduction of word error rate (WER) over time.

As can be seen in Figure 2.4, some automated recognition algorithms (e.g., for read speech or air travel planning kiosks) approach the range of human transcription error.
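WER counts the word-level substitutions, deletions, and insertions needed to turn a system transcript into the reference transcript, normalized by the number of reference words. A minimal illustrative sketch in Python (not the NIST scoring tool, which also handles alignment details such as optional words) is:

def word_error_rate(reference: str, hypothesis: str) -> float:
    # Word-level Levenshtein distance (substitutions + deletions + insertions)
    # normalized by the number of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One dropped word against a six-word reference: WER = 1/6, about 17%.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))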

The systems were assessed on a wide range of increasingly complex and challenging tasks, moving from read speech, to broadcast (e.g., TV and radio) speech, to conversational speech, to foreign language speech (e.g., Mandarin Chinese and Arabic). Over time, tasks have ranged from understanding read Wall Street Journal text, to understanding foreign television broadcasts, to so-called “switchboard” (fixed telephone and cell phone) conversations. Recent tasks have included recognition of speech (including emotionally colored speech) in meeting or conference rooms, lecture rooms, and during coffee breaks, as well as attribution of speakers.


For example, in the NIST Rich Transcription (RT) evaluation effort (Fiscus et al. 2007) using head-mounted microphones, speech transcription from conference data is about 26% WER, lectures about 30%, and coffee breaks 31% WER, rising to 40–50% WER with multiple distant microphones and about 5% higher with single distant microphones. The “who spoke when” test results (called speaker diarization) revealed typical detection error rates (DERs) of around 25% for conferences and 30% for lectures, although one system achieved an 8.5% DER on conferences for both detecting and clustering speakers. Interestingly, at least one site found negligible performance difference between lecture and coffee break recognition.
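For reference, DER sums the durations of missed speech, falsely detected speech, and speaker confusion, normalized by the total reference speech time. A minimal sketch of the formula (the example durations are hypothetical, and the actual NIST scorer also handles overlapping speech and a scoring collar) is:

def diarization_error_rate(missed, false_alarm, speaker_confusion, total_speech):
    # All arguments are durations in seconds of scored speech time.
    return (missed + false_alarm + speaker_confusion) / total_speech

# Hypothetical example: 60 s missed, 40 s false alarm, and 150 s of confused
# speaker labels over 1000 s of reference speech gives a 25% DER.
print(diarization_error_rate(60.0, 40.0, 150.0, 1000.0))  # 0.25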

2.3.1 Audio Extraction: Music

In addition to human speech, another significant body of audio content is music.

The Music Genome Project (Castelluccio 2006) is an effort in which musicians manually classify the content of songs using almost 400 music attributes into an n-dimensional vector characterizing a song’s roots (e.g., jazz, funk, folk, rock, and hip hop), influences, feel (e.g., swing, waltz, shuffle, and reggae), instruments (e.g., piano, guitar, drums, strings, bass, and sax), lyrics (e.g., romantic, sad, angry, joyful, and religious), vocals (e.g., male and female), and voice quality (e.g., breathy, emotional, gritty, and mumbling). Over 700,000 songs by 80,000 artists have been catalogued using these attributes, and the commercial service Pandora (John 2006), which claims over 35 million users and adds over 15,000 analyzed tracks to the Music Genome each month, enables users to define channels by indicating preferred songs.

Figure 2.4. NIST benchmark test history, 1988–2011, word error rate up to 100% (Source: http://www.itl.nist.gov/iad).

Unlike collaborative filtering, which recommends songs enjoyed by others whose tastes resemble yours, this content-based recommender works from the music itself: users set up their own channels (e.g., 80s rock, European jazz, and Indian folk music), and songs are selected by calculating distances between the attribute vectors of candidate songs and by responding to listener feedback.
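A minimal sketch of this kind of content-based matching over attribute vectors follows (the attribute values, catalog, and distance measure below are illustrative assumptions; the actual Music Genome attributes and Pandora’s matching algorithm are proprietary):

import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Each song is an n-dimensional vector of attribute scores (roots, feel,
# instruments, vocals, etc.), here reduced to a toy 4-dimensional example.
catalog = {
    "song_a": [0.9, 0.1, 0.7, 0.2],
    "song_b": [0.8, 0.2, 0.6, 0.3],
    "song_c": [0.1, 0.9, 0.2, 0.8],
}

def recommend(seed, catalog, k=2):
    """Return the k songs whose attribute vectors lie closest to the seed's."""
    seed_vec = catalog[seed]
    others = [(euclidean(seed_vec, v), name)
              for name, v in catalog.items() if name != seed]
    return [name for _, name in sorted(others)[:k]]

print(recommend("song_a", catalog, k=1))  # ['song_b']

In practice, listener feedback (thumbs up or down on played tracks) would then reweight or filter which neighbors the channel draws from.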

While useful to audiophiles, manual music classification suffers from inconsistency, incompleteness, and inaccuracy. Automated music understanding is important for composition, training, and simple enjoyment. Acoustic feature extraction has proven valuable for detecting dominant rhythm, pitch, or melody (Foote 1999).

These capabilities are useful for beat tracking for disc jockeys or for automated pitch tracking for karaoke machines. Humans can predict musical genre (e.g., classical, country, jazz, rock, and disco) from 250 ms samples, and Tzanetakis et al. (2001) report “musical surface” features, representing texture, timbre, instrumentation, and rhythmic structure in 20 ms windows aggregated over a second, to predict musical genre in samples from a diverse music collection of over six hours.

Blum et al.’s (1997) SoundFisher system performs both acoustic and perceptual processing of sounds to enable users, such as sound engineers, to retrieve materials based on acoustic properties (e.g., loudness, pitch, brightness, and harmonicity), on user-annotated perceptual properties (e.g., scratchy, buzzy, laughter, and female speech), or via clusters of acoustic or perceptual features (e.g., bee-like or plane-like sounds). Blum et al. augment this with music analysis (e.g., rhythm, tempo, pitch, duration, and loudness) and instrument identification. The authors demonstrated SoundFisher’s utility by retrieving example laughter, female speech, and oboe recordings from a modest database of 400 short (1–15 second) samples of animals, machines, musical instruments, speech, and nature. One challenge of this early work was the lack of a standardized collection on which to develop and evaluate innovations.
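A rough sketch of this style of short-window feature extraction is shown below; the specific features, window handling, and summary statistics are assumptions for illustration, not Tzanetakis et al.’s exact feature set.

import numpy as np

def surface_features(signal, sample_rate=22050, window_ms=20):
    # Compute simple short-time features (zero-crossing rate, RMS energy,
    # spectral centroid) over 20 ms windows, then summarize them over the
    # analysis span with means and variances as a texture descriptor.
    win = int(sample_rate * window_ms / 1000)
    frames = [signal[i:i + win] for i in range(0, len(signal) - win + 1, win)]
    freqs = np.fft.rfftfreq(win, d=1.0 / sample_rate)
    zcr, rms, centroid = [], [], []
    for f in frames:
        zcr.append(np.mean(np.abs(np.diff(np.sign(f))) > 0))
        rms.append(np.sqrt(np.mean(f ** 2)))
        mag = np.abs(np.fft.rfft(f))
        centroid.append(np.sum(freqs * mag) / (np.sum(mag) + 1e-12))
    feats = np.array([zcr, rms, centroid])
    return np.concatenate([feats.mean(axis=1), feats.var(axis=1)])

# One second of audio (random noise here, purely for illustration) yields a
# 6-dimensional descriptor that a classifier (k-NN, GMM, etc.) could map to
# a genre label.
print(surface_features(np.random.randn(22050)))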

Inaugurated in 2005 and supported by the National Science Foundation and the Andrew W. Mellon Foundation, the Music Information Retrieval Evaluation eXchange (MIREX; http://www.music-ir.org/mirex) is a TREC-like community evaluation with standard collections, tasks, and answers (relevancy assessments or classifications) organized by the International Music Information Retrieval Systems Evaluation Laboratory (IMIRSEL) at the University of Illinois at Urbana-Champaign (UIUC). Because of copyright violation concerns, unlike TREC, data is not distributed to participants; instead, algorithms are submitted to IMIRSEL for evaluation against centralized data consisting of two terabytes of audio representing some 30,000 tracks divided among popular, classical, and Americana subcollections. Unfortunately, the same ground truth data was used for the 2005, 2006, and 2007 evaluations, risking overfitting. Evaluations have taken place at the International Conferences on Music Information Retrieval (ISMIR), with multiple audio tasks, such as tempo and melody extraction, artist and genre identification, query by singing/humming/tapping/example/notation, score following, and music similarity and retrieval. One interesting task introduced in 2007 is music mood/emotion classification, wherein algorithms must classify music into one of five clusters (e.g., passionate/rousing, cheerful/fun, bittersweet/brooding, silly/whimsical, and fiery/volatile).

Approximately 40 teams per year from over a dozen countries compete, most recently performing an overall 122 runs against multiple tasks (Downie 2008). Sixteen of the 19 primary tasks require audio processing; the other three require symbolic processing (e.g., of MIDI formats). Three hundred algorithms have been evaluated since the beginning of MIREX. For example, the recognition of cover songs (i.e., new performances by a distinct artist of an original hit) improved between 2006 and 2007 from 23.1% to 50.1% average precision. Interestingly, the 2007 submissions moved from spectral-based “timbral similarity” toward more musical features, such as tonality, rhythm, and harmonic progressions, overcoming what researchers had perceived as a performance ceiling. Another finding was that particular systems appear to have unique abilities that address specific subsets of queries, suggesting that hybrid approaches combining the best aspects of individual systems could improve performance. A new consortium called the Networked Environment for Music Analysis (NEMA) aims to overcome the challenges of distributing music test collections by creating an open and extensible web service-based resource framework of music data and analytic/evaluative tools to be used globally.

Relatedly, the innovative TagATune evaluation (tagatune.org) compares various algorithms’ abilities to associate tags with 29-second audio clips of songs. The TagATune game is used to collect tags: two players each hear an audio clip, tag it, and then try to determine, by looking only at each other’s tags, whether they heard the same clip. Interestingly, negative tags are also captured (e.g., no piano and no drums). This data annotation method is a promising alternative to the otherwise expensive and time-consuming creation of ground truth data. Whereas humans can tag music at approximately 93% accuracy, in early 2009 four of five systems performed with 60.9–70.1% accuracy (Law et al. 2009).
