
5.4 Examples of Music Mood and Emotion Classification

5.4.1 Mood Classification Using Acoustic Data Analysis

Liu, Lu, and Zhang [50, 51] decomposed their four-class classification problem (see Section 5.3.3.3) into a two-layer binary classification problem, in which the first level separated “Tiredness” from “Energy” and the second level separated, within each of the two classes produced by the first level, “Calm” from “Tense.” The extracted features were subband-divided spectral features, short-term Fourier transform features, and rhythmic features. They applied a transform to make these signals orthogonal and then fed the transformed feature vector into Gaussian Mixture Models, each consisting of four Gaussian distributions. The accuracy was around 85%.
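The hierarchical decision can be sketched roughly as follows, assuming the frame-level feature matrices have already been extracted and transformed. This is a minimal sketch, not the authors' code: scikit-learn's GaussianMixture stands in for the GMMs, and the quadrant names and helper functions are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_gmm(frame_lists, n_components=4):
    """Fit one GMM (four Gaussians, as in Liu et al.) on the pooled frames of a class."""
    return GaussianMixture(n_components=n_components, covariance_type="diag",
                           random_state=0).fit(np.vstack(frame_lists))

def classify_song(frames, level1, level2):
    """Two-layer decision: level1 maps {'Energy','Tiredness'} to GMMs,
    level2 maps each level-1 outcome to {'Calm','Tense'} GMMs."""
    arousal = max(level1, key=lambda k: level1[k].score(frames))      # mean log-likelihood
    valence = max(level2[arousal], key=lambda k: level2[arousal][k].score(frames))
    return f"{valence}-{arousal}"   # e.g., "Calm-Energy"
```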

Li and Ogihara [47] used MFCC and short-term Fourier transform by way of Marsyas [81, 80], as well as statistics (the first three moments plus the mean of absolute value) calculated over subband coefficients obtained by applying the Daubechies Db8 wavelet filter [11] on the monaural signals (see Li and Ogihara [49]). On each axis, the accuracy of binary classification was in the 70% to 80% range for both labelers.
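Feature extraction of this kind can be sketched with librosa and PyWavelets standing in for Marsyas and the original wavelet code; the exact moments, normalization, and parameter values used by Li and Ogihara may differ, and the file path and defaults below are placeholders.

```python
import numpy as np
import librosa
import pywt

def wavelet_mfcc_features(path, wavelet="db8", level=7, n_mfcc=13):
    """Per-subband statistics of a Daubechies Db8 decomposition plus mean MFCCs."""
    y, sr = librosa.load(path, sr=22050, mono=True)          # monaural signal
    stats = []
    for c in pywt.wavedec(y, wavelet, level=level):           # approximation + detail subbands
        stats += [c.mean(), c.var(),                          # first two moments
                  ((c - c.mean()) ** 3).mean(),               # third central moment
                  np.abs(c).mean()]                           # mean of absolute value
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)
    return np.concatenate([np.asarray(stats), mfcc])
```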

In 2007, MIREX first included a task on audio music mood classification, using the five mood clusters mentioned previously, and performance has increased each subsequent year [29]. In 2007, Tzanetakis achieved the highest correct classification (61.5%), using MFCC, spectral shape, centroid, and rolloff features with an SVM classifier [79]. The highest-performing system in 2008, by Peeters, demonstrated some improvement (63.7%) by introducing a much larger feature corpus including MFCCs, Spectral Crest/Spectral Flatness, and a variety of chroma-based measurements [61]. The system uses a Gaussian Mixture Model (GMM) approach to classification, but first employs Inertia Ratio Maximization with Feature Space Projection (IRMFSP) to select the 40 most informative features for each task (in this case mood), and performs linear discriminant analysis for dimensionality reduction. In 2009, Cao and Li submitted a system that was a top performer in several categories, including mood classification (65.7%) [10]. Their system builds a “super vector” of low-level acoustic features and classifies it with a Gaussian Super Vector followed by a Support Vector Machine (GSV-SVM).
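The GSV-SVM idea can be read as: fit a background GMM on all frames, adapt its means toward each song, stack the adapted means into a "super vector," and classify the super vectors with an SVM. The sketch below is a simplified reading of that pipeline, not Cao and Li's implementation; the relevance factor, component count, and helper names are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def supervector(ubm, frames, relevance=16.0):
    """MAP-adapt the background GMM means toward one song's frames and flatten them."""
    post = ubm.predict_proba(frames)                   # (T, K) responsibilities
    n_k = post.sum(axis=0)                             # soft counts per component
    ml_means = post.T @ frames / np.maximum(n_k, 1e-8)[:, None]
    alpha = (n_k / (n_k + relevance))[:, None]         # adaptation coefficients
    return (alpha * ml_means + (1 - alpha) * ubm.means_).ravel()

def train_gsv_svm(song_frames, labels, n_components=32):
    """song_frames: list of (T_i, D) frame matrices; labels: one mood per song."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=0).fit(np.vstack(song_frames))
    X = np.array([supervector(ubm, f) for f in song_frames])
    return ubm, SVC(kernel="linear").fit(X, labels)
```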

5.4.2 Mood Classification Based on Lyrics

Hu, Chen, and Yang [30] designed a method for dividing the sentences in a lyric into emotion groups and then extracting emotion labels from each group, using a large collection of emotion words (see Figure 5.8). The technique was applied to popular music sung in Chinese, but the method should be applicable to other languages.

The first step of the method is to identify a large set of Chinese affect words and then map each collected word to a point in the two-dimensional real space, R². For this purpose, they created a two-level database of approximately 3,000 emotion words. The first level was created by translating into Chinese the ANEW database [8], a large (1,031-word) collection of emotion words, as follows: (a) First, 10 native speakers of Chinese independently translated these words into Chinese. During this process, if a translator found that there was no appropriate Chinese translation (mainly because of cultural differences), the word was removed from the collection. (b) For each remaining word, the translation that had been suggested by the largest number of translators was chosen. The second level was created by adding to this collection approximately 2,000 synonyms found by looking up a Chinese synonym database. Following the word gathering, they annotated all the words in the database with their Chinese parts-of-speech (POS). This resulted in a two-tier emotion word collection consisting of nearly 3,000 words.

Figure 5.8
The process of producing a sentence group emotion map: a word-level emotional map and word-level part-of-speech information yield instance-level and sentence-level maps; a Manhattan-distance-based minimum spanning tree of the sentences is cut into connected components, which are summarized by component-wise averaging.

In the second step, the words in the database were mapped into R². In the ANEW database every word is rated with respect to its Valence, Arousal, and Dominance levels, each on a nine-level scale. Hu, Chen, and Yang used the first two, Valence and Arousal. That is, if a Chinese word w was translated from an English word u in ANEW, or if w is a synonym of a Chinese word w′ translated from an English word u in ANEW, then its map, µ(w), is:

µ(w) = (Valence(u), Arousal(u)).

The range of the map was then extended to the sentence level. Given a sentence s, all the emotion words appearing in s were identified using the POS information attached to them; then, depending on the tense in which the words are used in s and on their modifiers appearing in s, the Valence and Arousal values are scaled (the values may go outside their original range of [−1, 1] because of this scaling). The map of the sentence s is computed by taking the component-wise average of the maps of the emotion words in s.
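In code, this lifting from words to sentences amounts to averaging lexicon points; the minimal sketch below is not the authors' implementation, and the lexicon entries and the single scalar standing in for the tense/modifier scaling are placeholders.

```python
import numpy as np

# Hypothetical emotion lexicon: word -> (Valence, Arousal), i.e., mu(w) inherited
# from the word's ANEW source as described above.
LEXICON = {"joyful": (0.8, 0.5), "serene": (0.6, -0.4)}        # placeholder entries

def sentence_map(tokens, lexicon=LEXICON, scale=1.0):
    """Component-wise average of the emotion-word maps found in one sentence.
    `scale` stands in for the tense- and modifier-dependent scaling."""
    points = [scale * np.asarray(lexicon[t]) for t in tokens if t in lexicon]
    return np.mean(points, axis=0) if points else None          # None: no emotion words
```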

Next, Hu, Chen, and Yang collected nearly 1,000 popular music lyrics sung in Chinese, along with the timing information of the sentences appearing in the lyrics in their corresponding performances. They had seven people label each song into one of the four quadrants of the Thayer map and kept only those for which at least six labelers agreed, leaving 400 songs. Interestingly, the Calm-Tiredness quadrant had only eight lyrics and the Tension-Energy quadrant only 54.

In the next step, using the Manhattan distance on the map, they devised a similarity measure between two sentences. Given a lyric, its sentences can be viewed as the nodes of a complete graph whose edges are weighted by this similarity. By computing the maximum spanning tree of the graph and then cutting the tree into components by eliminating all the edges whose similarity value is below a fixed threshold, a set of sentence groups was obtained (a method suggested by Wu and Yang [88]). Each group was then mapped to a point in R² by computing the weighted average of the maps of its sentences, where the weight of a sentence was calculated based on which level of the two-level database its emotion words came from and how much time was allocated to the sentence in the singing. Then they selected as the representative group of each lyric the one with the highest total weight.
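A rough equivalent of the grouping step can be expressed with SciPy: since a maximum spanning tree under a similarity that decreases with distance coincides with a minimum spanning tree under the Manhattan distance, the sketch below builds the distance MST and cuts its longest edges. The threshold value is illustrative, not the one used in the paper.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def sentence_groups(sentence_points, cut_threshold=0.5):
    """Group sentences by cutting long edges out of a spanning tree over their maps."""
    D = squareform(pdist(np.asarray(sentence_points), metric="cityblock"))  # Manhattan
    tree = minimum_spanning_tree(D).toarray()
    tree[tree > cut_threshold] = 0.0                  # drop edges that are too dissimilar
    adjacency = np.logical_or(tree > 0, tree.T > 0)
    _, labels = connected_components(adjacency, directed=False)
    return labels                                     # group index for each sentence
```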

The above process produced a map of the lyrics to the quadrants of Thayer.

The authors compared this against the human-based labeling using the F-measure:

\[
F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \qquad (5.2)
\]

For this alignment, the Calm-Energy quadrant had an F-measure of 0.70, suggesting that there is strong agreement between the human perception of this emotion in Chinese songs and the representation of this emotion in Chinese lyrics. As for the remaining three, Tension-Energy had an F-measure of 0.44 and the other two had much lower F-measure values.

5.4.3 Mixing Audio and Tag Features for Mood Classification

Bischoff et al. [6] used All Music Guide (allmusic.com) [1] for collecting mood labels. All Music Guide is a Web-based guide book that offers an extensive discography, musicological information, various metadata, and sound samples.

It offers theme and mood labels assigned by experts. Bischoff et al. collected from All Music Guide 73 expert-assigned theme labels and 178 mood labels for nearly 6,000 songs. The 178 mood labels were organized in two ways: using the two-dimensional diagram of Thayer and using the five-cluster groups according to the aforementioned MIREX mood label clusters [29]. Then, from the nearly 6,000 songs, any piece that matched more than one Thayer class was removed, which reduced the collection to 1,192 songs.

For these songs, audio features and tag features were obtained. For the audio features, MP3 audio files at a 192 kbps bit rate were used. The extracted features included MFCC, Tempo, Loudness, Pitch Classes, chroma, and the Spectral moments, for a total of 240 features. The tags were collected from Last.fm [2].

Bischoff et al. built a support vector machine classifier based on audio features and a Naive Bayes classifier based on tags. They also tested various regression methods for the audio features and concluded that support vector machines performed the best. These classifiers were configured to output a real value between 0 and 1 for each mood class. Then, using a mixing ratio parameter β, 0 < β < 1, the two classifiers were linearly combined into a single classifier. Let c be a class and let f_c and g_c respectively be the classifier for c with respect to audio and the classifier for c with respect to tags. Then the output of the mixed classifier, h_c, with β is:

\[
h_c(x) = \beta f_c(x) + (1 - \beta)\, g_c(x), \qquad (5.3)
\]

where x is the input feature vector consisting of the audio and tag features.

Both in the case of the single-feature-set classifiers and in the case of the linearly mixed classifiers, the predicted class assignment is the class c for which the output is the largest; that is,

\[
\operatorname{argmax}_c\, \varphi_c(x), \qquad (5.4)
\]

where φ is one of f, g, and h.
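A minimal sketch of this late fusion, assuming scikit-learn stand-ins for the two classifiers (an SVM with probability outputs for the audio features and a multinomial Naive Bayes for the tag counts); the feature matrices and the β value are placeholders.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB

def train_mixed_classifier(audio_X, tag_X, y, beta=0.3):
    """Combine per-class scores as in Eq. (5.3): beta*f_c(x) + (1-beta)*g_c(x)."""
    f = SVC(probability=True).fit(audio_X, y)        # audio-based classifier
    g = MultinomialNB().fit(tag_X, y)                # tag-based classifier

    def predict(audio_x, tag_x):
        h = beta * f.predict_proba(audio_x) + (1 - beta) * g.predict_proba(tag_x)
        return f.classes_[np.argmax(h, axis=1)]      # argmax_c h_c(x), Eq. (5.4)

    return predict
```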

Both for the Thayer class prediction and for the MIREX mood class prediction, with an appropriately chosen β, the linearly mixed classifier performed better than either single-feature-set classifier. For the MIREX classes, the best accuracy, in terms of F-measure, was 0.572, achieved at β = 0.3. This was better than the reported accuracies of 0.564 for tag-only features and 0.432 for audio-only features. As for the Thayer mood classes, the best accuracy, again in terms of F-measure, was 0.569, achieved at β = 0.8. This was an improvement over the accuracies of 0.539 for tag-only features and 0.515 for audio-only features.

5.4.4 Mixing Audio and Lyrics for Mood Classification

Hu, Downie, and Ehmann [28] mixed audio signals and lyrics for mood classification of Pop and Rock songs. Ground-truth mood-class identification (multiple classes for some songs) was done as follows.

To set up experiments, approximately 9,000 audio recordings were obtained such that their corresponding lyrics were available at Lyricwiki.org and their tags were available at Last.fm [2]. From those, the songs with lyrics of fewer than 100 words were removed.

The collected tags were processed to create a vocabulary of emotion words and their grouping as follows. First, words that were not related to emotion were eliminated from the collection. Subject to elimination were: (a) genre/style names (such as “Dance”), (b) terms that were judgmental (such as “Bad” and “Great”), and (c) terms for which the intention of their taggers was ambiguous (such as “Love,” which could mean either that the tagger loved the song or that the tagger felt the song was about love). Second, using WordNet-Affect [74] as a guide, words having identical origins (e.g., a name and its adjective form) and their synonyms were grouped together. Finally, these groups were examined by human experts for further merging. This produced the vocabulary of emotion words and their groups.
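A rough stand-in for the synonym-based grouping can be written with NLTK's WordNet interface (used here in place of WordNet-Affect, which has no comparably standard Python API): tags whose synonym sets overlap would be merged into one emotion word group.

```python
from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet") once

def synonym_set(tag):
    """All WordNet lemma names sharing a synset with the tag, plus the tag itself."""
    names = {tag.lower()}
    for synset in wn.synsets(tag):
        names.update(lemma.name().replace("_", " ").lower() for lemma in synset.lemmas())
    return names

def should_merge(tag_a, tag_b):
    """Merge two tags into one group if their synonym sets intersect."""
    return bool(synonym_set(tag_a) & synonym_set(tag_b))
```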

After this preprocessing, for each emotion word group the songs whose tags matched a word in the group were collected from the song collection. However, if a tag of a song happened to be contained in its artist name (such as “Blue” in Blue Öyster Cult), that particular occurrence of the tag was ignored. Then, emotion word groups matching fewer than 20 songs were removed. This resulted in 18 emotion word groups, covering 135 tags and 2,829 songs, that in large part correspond to the emotions appearing in the Russell diagram. A couple of tags from each group are presented in Table 5.4. Note that because these words came from free-form tags, they are not necessarily emotion adjectives.

Table 5.4
The 18 Groups by Hu, Downie, and Ehmann (each group head is followed by one example tag from the group, in parentheses)

Calm (Comfort)            Sad (Unhappy)             Happy (Glad)
Romantic                  Upbeat (Gleeful)          Depressed (Blue)
Anger (Fury)              Grief (Heartbreak)        Dreamy
Cheerful (Festive)        Brooding (Contemplative)  Aggressive (Aggression)
Confident (Encouraging)   Angst (Anxiety)           Desire (Hope)
Earnest (Heartfelt)       Pessimism (Cynical)       Excitement (Exciting)

About 43% of the songs were assigned to multiple mood classes, so mood class prediction was naturally treated as a multilabel problem; it was decomposed into a collection of individual binary mood classification problems, one per category.

As to feature extraction, the lyrics were processed for the following:

• Part-of-Speech (POS): A grammatical classification of words into some types (such as nouns, verbs, and pronouns)

• Function Words (FW): Words that occur very frequently in documents and serve grammatical functions but are not directly related to the content (such as “about,” “the,” and “and”)

• Content Words (Content): Words that are not function words; these may or may not be processed by stemming (removing the endings of words to reduce them to their stems, for example, changing “removing” to “remov-”)

As to the audio features, MFCC and spectral features were extracted using Marsyas [81, 80]. The extracted features were fed into support vector machines with a linear kernel. The average prediction accuracy was around 60%. For some categories, the use of both text and audio improved prediction accuracy, but there were categories in which the combined features performed worse than either the text-only predictor or the audio-only predictor.
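A compact sketch of one such per-category classifier, assuming lyrics as raw strings and a precomputed audio feature matrix; scikit-learn's English stop-word list plays the role of the function-word lexicon, which is only an approximation of the feature sets actually used.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.svm import LinearSVC

content_vec = CountVectorizer(stop_words="english")                     # content words
function_vec = CountVectorizer(vocabulary=sorted(ENGLISH_STOP_WORDS))   # function words

def train_category_svm(lyrics, audio_X, y):
    """One linear-kernel SVM for a single mood category, on concatenated
    content-word counts, function-word counts, and audio features."""
    Xc = content_vec.fit_transform(lyrics).toarray()
    Xf = function_vec.fit_transform(lyrics).toarray()
    X = np.hstack([Xc, Xf, audio_X])
    return LinearSVC().fit(X, y)    # y: binary labels for this category
```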

5.4.4.1 Further Exploratory Investigations with More Complex Feature Sets

Hu and Downie [26, 27] looked deeper into the issue of extracting features from lyrics and combining them with audio features to improve the 18-category binary classification problem. They considered the following five types of features:

1. They considered bigrams and trigrams of each of the four types of information mentioned earlier, i.e., Part-of-Speech, Function Words, Content Words without Stemming, and Content Words with Stemming, in addition to their unigrams, which had been used earlier.

2. They considered an extensive set of stylistic features of lyrics, such as the number of lines, the number of words per minute, and the number of interjection words. There were 19 such features. Some of these features had been shown to be effective in artist style classification when used with Bag-of-Words and audio features [48].

3. They used a set of 7,756 affect words constructed from the aforementioned ANEW [8] by adding synonyms using WordNet [16, 54] and by adding words from WordNet-Affect [74]. These additional words were identified with their original ANEW words by way of synonym relations. ANEW provides, in addition to the nine-level scores for Valence, Arousal, and Dominance, the standard deviations of these scores, so there are six values assigned to each word in ANEW and its synonyms. By calculating the average and standard deviation of each of these six values over the affect words appearing in a lyric, Hu and Downie generated a 12-feature vector for each lyric.

4. They considered various Bag-of-Words models of the aforementioned General Inquirer. Each word appearing in General Inquirer is represented as a 183-dimensional binary vector. The vectors associated with all the words of a lyric that appear in General Inquirer can be summarized in four different ways: frequency, TF-IDF, normalized frequency, and boolean value.

5. For ANEW as well as for General Inquirer, each lyric can be viewed as a collection of words appearing in the vocabulary. Then, for each word in the vocabulary, one can compute its frequency, TF-IDF, normalized frequency, and boolean value.

All these possibilities were individually studied to find the best representation. The best individual performance was achieved by Content Words without Stemming when unigrams, bigrams, and trigrams were combined using their occurrences (accuracy: 0.617), and the second best was achieved by Content Words with Stemming when unigrams, bigrams, and trigrams were combined using TF-IDF (accuracy: 0.613). For Function Words, the best use was to combine their unigrams, bigrams, and trigrams based on occurrences (accuracy: 0.594). When Content Words without Stemming were combined with Function Words and with either the Stylistic Information or the TF-IDF of General Inquirer, the accuracy further improved to 0.632 or 0.631, respectively. They then mixed these best feature sets with audio features (Timbral and Spectral) to achieve accuracies of more than 0.670.
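These n-gram representations of content words correspond roughly to a TF-IDF (or raw-count) vectorizer over unigrams through trigrams; a scikit-learn sketch, with the stop-word list and min_df cutoff as illustrative choices rather than the authors' settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Unigrams, bigrams, and trigrams of content words weighted by TF-IDF,
# close in spirit to the second-best lyric representation reported above.
lyric_model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3), stop_words="english", min_df=2),
    LinearSVC(),
)
# lyric_model.fit(train_lyrics, train_labels)
# predictions = lyric_model.predict(test_lyrics)
```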

5.4.5 Exploration of Acoustic Cues Related to Emotions

Eerola, Lartillot, and Toiviainen [13] used the two data sets described in Section 5.3.2 for exploring features that are indicative of the emotions, using their feature extraction toolbox [40]. This is in some sense the acoustic-feature version of the music-structure studies of emotion representation in music mentioned in Section 5.2.6.

Eerola, Lartillot, and Toiviainen extracted a total of 29 features from half-overlapping segments 0.046 seconds in duration. The extracted features pertained to dynamics, timbre, harmony, register, rhythm, and articulation, in part using feature extraction methods from [17, 24, 39, 60, 77].

These features were then analyzed for their usefulness in prediction of emotional labels through regression. Three regression models were examined:

Multiple Linear Regression (MLR) analysis, Principal Component Analysis (PCA), and Partial Least Squares (PLS) regression analysis [87], where five components were used for MLR and PCA and two for PLS. The possibility of applying a power transformation (i.e., raising all the components to the power of some λ) was also considered. The performance of these analysis methods was compared using the R-squared statistic (R²):

\[
R^2 = 1 - \frac{\sum (x - x')^2}{\sum (x - \bar{x})^2}, \qquad (5.5)
\]

where the sums are over all data points, x′ is the predicted value for x, and x̄ is the average of all values.

The best results were obtained with PLS with power transform both in the case of three-dimensional emotional space (Valence: 0.72, Activity: 0.85, and Tension: 0.71) and in the case of five emotional categories (Angry: 0.70, Scary: 0.74, Happy: 0.68, Sad: 0.69, and Tender: 0.58).
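This kind of evaluation can be sketched with scikit-learn: a power transform of the features (Yeo-Johnson here, standing in for the per-feature λ search), two-component PLS regression, and cross-validated R² as in Eq. (5.5). The cross-validation setup is an assumption, not the protocol of the original study.

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import PowerTransformer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score

def pls_r2(X, y, n_components=2):
    """Cross-validated R^2 of power-transformed PLS regression on one emotion dimension."""
    model = make_pipeline(PowerTransformer(), PLSRegression(n_components=n_components))
    predictions = cross_val_predict(model, X, y, cv=5)
    return r2_score(y, predictions)
```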

Multiple linear regression analysis selects the components from the original audio feature representation that are collectively most effective in predicting the values. For “Anger,” the five features selected were Fluctuation peaks, Key clarity, Roughness, Spectral centroid variance, and Tonal novelty; for “Tenderness,” the five features selected were Root mean square variance, Key clarity, Majorness, Spectral centroid, and Tonal novelty.

5.4.6 Prediction of Emotion Model Parameters

Targeting the prediction of affect-arousal coordinates from audio, Yang et al. introduced the use of regression for mapping high-dimensional acoustic features to the Barrett-Russell two-dimensional space [91]. Support vector regression (SVR) and a variety of ensemble boosting algorithms, including AdaBoost.RT [70], were applied to the regression problem, using one ground-truth coordinate label for each of 195 music clips. Features were extracted using publicly available extraction tools such as PsySound [9] and Marsyas [81], totaling 114 feature dimensions. To reduce the data to a tractable number of dimensions, PCA was applied prior to regression. This system achieves an R² (coefficient of determination) score of 0.58 for arousal and 0.28 for valence.
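The regression setup can be sketched as a PCA-then-SVR pipeline, one regressor per dimension, scored by R²; the component count and kernel below are illustrative rather than the values tuned by Yang et al.

```python
from sklearn.decomposition import PCA
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def valence_arousal_r2(X, y_valence, y_arousal, n_components=20):
    """Cross-validated R^2 for separate valence and arousal regressors."""
    scores = {}
    for name, y in (("valence", y_valence), ("arousal", y_arousal)):
        model = make_pipeline(PCA(n_components=n_components), SVR(kernel="rbf"))
        scores[name] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    return scores
```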

Schmidt et al. and Han et al. each began their investigations with a quantized representation of the Barrett-Russell affect-arousal space and employed SVMs for classification [69, 22]. Citing unsatisfactory results (with Schmidt obtaining 50.2% on a four-way classification of affect-arousal quadrants, based upon data collected by the MoodSwings game, and Han obtaining 33% accuracy in an 11-class problem), both research teams moved to regression-based approaches. Han reformulated the problem using regression, mapping the projected results back into the original mood categories and employing SVR and GMM regression methods. Using 11 quantized categories with GMM regression, they obtained a peak performance of 95% correct classification.

Schmidt et al. also approached the problem using both SVR and MLR. Their highest-performing system obtained a 13.7% average error distance in a unit-normalized Barrett-Russell space [69]. Schmidt et al. also introduced the idea of modeling collected human responses in the affect-arousal space as a parameterized stochastic distribution, noting that for most popular music segments the collected points are well represented by a single 2-D Gaussian.
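Estimating those ground-truth distribution parameters from a segment's collected (valence, arousal) ratings reduces to a sample mean and covariance, as in this small sketch; the resulting (µ, Σ) pairs are then the targets that the MLR/PLS/SVR models are trained to predict from audio features.

```python
import numpy as np

def gaussian_label(ratings):
    """Fit a single 2-D Gaussian N(mu, Sigma) to one segment's human ratings.
    `ratings` is an (n_raters, 2) array of (valence, arousal) points."""
    pts = np.asarray(ratings, dtype=float)
    mu = pts.mean(axis=0)                    # 2-vector
    sigma = np.cov(pts, rowvar=False)        # 2x2 covariance matrix
    return mu, sigma
```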

They first perform parameter estimation in order to determine the ground-truth parameters, N(µ, Σ), and then employ MLR, PLS, and SVR to develop parameter prediction models. They have also performed experiments to

