

4.4 Proposed Solutions

This section presents a brief description of several instrument recognition algorithms proposed in the literature. This survey was designed to be as comprehensive as possible, but in an area with such a large number of publications, inevitably some of them will not be included. There are some theses and dissertations dedicated to this subject [19, 42, 61, 69]; however, they are not included here, and the articles that summarize them are presented instead.

In the descriptions of the algorithms presented in the following, the accuracies they achieved are not reported because, for the reasons stated in Section 4.3.4, a direct comparison among the methods is very difficult. Furthermore, the accuracies only make sense together with a complete understanding of the characteristics of each method, which cannot be achieved without reading the original works.

Instrument recognition methods can be grouped according to different criteria. Here, the only criterion used is whether the method deals only with monophonic signals or is also able to deal with polyphonic signals. Other characteristics of the algorithms are summarized in two tables later in this section.

4.4.1 Monophonic Case

This subsection provides a brief description of algorithms proposed to deal with monophonic signals, which include both isolated tones and solo phrases.

Although the monophonic case tends to be simpler than the polyphonic one, the problem is still far from being solved, and many articles on this subject are still being published, as will be seen in the following. The algorithms presented next are ordered chronologically.

The early work by Kaminskyj and Materka [37] tackled the problem of identifying isolated tones of four instruments from very different families. The features consist of 80 short-term root-mean-square (RMS) energy values. The number of features is then reduced to three by means of Principal Component Analysis (PCA). The authors tested two classification schemes, an Artificial Neural Network (ANN) and a Nearest Neighbor classifier (k-NN with k = 1). They remark that the results were surprisingly good given that only temporal features were used.
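
As a minimal sketch of this pipeline: only the 80 RMS values, the reduction to three components, and k = 1 come from [37]; the synthetic tones, the generic labels, and the scikit-learn classes are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

def rms_envelope(signal, n_frames=80):
    """Short-term RMS energy: split the tone into n_frames windows
    and return the RMS value of each one."""
    frames = np.array_split(np.asarray(signal, dtype=float), n_frames)
    return np.array([np.sqrt(np.mean(f ** 2)) for f in frames])

# Placeholder data standing in for isolated tones of four instruments.
rng = np.random.default_rng(0)
tones = [rng.normal(size=8000) * np.hanning(8000) for _ in range(40)]
labels = np.repeat(["inst1", "inst2", "inst3", "inst4"], 10)

X_raw = np.vstack([rms_envelope(t) for t in tones])  # 40 tones x 80 features

pca = PCA(n_components=3)                  # 80 features -> 3 principal components
X = pca.fit_transform(X_raw)

knn = KNeighborsClassifier(n_neighbors=1)  # nearest-neighbor rule (k = 1)
knn.fit(X, labels)
print(knn.predict(pca.transform(rms_envelope(tones[0]).reshape(1, -1))))
```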

Kaminskyj and Voumard [38] proposed a so-called multistage intelligent hybrid classification system. The algorithm extracts seven temporal and spectral features and uses an ANN and a k-NN with k = 1 as classifiers. The article only proposes the system, so no results are presented.

Martin and Kim [62] proposed one of the first algorithms to use a hierarchical structure with multiple decision nodes. The authors justified this choice by arguing that human listeners recognize objects and stimuli taxonomically. In total, 31 features describing temporal and spectral characteristics of the signals were extracted, and Fisher multiple discriminant analysis was then used at each decision node to reduce the number of features, keeping only the most relevant ones. The authors tested both a Maximum a Posteriori (MAP) classifier and a k-NN classifier. The experiments were performed using isolated notes sampled from 14 instruments.

The early work proposed by Brown [8] focused on the discrimination of two very similar instruments (oboe and saxophone). The feature set is composed of 18 cepstral coefficients. The method employs a probabilistic classification scheme based on k-means clustering and on Gaussian Mixture Models (GMMs), which are used to calculate the probability density functions that describe the data. An interesting aspect of this work is that, instead of using samples from standardized databases, the authors used instrument excerpts from real recordings.
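
A hedged sketch of this kind of probabilistic scheme follows, with random vectors standing in for the 18 cepstral coefficients; scikit-learn's GaussianMixture (which is initialized by k-means by default) is used here as a stand-in for the original implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Placeholder 18-dimensional cepstral feature frames for the two classes.
oboe_frames = rng.normal(0.0, 1.0, size=(500, 18))
sax_frames = rng.normal(0.5, 1.2, size=(500, 18))

# One GMM per instrument models the density of its cepstral frames.
models = {
    "oboe": GaussianMixture(n_components=4, random_state=0).fit(oboe_frames),
    "sax": GaussianMixture(n_components=4, random_state=0).fit(sax_frames),
}

def classify(frames):
    """Pick the instrument whose GMM gives the highest average log-likelihood."""
    return max(models, key=lambda name: models[name].score(frames))

print(classify(rng.normal(0.5, 1.2, size=(100, 18))))
```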

Kashino and Murase [40] proposed an algorithm that does not rely on the extraction of features to perform instrument recognition. Instead, it uses an adaptive method for template matching that can cope with variability in musical sounds. The algorithm also includes a musical context integration step, which improves the accuracy by more than 20%. The tests were performed using real musical signals containing three instruments (violin, flute, and piano), but each part of the signal contains only one instrument, so the algorithm works in a monophonic context.

Eggink and Klapuri [17] proposed a method based on the extraction of several temporal and cepstral features. The algorithm employs a hierarchical classification approach and, for each node, it uses either a Gaussian or a k-NN classifier depending on the characteristics of the decision to be made. The algorithm was tested using isolated tones from 30 instruments. The authors obtained better results using a flat classification approach, but they remark that the hierarchical approach can be advantageous in the classification of larger data sets with more instruments.

The work by Brown et al. [9] is very similar to that published by Brown [8]. Here, four very similar wind instruments are considered, and the feature set is composed of cepstral coefficients, bin-to-bin differences of the constant-Q coefficients, autocorrelation coefficients, and moments of the time wave. The classification scheme is similar to that used by Brown [8] and briefly described above. A very thorough study of how the results depend on the features was carried out. This work also used instrument excerpts from real recordings.

The main motivation of the algorithm proposed by Eronen [20] was the assessment of the effectiveness of several features in the task of instrument recognition. The classifier used in the tests is a k-NN, and a total of 29 instruments extracted from several databases were taken into account. The author concluded that features based on warped linear prediction (WLP) and on cepstral coefficients are effective, and that the best results were achieved by augmenting the cepstral coefficients with features describing additional characteristics of the tones.

The method proposed by Kostek and Czyzewski [51] extracts 37 features (14 based on Fourier analysis and 23 based on wavelet analysis) and feeds them to an artificial neural network, which performs the classification. Tests were performed using a database specially built for this method. Although 21 instruments were considered in total, the tests were performed separately considering only groups of four instruments.

Agostini et al. [2] used a total of 27 instruments to test the discrimination capabilities of a number of spectral features found in the literature, and to test the effectiveness of four classifiers: Support Vector Machines (SVM), quadratic discriminant analysis (QDA), canonical discriminant analysis (CDA), and k-NN. They concluded that the most informative features are the mean of the inharmonicity, the mean and standard deviation of the spectral centroid, and the mean of the energy contained in the first partial. They also concluded that SVM and QDA are the best classifiers, but they remark that the closeness of the performances among all classifiers indicates that a proper feature selection is more critical than the choice of a classification system.
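
For illustration, the sketch below computes two of the features reported as most informative, the mean and standard deviation of the spectral centroid. The use of librosa and its bundled example clip is an assumption, and the inharmonicity and first-partial energy are omitted because they require an additional pitch estimate.

```python
import numpy as np
import librosa

# Any monophonic tone works here; librosa ships with a demo trumpet clip.
y, sr = librosa.load(librosa.ex("trumpet"))

centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]  # one value per frame
features = {
    "centroid_mean": float(np.mean(centroid)),  # mean over the whole tone
    "centroid_std": float(np.std(centroid)),    # spread over the whole tone
}
print(features)
```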

In the work of Costantini et al. [12], a number of features are extracted from preprocessed versions of the signals. Preprocessing strategies based on the fast Fourier transform (FFT), the constant-Q frequency transform, and cepstral coefficients are tested. The method uses Min-Max Neuro-Fuzzy Networks as the classification model, which is synthesized using adaptive resolution training techniques. The algorithm was tested on samples from six different instruments.

Eronen [21] uses Mel-frequency cepstral coefficients (MFCCs) and their derivatives as features. These features are transformed to a basis with maximal statistical independence using independent component analysis (ICA). Continuous-density Hidden Markov Models (HMMs), discriminatively trained, were used as the classification system. The algorithm was tested using two groups of data, one containing isolated tones from 27 harmonic instruments, and one containing samples from five percussive instruments.

The algorithm proposed by Piccoli et al. [71] uses the first 18 MFCCs as features. Two different artificial neural networks, a Multilayer Perceptron (MLP) and a Time-Delay Neural Network (TDNN), were tested, with a slight advantage for the latter. The experiments were performed using isolated tones from nine instruments.

Essid et al. [23] proposed an algorithm focused on instrument recognition in real solo phrases. The authors chose to use only features known to be robust, resulting in a feature set containing only MFCCs, their derivatives, and some audio spectrum flatness (ASF) features. Different features were chosen for each possible pair of instruments. The algorithm uses an SVM as classifier, for which different kernels were tested. The method was tested with solo phrases of 10 instruments, all extracted from real recordings.
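
A minimal sketch of the feature side of such a system, assuming librosa for MFCCs and their derivatives and scikit-learn for the SVM; the ASF features, the per-pair feature choice, and the data (a bundled trumpet clip plus shaped noise as a second placeholder class) are simplifications, not the setup of [23].

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_with_deltas(y, sr, n_mfcc=13):
    """Frame-wise MFCCs stacked with their first-order derivatives."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.vstack([mfcc, librosa.feature.delta(mfcc)]).T  # 26-dim per frame

# Two placeholder 'instruments': a real trumpet clip and random noise.
y, sr = librosa.load(librosa.ex("trumpet"))
noise = np.random.default_rng(0).normal(size=len(y)).astype(np.float32)

X = np.vstack([mfcc_with_deltas(y, sr), mfcc_with_deltas(noise, sr)])
labels = np.array(["trumpet"] * (len(X) // 2) + ["noise"] * (len(X) - len(X) // 2))

clf = SVC(kernel="rbf").fit(X, labels)  # different kernels can be compared here
print(clf.predict(X[:3]))
```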

A study presented by the same authors [22] focuses on the use of simple features to perform instrument recognition. The proposed algorithm extracts 47 features, and an SVM is used for classification. Experiments were carried out using solo phrases performed by amateur musicians and sound samples from 10 instruments, all extracted from commercial recordings. The authors conclude that the combination of cepstral coefficients with features describing the spectral shape of the audio signal is very effective in the recognition of instruments belonging to different classes.

The main objective of Kitahara et al. [46] was to develop a method capable of identifying the category (family) of an instrument that was not present in the training data (unregistered). First, the method tries to determine whether a given instrument is registered; if so, the name of the instrument is identified; if not, the category of the instrument is estimated. The method uses 18 features selected from a larger set of 129 elements, and uses a musical instrument hierarchy (MIH) for the category-level identification.

The proposal by Krishna and Sreenivas [53] aims to identify instruments in isolated notes and solo phrases. The features the method uses are linear predictive coefficients called line spectral frequencies (LSFs) [11], which can be seen as characteristic short-term spectral envelopes, but MFCCs and linear prediction cepstral coefficients (LPCCs) are also used for comparison. The choice of LSFs as features was motivated by one of the major objectives of the authors, which was keeping their method scalable. The performances of GMM and k-NN classifiers were tested, with a slight advantage for the GMM. The experiments used isolated tones from 14 instruments, and also some short segments of solo phrases.
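
A compact way to obtain LSFs from LPC coefficients is to take the root angles of the sum and difference polynomials built from the prediction filter. The sketch below assumes librosa for the LPC fit; the clip and the model order are arbitrary choices, not those of [53].

```python
import numpy as np
import librosa

def lsf(a):
    """Line spectral frequencies from LPC coefficients a = [1, a1, ..., ap]."""
    a_ext = np.concatenate([a, [0.0]])
    p_poly = a_ext + a_ext[::-1]   # palindromic 'sum' polynomial P(z)
    q_poly = a_ext - a_ext[::-1]   # antipalindromic 'difference' polynomial Q(z)
    roots = np.concatenate([np.roots(p_poly), np.roots(q_poly)])
    angles = np.angle(roots)
    # Keep one angle per conjugate pair, dropping the trivial roots at 0 and pi.
    return np.sort(angles[(angles > 1e-9) & (angles < np.pi - 1e-9)])

y, sr = librosa.load(librosa.ex("trumpet"), duration=1.0)
print(lsf(librosa.lpc(y, order=12)))   # 12 frequencies in (0, pi) radians
```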

Livshin and Rodet [58] proposed a method to identify seven instruments in solo recordings. They initially extracted 62 features, and then applied the Gradual Descriptor Elimination (GDE) algorithm to reduce the set to 20. Using the reduced feature set resulted in an accuracy 3% worse than using the whole set, but the authors argue that this reduction made it possible for the algorithm to be used in real-time applications. The classification scheme consists of a combination of linear discriminant analysis (LDA) and a k-NN classifier. All tests were performed using excerpts extracted from real recordings, and the authors also performed some tests with duets to show that their method can be useful in the polyphonic case.

An article by Tindale et al. [78] presents one of the few methods that deal specifically with the recognition of drum sounds. A number of temporal features and the energies of four subbands feed an artificial neural network responsible for the final classification. Several experiments were performed using drum samples generated by the authors.

Kaminskyj and Czaszejko [39] proposed an algorithm that uses 710 features selected from a set of 2,804 elements by means of PCA. They tested three types of classification architectures (hierarchical, hybrid, and flat), with k-NN classifiers being used at the decision nodes. The tests were performed using isolated tones from 19 instruments. It was concluded that, although the hierarchical and hybrid structures perform better than the flat one, such a gain is too small in comparison with the added computational effort to justify their use.

The algorithm proposed by Kitahara et al. [47] takes into consideration the pitch dependency of the timbre of musical instruments. The method extracts 129 features from an instrument sound and then reduces the dimensionality of the feature space to 18 dimensions. After that, an F0-dependent mean function and an F0-normalized covariance are calculated. The key idea underlying these two parameters is to represent the features as functions of the fundamental frequency of the instruments. The final classification is given by the Bayes decision rule (BDR). The algorithm was tested using isolated tones of 19 musical instruments.
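
The idea of an F0-dependent mean can be sketched in a few lines: fit each feature as a smooth function of F0 and compute statistics of the residuals. The cubic fit, the single feature dimension, and the synthetic data below are illustrative assumptions, not the exact formulation of [47].

```python
import numpy as np

rng = np.random.default_rng(2)
# Placeholder training data: F0 in Hz and one feature whose mean drifts with pitch.
f0 = rng.uniform(100, 1000, size=300)
feat = 0.002 * f0 + rng.normal(0.0, 0.3, size=300)

# F0-dependent mean: fit the feature as a (here cubic) polynomial of F0.
mean_fn = np.poly1d(np.polyfit(f0, feat, deg=3))

# F0-normalized residuals; their covariance replaces the usual class covariance.
residual = feat - mean_fn(f0)
print(residual.var())   # pitch-independent spread of the feature
```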

An article by Pruysers et al. [72] uses Morlet wavelet analysis and wavelet packet analysis to generate features that, combined with six other features from a previous work, are submitted to a single-stage classifier. In this classifier, each feature has its own k-NN classifier, and the individual results are combined by a so-called k-NN result combiner. The experiments used samples from 19 instruments. The authors came to the conclusion that both proposed wavelet-based features are useful and, additionally, that they complement each other in an effective way.

Benetos et al. [7] used a branch-and-bound search to select a subset of relevant features from a full set composed of 41 features. They also present four classifiers based on non-negative matrix factorization (NMF), the best of which achieves a performance only slightly worse than that achieved using GMM and HMM classifiers. The authors remark that their experiments employed unsupervised classification, in contrast to the supervised GMM and HMM classifiers. The experiments were carried out using samples from six instruments.
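
One common NMF-based classification recipe, which may differ in detail from the four classifiers of [7], learns a set of non-negative bases per instrument and assigns a test excerpt to the class whose bases reconstruct it best. The sketch assumes scikit-learn and random non-negative data.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(3)
# Placeholder non-negative spectra (e.g., magnitude spectrogram frames) per class.
train = {"violin": rng.random((200, 64)), "piano": rng.random((200, 64))}

# Learn a small set of non-negative basis vectors per instrument.
models = {name: NMF(n_components=8, max_iter=500).fit(X)
          for name, X in train.items()}

def classify(spectra):
    """Assign the class whose NMF basis reconstructs the input best."""
    errors = {}
    for name, m in models.items():
        W = m.transform(spectra)            # activations for the fixed basis
        errors[name] = np.linalg.norm(spectra - W @ m.components_)
    return min(errors, key=errors.get)

print(classify(rng.random((50, 64))))
```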

Chetry and Sandler [11] proposed an instrument recognition method whose features consist of line spectral frequencies. Two classification procedures, k-Means and SVM, were investigated using solo phrases of six instruments extracted from commercial recordings. The authors conclude that the SVM performs slightly better, and that better results are achieved if the models are trained using solo phrases that have been recorded in various acoustic conditions.

As in earlier works by the same authors, recognition in the algorithm proposed by Essid et al. [26] is performed over solo phrases from real recordings. The algorithm employs two feature selection techniques: inertia ratio maximization with feature space projection, and genetic algorithms. A selection of the most relevant features is performed separately for each possible pair of instruments. Hence, the algorithm uses a one-versus-one classification strategy: a winning instrument is chosen for each possible pair by means of either an SVM or a GMM, and the final classification is determined according to a majority vote rule. The authors performed a thorough study of several aspects of the algorithm using solo phrases extracted from real recordings.
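
The one-versus-one strategy can be sketched as follows. Note that scikit-learn's SVC already applies this scheme internally for multiclass problems; it is spelled out here only to make the pairwise models and the majority vote explicit, and the per-pair feature subsets of [26] are omitted. The data are random placeholders.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 20))
y = rng.choice(["violin", "flute", "piano"], size=300)

# One binary SVM per instrument pair.
pair_models = {}
for a, b in combinations(np.unique(y), 2):
    mask = (y == a) | (y == b)
    pair_models[(a, b)] = SVC(kernel="rbf").fit(X[mask], y[mask])

def classify(x):
    """Majority vote over all pairwise winners."""
    votes = [m.predict(x.reshape(1, -1))[0] for m in pair_models.values()]
    return max(set(votes), key=votes.count)

print(classify(X[0]))
```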

In another algorithm proposed by Essid et al. [24], a large set of 540 features was considered, and an automatic feature selection was used to fetch the most useful ones. Since this algorithm uses a hierarchical approach, recognition decisions are performed throughout a number of nodes. The authors tested two different hierarchical structures, one following standard instrument families, and the other generated automatically by means of agglomerative hierarchical clustering. They concluded that spreading related instruments over distant nodes may actually improve the recognition accuracy.

Fragoulis et al. [27] tackle the very specific problem of discriminating between piano and guitar notes. Although the two instruments are very dissimilar in terms of construction and way of playing, the timbres they generate are actually quite similar, making this pair one of the most difficult to discern for an instrument recognition algorithm. The authors created three discriminative features and inferred three different empirical classification criteria to perform the classification. They remark that a successful discrimination between piano and guitar is strictly related to the non-tonal spectral content of each note.

Mazarakis et al. [64] proposed an algorithm that uses a Time-Encoded Signal Processing (TESP) method to produce simple matrices from complex sound waveforms. Those matrices are submitted to a so-called Fast Artificial Neural Network (FANN), which performs the instrument recognition. The experiments were carried out using signals generated by five different synthesizers (19 instruments), and also signals extracted from the Iowa database (20 instruments).

As in other studies, Simmermacher et al. [74] first extract a large set of features, which is then reduced using feature selection techniques. Three classification schemes were tested: k-NN, MLP, and SVM. The best results were achieved using the MLP. The experiments included tests using isolated tones from 19 instruments, and tests with solo phrases representing four instruments.

Tan and Sen [77] present a study on the use of the attack transient envelope in the recognition of musical instruments. The classification is based on a pattern matching algorithm called Dynamic Time Warping (DTW). Several experiments were performed with samples from two instruments (cello and violin) and, according to the authors, the results indicate that the attack transient can indeed be useful in instrument recognition.
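
DTW itself is easy to state: it finds the minimum-cost alignment between two sequences via a recurrence over insertions, deletions, and matches. A plain sketch follows, with synthetic envelopes standing in for the attack transients of [77].

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1-D envelopes."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best of: deletion, insertion, match.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Placeholder attack envelopes: a test tone is matched against two templates.
templates = {"cello": np.linspace(0, 1, 40) ** 2,
             "violin": np.sqrt(np.linspace(0, 1, 30))}
test = np.linspace(0, 1, 35) ** 1.8
print(min(templates, key=lambda k: dtw_distance(test, templates[k])))
```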

The method proposed by Ihara et al. [34] is based on the extraction of a large number of features (1,102), on the reduction of that number by applying two dimensionality reduction techniques (PCA and LDA), and on an SVM to perform the classification. The most important claim made by the authors is that the log-power spectrum suffices to represent the characteristics that are essential for instrument recognition. The method was tested using samples from eight instruments collected from commercial recordings.

Deng et al. [13] focus on feature selection instead of presenting a new complete classifier. The authors use machine learning techniques to select and evaluate features extracted using a number of different schemes. The tests used individual note samples from 20 instruments, and also solo phrases of four instruments. The authors found that the best features are the log attack time (LAT), the harmonic deviation, and the standard deviation of the flux. They remarked that there is significant redundancy between and within the feature extraction schemes commonly used, and that further studies will be necessary in order to improve this crucial stage of instrument recognition.

The strategy proposed by Loughran et al. [60] extracts features based on temporal and spectral envelopes, and also on the evolution of the centroid. An MLP artificial neural network is then applied. Only isolated tones of three instruments were considered in the experiments.

The purpose of an article by Joder et al. [36] was to show that midterm temporal properties of the signal, which are usually ignored, can actually carry relevant information that may be useful in several tasks of music information retrieval and data mining. The proposed algorithm has the following steps: a preprocessing stage; a feature extraction stage in which 30 features chosen from the original 162-element set are calculated; an early temporal integration stage in which the information carried by the features is summarized over a longer time scale; a sonic unit segmentation aiming to obtain semantically meaningful segments, which are used as the time frames for the early temporal integration; a normalization step; and a classification/late temporal integration stage, in which the decisions made by the classifier are combined (integrated) in some effective way. The article presents extensive tests using solo phrases from eight instruments. The authors conclude that the best results are obtained by combining early and late integration over sonic units, and by using an SVM with dynamic alignment kernels as the classifier.
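
A toy sketch of early versus late integration is given below, under strong simplifying assumptions: fixed-length segments instead of sonic units, random data, and a plain RBF SVM instead of a dynamic-alignment kernel.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
# Placeholder frame-level features: first half 'guitar', second half 'flute'.
frames = np.vstack([rng.normal(0.0, 1.0, (300, 30)),
                    rng.normal(0.5, 1.0, (300, 30))])

def early_integrate(frames, unit=20):
    """Early integration: summarize each segment by feature means and stds.
    Fixed-length segments stand in for the sonic units of [36]."""
    units = np.array_split(frames, len(frames) // unit)
    return np.array([np.concatenate([u.mean(0), u.std(0)]) for u in units])

X = early_integrate(frames)                     # 30 unit-level vectors
y = np.repeat(["guitar", "flute"], len(X) // 2)
clf = SVC().fit(X, y)                           # unit-level classifier

# Late integration: fuse the unit-level decisions of a phrase by majority vote.
votes = clf.predict(early_integrate(rng.normal(0.5, 1.0, (100, 30)))).tolist()
print(max(set(votes), key=votes.count))
```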

An article by Kramer and Hein [52] is one of the few studies that deal exclusively with the identification of percussive instruments. The algorithm extracts 100 conventional features, and an evolutionary model is applied in order to derive optimal subsets of different sizes. The final instrument identification is performed by an SVM. The experiments were performed using real percussion excerpts contaminated with noise.

Table 4.1 summarizes all methods described in this subsection. The first column contains the first author and the year of the publication; the second
