
ECE 301 Projects Fall 2003


Collection Editors:

Mark Husband Adan Galvan Charlet Reedstrom

Richard Baraniuk

Authors:

Amit Aggarwal Mitali Banerjee Richard Baraniuk

Jason Buck

Venkat Chandrasekaran Pranav Chitkara

Melodie Chu Ian Clark Kyle Clarkson Krzysztof Cyran

Chris Forbis Cosme Garza Elizabeth Gregory

Erlend Hansen Christopher Hunter

Don Johnson Ajay Kalia Chris Martinez Gareth Middleton

Tom Mowad Sivakiran Nagisetty

Arnab Nandi Genaro Picazo

Jason Sedano Chris Sramek

Mark Yeh

Online:

< http://cnx.org/content/col10223/1.5/ >


This selection and arrangement of content as a collection is copyrighted by Richard Baraniuk. It is licensed under the Creative Commons Attribution 1.0 license (http://creativecommons.org/licenses/by/1.0).

Collection structure revised: January 22, 2004 PDF generated: October 25, 2012

For copyright and attribution information for the modules contained in this collection, see p. 225.


1 Content-Based Image Querying with Complex Wavelets
1.1 Content-Based Image Querying with Complex Wavelets
1.2 Content-Based Image Querying with Complex Wavelets: Discrete Wavelet Transform
1.3 Old-School Image Querying
1.4 Content-Based Image Querying with Complex Wavelets: The Complex Discrete Wavelet Transform
1.5 Image Querying with Complex Wavelets: The 2D Discrete Fourier Transform
1.6 The Complex Wavelet Approach
1.7 Image Querying with Complex Wavelets: Results of Experiments
1.8 Image Querying with Complex Wavelets: Summary and Potential Future Work

2 Formant Analysis and Vowel Detection
2.1 Formant Analysis and Vowel Detection
2.2 Background on Formants
2.3 Methods
2.4 Vowel Detection Results
2.5 Conclusions
2.6 The Team

3 Music Classification by Genre
3.1 Music Classification by Genre
3.2 Project Summary: Music Classification by Genre
3.3 Music Classification by Genre: System Diagram
3.4 Introduction to Digital Signal Processing
3.5 Music Classification by Genre: Bandwidth
3.6 Music Classification by Genre: Frequency Cutoff
3.7 Music Classification by Genre: Frequency Smoothness
3.8 Music Classification by Genre: Beat Detection
3.9 Ideal Filters
3.10 Music Classification by Genre: High Pass Filter
3.11 Music Classification by Genre: Power Spectral Density
3.12 Music Classification by Genre: Total Power
3.13 Neural Networks
3.14 Music Classification by Genre: System Performance
3.15 Back propagation mathematics
3.16 Chris Hunter
3.17 Melodie Chu
3.18 Mitali Banerjee
3.19 Jordan Mayo

4 Guitar Distortion: Rocking the Digital World
4.1 Guitar Distortion: Rocking in the Digital World
4.2 Guitar Distortion: Basic Concepts
4.3 The Problems with Distortion
4.4 Approaching Good Distortion

5.1 Time Domain Pitch Correction
5.2 Results
5.3 Proj Intro
5.4 Pitch Detection Algorithms
5.5 Frequency Domain Pitch Correction
5.6 Examples and Code

6 RADAR Simulation in MATLAB
6.1 RADAR: Introduction and Problem
6.2 Background
6.3 Approach for Range
6.4 Range Results
6.5 RADAR: Velocity Analysis
6.6 Results for Velocity
6.7 RADAR Conclusion
6.8 Our Group

7 A Fishy Solution to the Worst Job in Science
7.1 A Fishy Solution To The Worst Job In Science - Introduction
7.2 Intensity Test for Fish Classification
7.3 Length/Width Ratio Test for Fish Classification
7.4 Fin Detection Test for Fish Classification
7.5 Feature Detection Test for Fish Classification
7.6 Miscellaneous Code
7.7 Results of the Fish Classification Project
7.8 Conclusion and Future Improvement of the Fish Classification Project

8 Emotion Detection: How Are You Feeling Today?
8.1 Introduction
8.2 Background
8.3 Approach
8.4 Results
8.5 Problem
8.6 Conclusion

9 Image Compression through Sparse Approximation
9.1 Image Compression through Sparse Approximation
9.2 Background
9.3 Procedure
9.4 Results
9.5 Conclusions
9.6 Code
9.7 Team
9.8 References and Thanks

10 Discrete Multitone (DMT)
10.1 DMT: Introduction
10.2 DMT: Implementation
10.3 DMT: A/D and D/A Conversion
10.4 DMT: Serial/Parallel, Parallel/Serial
10.5 DMT: Constellation Mapping
10.6 DMT: Mirror/IFFT, De-Mirror/FFT
10.7 DMT: Cyclic Prefix
10.8 DMT: The Channel
10.9 DMT: Equalization and Approximation
10.10 DMT: Results and Conclusions
10.11 DMT: Group Members

Bibliography

Index

Attributions


Chapter 1

Content-Based Image Querying with Complex Wavelets

1.1 Content-Based Image Querying with Complex Wavelets

1

1.1.1 Introduction

Thanks to the growth of the World Wide Web over the past decade or so, vast amounts of information are available to anyone with a personal computer, a modem, and an Internet connection. Tasks such as finding a favorite poem have been made easy by search engines like Google. One can simply type in a few lines from the poem, and then it is just a matter of sorting through a few top matches before the entire poem is on the screen.

While searching textual media is fairly trivial, looking for an image that you have seen before can be a huge problem. Suppose you remember seeing an interesting painting, say Leonardo da Vinci's Mona Lisa, while walking through a museum, and you would like to find information on it online. Unless you have a word or phrase associated with the painting, such as "da Vinci" or "Mona Lisa", it would be difficult to find any information about that particular work of art. You might be able to find the painting in some subject-specific database, such as an online art gallery; however, such databases are fairly uncommon for most subjects.

Example 1.1

Mona Lisa

1 This content is available online at <http://cnx.org/content/m11694/1.7/>.

When searching for this work of art, one may not have textual information related to the painting, but one usually does have some information about the image itself; that is, the person has a coarse-scale idea of what the Mona Lisa, for instance, looks like. This information should be useful for finding an actual image of the Mona Lisa, but with current techniques, searches on visual data stop being effective once the database grows to even a small fraction of the number of images on the World Wide Web.

1.1.2 Our Goal

We would like to come up with a scheme that allows a user to search through a large database of images. The user would enter a query image, a low-detail, coarse-scale version of the image he or she would like to find, and the system would return small thumbnails of several matching images for the user to skim. Ideally, such a system would satisfy several properties.

First, our algorithm should be reasonably fast and efficient. This is desirable for any algorithm, but especially so in our case, since such a system would likely be used on a search engine such as Google, where thousands, if not millions, of query images could be entered every minute.

Our algorithm should also be well suited to matching coarse-scale versions of an image to high-detail versions of the same image. Users should be able to sketch a query in a simple drawing application, where adding a lot of detail is not easy. They should also be able to enter images digitized with a scanner, which we assume introduces blurriness and additional noise such as scratches and dust, to the extent that they would find it highly useful to search for a higher-resolution version of the image online.

Ideally, we would also like our algorithm to handle affine transformations such as translation, rotation, and scaling. It is unreasonable to expect a user to draw parts of an image in exactly the same region where they appear in the original. While all three transformations matter for an image querying system, we decided to focus on translation because it seems the most likely type of error for a user to make.

1.1.3 Past Work

We structure our approach after that of Jacobs, Finkelstein, and Salesin, who, while at the University of Washington, published a paper on Fast Multiresolution Image Querying. Their method uses a wavelet basis to decompose images into a low-resolution representation that is highly effective for image matching.

The primary drawback is that their approach is ineffective at detecting shifts of an image, since the separable discrete wavelet basis is not shift-invariant. We therefore propose using the complex discrete wavelet basis, whose magnitude possesses a high degree of shift-invariance. When coupled appropriately with the two-dimensional Discrete Fourier Transform, the two-dimensional Complex Discrete Wavelet Transform allows us to match shifted versions of an image with a significantly higher degree of certainty than does the approach of Jacobs et al.

1.2 Content-Based Image Querying with Complex Wavelets: Discrete Wavelet Transform

2

Over the last couple of decades, wavelets have provided a novel method for analyzing mathematical functions. They have been useful in both pure and applied mathematics (as in harmonic analysis), as well as in electrical engineering. They have turned out to be powerful tools for proving theorems, and they have many interesting properties.

The idea of multiresolution, representing an image at several resolutions, blends naturally with our intuition about the resolution of an image. Because of this multiresolution property, wavelets are also useful for characterizing the structure of an image: the coarse-scale wavelet coefficients contain a lot of information about the image structure. In addition, wavelets provide a sparse basis for most natural images, and hence are useful in image compression.

DWT of Lena

Figure 1.2: This figure illustrates what we mean by multiresolution.

One can define a separable two-dimensional wavelet basis as a series of one-dimensional wavelet transformations along the rows, and then along the columns. Such a basis provides all the advantages of one-dimensional wavelets, but it shares their disadvantage: it does not offer shift-invariance. That is, an image and its shifted version need not have any noticeable correlation in the wavelet domain.
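The lack of shift-invariance can be seen in a very small experiment. The sketch below is only an illustration under stated assumptions: it uses a single level of the one-dimensional Haar transform (the project does not say which wavelet was used), and the function name `haar_dwt_level` is hypothetical. Since the separable 2D basis is built from such 1D transforms along rows and columns, the same behavior carries over.

```python
import math

def haar_dwt_level(signal):
    """One level of the orthonormal Haar DWT: returns (approximation, detail)."""
    s = 1.0 / math.sqrt(2.0)
    approx = [s * (signal[i] + signal[i + 1]) for i in range(0, len(signal), 2)]
    detail = [s * (signal[i] - signal[i + 1]) for i in range(0, len(signal), 2)]
    return approx, detail

# A step edge, and the same edge circularly shifted by one sample.
x = [0, 0, 0, 0, 1, 1, 1, 1]
x_shifted = x[-1:] + x[:-1]          # [1, 0, 0, 0, 0, 1, 1, 1]

_, d = haar_dwt_level(x)
_, d_shifted = haar_dwt_level(x_shifted)

# Energy carried by the detail coefficients before and after the shift.
detail_energy = sum(c * c for c in d)
detail_energy_shifted = sum(c * c for c in d_shifted)
print(detail_energy, detail_energy_shifted)
```

In this toy case the edge falls exactly between coefficient pairs before the shift, so the detail band is empty; after a one-sample shift the same edge produces nonzero detail coefficients. The coefficient pattern changes rather than simply moving, which is the behavior the paragraph above describes.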

1.3 Old-School Image Querying

3

Jacobs et al. propose an algorithm in which sparse signatures for the images in the database are first created. When a user inputs a query image, the signature of the query is computed and compared to the signatures in the database.

3 This content is available online at <http://cnx.org/content/m11695/1.2/>.

The signatures are computed as follows:

1. Compute the discrete wavelet transform of the image.

2. Set all but the highest-magnitude wavelet coefficients to 0.

3. Of the remaining coefficients, quantize the positive coefficients to +1 and the negative ones to -1.
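Steps 2 and 3 above can be sketched in a few lines. This is not the project's sig_gen.m; it is a minimal illustration that assumes the wavelet coefficients are already available as a flat list, and the name `make_signature` and the parameter `keep` are hypothetical.

```python
def make_signature(coeffs, keep):
    """Keep the `keep` largest-magnitude coefficients as +1 / -1; zero the rest."""
    # Indices ordered from largest to smallest magnitude.
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    signature = [0] * len(coeffs)
    for i in order[:keep]:
        if coeffs[i] > 0:
            signature[i] = 1
        elif coeffs[i] < 0:
            signature[i] = -1
    return signature

# Made-up coefficient values, for illustration only.
coeffs = [0.1, -4.0, 2.5, 0.0, -0.3, 3.1]
sig = make_signature(coeffs, keep=3)
print(sig)
```

Only the sign and position of the strongest coefficients survive, which is what makes the signature sparse and cheap to store and compare.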

A Basic Sketch of the Algorithm

Figure 1.3

coefficients worked well for scanned images, while the top 40 coefficients gave the best results for hand-drawn images.

The signatures in our implementation were compared using the generic L1 norm of the difference between signature matrices. Jacobs et al. use a less intuitive "Lq" norm, which weights the coefficients corresponding to different scales differently. This idea certainly carries some merit, but Jacobs et al. do not explain the scheme in much detail, and we do not believe that it would improve the performance of their querying algorithm significantly.
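The L1 comparison described above can be sketched as follows. This is an illustration rather than the project's metric.m: signature matrices are assumed to be equal-sized lists of lists, and the database contents and the names `l1_distance` and `rank_matches` are hypothetical.

```python
def l1_distance(sig_a, sig_b):
    """Sum of absolute element-wise differences between two signature matrices."""
    return sum(abs(a - b)
               for row_a, row_b in zip(sig_a, sig_b)
               for a, b in zip(row_a, row_b))

def rank_matches(query_sig, database):
    """Return database keys sorted from best (smallest distance) to worst match."""
    return sorted(database, key=lambda name: l1_distance(query_sig, database[name]))

# Tiny made-up signatures, for illustration only.
database = {
    "image_a": [[0, 1], [-1, 0]],
    "image_b": [[1, 0], [0, -1]],
}
query = [[0, 1], [-1, 0]]
ranking = rank_matches(query, database)
print(ranking)
```

A scale-weighted metric in the spirit of the one used by Jacobs et al. would multiply each term by a weight that depends on the scale of the coefficient; the unweighted sum here corresponds to the generic L1 norm used in our implementation.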

Our implementation of their algorithm is available on Owlnet at ∼venkatc/elec301/tmproject/code/dwt. The m-file for generating signatures is sig_gen.m, and the metric function for comparing signatures is metric.m.

1.4 Content-Based Image Querying with Complex Wavelets: The Complex Discrete Wavelet Transform

4

Because of its desirable multiresolution properties, the two-dimensional wavelet transform is highly applicable to many areas, most notably image processing. However, its lack of shift-invariance tends to be a major inconvenience, and a transform that provides both multiresolution and shift-invariance would be useful almost everywhere wavelets are used. Complex wavelets are an answer to this problem, and a solid mathematical foundation that allowed practical use of complex wavelets in image processing was originally set up in 1997 by Nick Kingsbury of Cambridge University.

The complex two-dimensional wavelet transform provides all of the advantages of the separable discrete wavelet transform: multiresolution, sparse representation, and useful characterization of the structure of an image. What makes the complex wavelet basis exceptionally useful for our purposes is that it provides a high degree of shift-invariance in its magnitude. A drawback of this transform is that it is four-times redundant. That is, for an original N x N image, the DWT returns N x N numbers, whereas the CDWT returns 4 x N x N numbers. So, for the price of four-times redundancy, one gets a high degree of shift-invariance in magnitude, which seems like a reasonable tradeoff for applications that need a shift-invariant, multiresolution transform.

4 This content is available online at <http://cnx.org/content/m11696/1.3/>.

A 1-dimensional complex wavelet

Figure 1.4: Figure generated using Ivan Selesnick's code.

1.5 Image Querying with Complex Wavelets: The 2D Discrete Fourier Transform

5

The two-dimensional Discrete Fourier Transform is another important transform in image processing. It is taken by applying the one-dimensional transform to each row, and then to each column, as seems to be the common practice for increasing the dimension of transforms in signal and image processing.

The 2D DFT has many properties that are useful in image processing; however, most useful is its shift invariance. The DFT of an image and its shifted version differ only by the multiplication with a complex

High frequency basis function.

Figure 1.5: Real part of a 2D DFT basis function
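For reference, the shift property mentioned in this section can be written out explicitly. The statement below is the standard circular-shift theorem for the DFT of an N x N image; the symbols x, y, X, Y, a, and b are introduced only for this note and do not appear in the original text, and the shift is assumed to wrap around the image borders.

```latex
% y is x circularly shifted by a rows and b columns.
y[m, n] = x[(m - a) \bmod N,\; (n - b) \bmod N]
\quad\Longrightarrow\quad
Y[k, l] = e^{-j 2\pi (k a + l b)/N}\, X[k, l],
\qquad
\lvert Y[k, l] \rvert = \lvert X[k, l] \rvert .
```

The shift only changes the phase of each DFT coefficient, so the magnitudes of the two transforms agree exactly.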

1.6 The Complex Wavelet Approach

6

Our basic approach remains the same as that proposed by Jacobs et al. We compute the signatures of images in the database and compare them to the signature of the query. However, we propose a novel method to compute the signatures:

1. Compute the CDWT of the image and find the magnitudes of the coefficients.

2. Set all but the highest-magnitude coefficients to 0.

3. Set the remaining coefficients to +1. (There is no way to use -1, since magnitudes of complex numbers are never negative.)

4. Compute the two-dimensional DFT of each subband.

6 This content is available online at <http://cnx.org/content/m11697/1.2/>.

Figure 1.6

After step 3, the signature matrices correspond to the major feature points in an image. The +1's characterize the image structure. Note that, due to the high degree of shift-invariance offered by the CDWT, the signature of a shifted image after step 3 will just be a shifted version of the signature of the original image (after step 3). Computing the DFT in each subband then gets rid of these shift effects, since the magnitudes of the DFTs of the two signatures (after step 3) will be the same. In this manner, our proposed algorithm incorporates the multiresolution characteristics of the CDWT in addition to accounting for translations in the query image. We compare signatures by computing the L1 norm of the difference between the signature of an image in the database and that of its query.
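The effect of step 4 can be checked on a toy example. The sketch below is not the project's code: it assumes a small binary "signature" standing in for one CDWT subband after step 3, assumes the shift is circular (a real subband is only approximately shifted), and uses a naive DFT so that it needs nothing beyond the standard library. All function names are hypothetical.

```python
import cmath

def dft2_magnitude(matrix):
    """Magnitude of the 2D DFT of a small square matrix (naive evaluation)."""
    n = len(matrix)
    out = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for l in range(n):
            total = 0j
            for r in range(n):
                for c in range(n):
                    total += matrix[r][c] * cmath.exp(-2j * cmath.pi * (k * r + l * c) / n)
            out[k][l] = abs(total)
    return out

def circular_shift(matrix, dr, dc):
    """Shift a square matrix down by dr rows and right by dc columns, wrapping around."""
    n = len(matrix)
    return [[matrix[(r - dr) % n][(c - dc) % n] for c in range(n)] for r in range(n)]

def l1_distance(a, b):
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Made-up binary signature (the +1 entries mark strong coefficients).
signature = [[1, 0, 0, 0],
             [0, 1, 0, 0],
             [0, 0, 0, 0],
             [0, 1, 0, 0]]
shifted = circular_shift(signature, 1, 2)

raw_distance = l1_distance(signature, shifted)
dft_distance = l1_distance(dft2_magnitude(signature), dft2_magnitude(shifted))
print(raw_distance, dft_distance)
```

Compared directly, the two signatures look entirely different, while their DFT magnitudes agree up to floating-point error. This is why the magnitude of the DFT, rather than the raw signature, is the quantity worth comparing when the query may be translated.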

An implementation of our algorithm is available on Owlnet at ∼venkatc/elec301/tmproject/code/cdwt. The m-file for generating signatures is sig_gen.m, and the metric function for comparing signatures is metric.m.

1.7 Image Querying with Complex Wavelets: Results of Experiments

7

We decided that for testing purposes we would use images from three categories:

1. hand-drawn images

2. blurred images

3. shifted images

From each category, we performed five tests using query images based on images from our database. For every test image we used, except for hand-drawn image 5 (shown below), the expected image was returned in the top three percent of matched images when using the complex wavelet algorithm. Results using the real wavelet algorithm were similar, except on shifted images, where the query images had significantly lower rankings (in the range of the top 15%) than with our algorithm.

Figures 1.7 to 1.16: hand-drawn queries 1 to 5, each followed by the corresponding actual image.

Figures 1.17 to 1.26: blurred queries 1 to 5, each followed by the corresponding actual image.

Figures 1.27 to 1.36: shifted queries 1 to 5, each followed by the corresponding actual image.

1.8 Image Querying with Complex Wavelets: Summary and Potential Future Work

8

1.8.1 Conclusions

Our work shows promising results in finding shifted versions of an image in a reasonably sizeable database, in addition to finding blurred and hand-drawn images. The approach of Jacobs et al. has a significantly lower hit rate for shifted images.

1.8.2 Possible Future Work

While our approach is more accurate for shifted images, it is somewhat less space efficient, and computing signatures takes somewhat longer with our scheme. Therefore, we would like to see a transform that does the work of the 2D CDWT and the DFT together. While this would not greatly affect search time, it would speed up preprocessing by requiring only a single computation per image, so webcrawlers could be twice as effective at finding images and adding them to the database of image signatures. It could also lead to a more natural way of performing image querying.

We would also like to see how well our algorithm could be optimized, though this was not a major goal of our project. For example, what kind of space versus time versus accuracy tradeoff would be made by taking the hundred highest-magnitude coefficients versus the top sixty?

A final aim of potential future research could be to apply this sort of querying scheme, with low-resolution queries, to databases of other kinds of data. Video might be an option, though it is unclear how the query data would be created. Music might be a more reasonable option: the user could hum a favorite song, which could be matched to a MIDI version of the song. The hum would represent a coarse-scale version of the desired song, just as the query image represented a coarse-scale version of the desired image.

8 This content is available online at <http://cnx.org/content/m11698/1.3/>.

Chapter 2

Formant Analysis and Vowel Detection

2.1 Formant Analysis and Vowel Detection

1

2.1.1 Abstract

In the past few years, voice recognition software applied to word processing and other computer tasks has received much media hype. One of the methods employed by this software is the use of formants: harmonics found in vowels, due to resonance in the vocal tract, that are inherent in speech. To limit the number of variables, only five fundamental vowel sounds in one group member's speech were recorded and analyzed to serve as a database for vowel detection. The frequency values of the formants and their magnitudes were logged. Simple words containing these fundamental vowel sounds were then recorded and analyzed to detect the vowels.

The first detection method employed was a Cartesian distance method. The distance between an ideal vowel's formants and the candidate vowel's formants in the word was calculated. The shortest distance, provided it was also below a defined threshold, identified the vowel our input contained. When this proved inaccurate for certain input vowels, a second and ultimately more effective method was implemented. This method systematically eliminated the vowels that the input vowel could not be, by comparing the first and second formants with those in the database. With this method, it is possible for an input vowel to be detected as two or more vowels. However, we found that this could largely be avoided with properly chosen thresholds.
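The Cartesian distance method can be sketched as a nearest-neighbor search in the (F1, F2) plane. The reference formant values and the threshold below are hypothetical placeholders chosen only for illustration; they are not the values measured for the project database, and the names `VOWEL_DB` and `classify_by_distance` are likewise assumptions.

```python
import math

# Hypothetical (F1, F2) reference values in Hz, for illustration only.
VOWEL_DB = {
    "a": (750.0, 1200.0),
    "e": (500.0, 1900.0),
    "i": (300.0, 2300.0),
    "o": (500.0, 900.0),
    "u": (320.0, 800.0),
}

def classify_by_distance(f1, f2, threshold_hz=200.0):
    """Return the vowel whose reference formants are nearest, or None if too far."""
    best_vowel, best_dist = None, float("inf")
    for vowel, (ref1, ref2) in VOWEL_DB.items():
        dist = math.hypot(f1 - ref1, f2 - ref2)   # Euclidean distance in Hz
        if dist < best_dist:
            best_vowel, best_dist = vowel, dist
    # Reject the match when even the closest vowel is beyond the threshold.
    return best_vowel if best_dist < threshold_hz else None

print(classify_by_distance(310.0, 2250.0))
print(classify_by_distance(1500.0, 3500.0))
```

One weakness visible even in this sketch is that F1 and F2 are treated on the same scale, so a large deviation in the second formant can dominate the distance, which may be one reason a distance-only rule misclassifies some inputs.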

2.1.1.1 Motivation

Our motivation for this project was to gain insight into the nature of speech recognition systems. We now realize how difficult it is to build such a system reliably.

2.2 Background on Formants

2

Formants are the resonant frequencies of the vocal tract when vowels are pronounced. Vowels arise from this periodic resonance, whereas consonants are not periodic; they are produced by restricting airflow with the mouth, tongue, and jaw.

Linguists classify each type of speech sound (called phonemes) into different categories. To identify a phoneme, it is often useful to look at its spectrogram or frequency response, where one can find the characteristic formants. Formants appear as large concentrations, or peaks, of energy in the spectrogram of a voiced sample. In other words, a formant is a frequency range in

Although all phonemes have their own formants, vowel formants are usually the easiest to identify. Almost all formants have the trait of waxing and waning in energy in all frequencies, which is caused by the repeated closing and opening of the human vocal tract. On average, this repeated closing and opening occurs at a rate of 125 times per second in an adult male and 250 times per second in an adult female. This rate gives the sensation of pitch (higher frequencies result in higher pitches). Formant values can vary widely from person to person, but the spectrogram reader learns to recognize patterns which are independent of particular frequencies and which identify the various phonemes with a high degree of reliability. For instance, in the vowels, the first formant (F1) can vary from 300 Hz to 1000 Hz. The lower it is, the closer the tongue is to the roof of the mouth. The vowel /i:/ as in the word 'beet' has one of the lowest F1 values, about 300 Hz; in contrast, the vowel /A/ as in the word 'bought' (or 'Bob' in speakers who distinguish the vowels in the two words) has the highest F1 value, about 950 Hz.

Figure 2.1: (a) Vowel 'A' Vocal Model (b) Vowel 'E' Vocal Model

Figure 2.2: (a) Vowel 'I' Vocal Model (b) Vowel 'O' Vocal Model

Figure 2.3: (a) Vowel 'U' Vocal Model (b) Consonant Vocal Model

2.3 Methods

3

Goal: Analyze an input speech sample and return the vowels that are present.

Approach

Vowels are highly periodic, so they have distinctive Fourier representations. That is, their spectra have large values at particular frequencies, in this case at the lower end of the spectrum. By applying Fourier analysis to an input signal, we are able to detect the input vowel sound via matched filters.

Initially, we decided to build a database of the five fundamental vowel sounds using MATLAB. A project member recorded a voice sample of each vowel several times, ran the samples through the autoregressive filter, and then calculated the first two formant frequencies from the frequency response of the vowel. Each voice sample was recorded at 8 kHz, and 256-sample windows were input into the autoregressive model. The purpose of the autoregressive model on each window was to estimate the transfer function of the vocal tract and output the frequency response of each voice sample. After the database was built, the next step was to record several samples of words or phrases and input them into the filter. To filter out the consonants, our program checked the magnitude values of the frequency response of each window. Normally, consonants have significantly lower magnitudes than vowel sounds, so our program used a threshold to filter out only the consonants. Next, we used a type of matched filter to determine which vowel sound the sample corresponded to. We did this by setting up a series of five flags in our program, one for each vowel.

At first, when each window came through, all the flags were set to true. The program then compared the known formant frequencies of each vowel to the voice sample. If the sample did not fall within a threshold of a known vowel's formant frequency, the flag of that vowel was set to false. If multiple flags were still true after comparing the first formant frequency of the voice sample, the program moved on to compare the second formants. After each 256-sample window was processed, we used a smoother to eliminate anomalies (due to unclear pronunciation, noise in the sample, etc.) and then output each vowel. Our final code used to detect vowels.4
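The flag-based elimination and the smoothing step can be sketched as below. This is not the project's formants.m: the reference formants, the tolerances, and the form of the smoother (here, a label is overwritten when both of its neighbors agree with each other but not with it) are all assumptions made for illustration, as are the names `classify_by_elimination` and `smooth_labels`.

```python
# Hypothetical (F1, F2) reference values in Hz, for illustration only.
VOWEL_DB = {
    "a": (750.0, 1200.0),
    "e": (500.0, 1900.0),
    "i": (300.0, 2300.0),
    "o": (500.0, 900.0),
    "u": (320.0, 800.0),
}

def classify_by_elimination(f1, f2, db, f1_tol=80.0, f2_tol=250.0):
    """Start with every vowel flagged True and clear the flags that cannot match."""
    flags = {vowel: True for vowel in db}
    # First pass: eliminate vowels whose first formant is too far away.
    for vowel, (ref1, _) in db.items():
        if abs(f1 - ref1) > f1_tol:
            flags[vowel] = False
    # Second pass, only if the first formant left more than one candidate.
    if sum(flags.values()) > 1:
        for vowel, (_, ref2) in db.items():
            if flags[vowel] and abs(f2 - ref2) > f2_tol:
                flags[vowel] = False
    return sorted(vowel for vowel, still_possible in flags.items() if still_possible)

def smooth_labels(labels):
    """Overwrite an isolated label when its two neighbors agree with each other."""
    out = list(labels)
    for i in range(1, len(labels) - 1):
        if labels[i - 1] == labels[i + 1] != labels[i]:
            out[i] = labels[i - 1]
    return out

print(classify_by_elimination(310.0, 2250.0, VOWEL_DB))
print(classify_by_elimination(500.0, 1000.0, VOWEL_DB))
print(smooth_labels(["a", "a", "C", "a", "o", "o"]))
```

Because the function returns every vowel whose flag survives, it can return two or more candidates or none at all; how tightly the tolerances are set determines how often that happens.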


Figure 2.4: Flowchart of Approach

2.3.1 Auto Regressive Model

In our project, the only data we have for the vocal tract is the windowed chunk of sound produced at a particular time. Assuming a standard impulse input, the autoregressive model takes this chunk and computes a model of the vocal tract at the moment the sound was uttered. The vocal tract can be modeled simply as a series of linked cylindrical tubes, with the formants arising from the transitions between these different tubes. Since the autoregressive model of the vocal tract produces an all-pole transfer function (because we only have the output), ideally we should see peaks at each of the resonant frequencies. These peaks do appear, and they are our formants.
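To make the all-pole idea concrete, the sketch below fits an autoregressive model with the autocorrelation method and the Levinson-Durbin recursion, then reads the resonance off the peak of the model's frequency response. This is an assumed, generic formulation, not the routine used in the project (whose model order and estimation method are not stated here). The test signal is synthetic: the impulse response of a single resonance placed near 1000 Hz, sampled at 8 kHz.

```python
import cmath
import math

def autocorrelation(x, max_lag):
    """Autocorrelation values r[0..max_lag] of a finite sequence."""
    return [sum(x[n] * x[n + k] for n in range(len(x) - k)) for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations; returns A(z) coefficients [1, a1, ..., ap]."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k                  # remaining prediction error
    return a

def peak_frequency(a, fs, n_points=2048):
    """Frequency in Hz where |1/A(e^{jw})| is largest on a grid over [0, fs/2)."""
    best_f, best_mag = 0.0, 0.0
    for i in range(n_points):
        w = math.pi * i / n_points
        denom = sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(a))
        mag = 1.0 / abs(denom)
        if mag > best_mag:
            best_f, best_mag = fs * w / (2.0 * math.pi), mag
    return best_f

# Synthetic all-pole "vocal tract": a pole pair at radius 0.95 near 1000 Hz.
fs = 8000.0
radius, theta = 0.95, 2.0 * math.pi * 1000.0 / fs
true_a = [1.0, -2.0 * radius * math.cos(theta), radius ** 2]

# Impulse response of 1/A(z), i.e. the output for an impulse input.
x = []
for n in range(400):
    y = 1.0 if n == 0 else 0.0
    if n >= 1:
        y -= true_a[1] * x[n - 1]
    if n >= 2:
        y -= true_a[2] * x[n - 2]
    x.append(y)

est_a = levinson_durbin(autocorrelation(x, 2), 2)
formant = peak_frequency(est_a, fs)
print(est_a, formant)
```

For real speech the model order would be higher, so that several resonances can be captured at once, and each window would be tapered before the fit; the first two peaks of the resulting response would then be taken as F1 and F2.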

2.3.2 Hamming Window

The windowing method we used was a Hamming window; a very similar window, the Hanning window, appears in the images below. Unlike a rectangular window, the Hamming window looks roughly like one half-period of a sine wave, tapering toward both ends. This tapering is needed because abruptly truncating a frame produces anomalous behavior (spurious oscillations) in the frequency domain. A Hamming or Hanning window provides a truer representation of the frequency content of the signal.
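For concreteness, the two windows can be generated from their textbook definitions. The sketch below uses the common symmetric forms with coefficients 0.54/0.46 (Hamming) and 0.5/0.5 (Hanning); the function names are hypothetical, and the 256-sample frame length follows the window size mentioned in the Methods section.

```python
import math

def hamming(n_samples):
    """Symmetric Hamming window: 0.54 - 0.46 cos(2 pi n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

def hanning(n_samples):
    """Symmetric Hanning (Hann) window: 0.5 - 0.5 cos(2 pi n / (N - 1))."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

# Apply the window to one 256-sample frame (a constant frame, for illustration).
frame = [1.0] * 256
windowed = [s * w for s, w in zip(frame, hamming(256))]

print([round(w, 2) for w in hamming(5)])
```

The only difference between the two is at the edges: the Hanning window falls all the way to zero, while the Hamming window stops at 0.08, which trades a slightly less smooth edge for lower nearby sidelobes.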

Figure 2.5: The top waveform is a segment 1024 samples long taken from the beginning of the "Rice University" phrase. Computing figure 1 involved creating frames, here demarked by the vertical lines, that were 256 samples long and finding the spectrum of each. If a rectangular window is applied (corresponding to extracting a frame from the signal), oscillations appear in the spectrum (middle of bottom row). Applying a Hanning window gracefully tapers the signal toward frame edges, thereby yielding a more accurate computation of the signal's spectrum at that moment of time. (From Spectrograms5)

Figure 2.6: In comparison with the original speech segment shown in the upper plot, the non-overlapped Hanning windowed version shown below it is very ragged. Clearly, spectral information extracted from the bottom plot could well miss important features present in the original. (From Spectrograms6)

Final code - formants.m7

2.4 Vowel Detection Results

8

Our project produced largely successful results. We achieved flawless output for a variety of two-syllable words that, as a whole, contained all of our database vowels. We were also successful with some three- and four-syllable words.

Result Table

Input               Output                Recording (footnote)
Biblioteca          CiCiCoCeCaC           9
Loteria             CoCeuCiCaC            10
Mexico              CeCiCoC               11
Santiago            CaCiCaCoC             12
Santa Fe            CaCa CeC              13
Cabo                CaCoC                 14
Dime                CiCeC                 15
Tito                CiCoC                 16
Papi                CaCiC
Arturo              CaCuCoC               17
Alejandro           CaoCeCaCoC
Dame una camisa     CaCe CiuCa CaCiCaC    18
Me gusta Rich B     Ce CuCa CiCiC         19

Table 2.1

Note: C represents a string of one or more non-vowels, and a, e, i, o, and u are the vowels detected. Also, "Me gusta Rich B" had to be parsed together.

• 'Biblioteca' and 'Santiago' demonstrate superfluous consonant placement between vowels.

• 'Una' illustrates difficulty in vowel detection because the second formant in the vowel sound was not present.

• 'Loteria' and 'Alejandro' demonstrate the errors caused by 'R' and 'L' respectively.

7 http://cnx.org/content/m11734/latest/formants.txt

8 This content is available online at <http://cnx.org/content/m11751/1.2/>.

9 http://cnx.org/content/m11751/latest/biblioteca.wav

10 http://cnx.org/content/m11751/latest/loteria.wav

11 http://cnx.org/content/m11751/latest/mexico.wav

12 http://cnx.org/content/m11751/latest/santiago.wav

13 http://cnx.org/content/m11751/latest/santafe.wav

14 http://cnx.org/content/m11751/latest/cabo.wav

15 http://cnx.org/content/m11751/latest/dime.wav

16 http://cnx.org/content/m11751/latest/tito.wav

17 http://cnx.org/content/m11751/latest/arturo.wav

18 http://cnx.org/content/m11751/latest/dame.wav

19 http://cnx.org/content/m11751/latest/sillysentence.wav

2.4.1 Problems

• A relatively minor problem we encountered was the placement of consonants at the beginning and end of each word, regardless of whether the beginning or ending sound was a consonant or a vowel. A good example is the word Arturo, which begins and ends with a vowel sound, though our program returns a consonant at the beginning and end. This is because of the dead space that is inherent at the start and end of the file, due to the delay between the recording beginning and the speech sample starting (and similarly at the end). The simplest way to fix this would have been to manually crop the files so that no dead space remained.

• Occasionally our vocal tract model did not sufficiently emphasize the second formant in 'I' at a frequency far enough away from the third for there to be a peak at the frequency value we associated with the second one. As a result, the third formant was sometimes detected as the second. We never got this problem ironed out, and it caused confusion between I's and U's in our filter. A possible method of correcting this would be to apply a differentiator to adjacent frequency values of our frequency response. When the difference levels off or goes negative at a sufficiently high magnitude value, we could add that point as a formant peak. In the image below, one can visually tell that there is likely a peak around 1950 Hz, but there is no expressed peak, so our detection program passed over it.

Figure 2.7: Example of loss of 2nd formant in Vowel 'I'

• […] the vocal tract as L's and R's were being pronounced. As you can tell, they are highly similar to the frequency response of the vocal tract when vowels are being produced. Without drastically changing the focus of our project, the only way to amend this would be to use more intricate threshold values.

(a) (b)

Figure 2.8: (a) Consonant 'L' Vocal Model (b) Consonant 'R' Vocal Model

• Often in a direct transition from vowel to vowel with no consonant between, a consonant value was returned between the two vowels. This can be seen below in the three images showing the transition from the second 'I' in biblioteca to the 'O'. The first image is the 'I', the third is the 'O', and the second is the transition between them. The transitional frequency response is not sufficiently similar to either the 'I' or the 'O', so it gets classified as a consonant. Currently anything that does not match one of our five vowels gets classified as a consonant. A possible means of circumventing this would be to add a transitional character to our database, in this case an 'IO' database character. Alternatively, we could add direct consonant recognition (as a broad class, not specific consonants) and then classify sounds that don't match our database as unknowns, rather than just pooling them with consonants.

(a) (b)

Figure 2.9: (a) Second 'I' in Biblioteca (b) Transition between vowels


Figure 2.10: 'O' in Biblioteca

2.5 Conclusions

20

Through formant analysis, we were able to successfully identify vowels in multi-syllabic words and short sentences. Formant analysis combined with a matched filter method for detecting vowels proved to be effective, although a few problems were encountered. After this project, we better understand the complications inherent in any full speech recognition system.

2.5.1 Improvements

• The most immediately obvious improvement we could make would be to enlarge the vowel database. This would require much more intricate filtering and thresholding, but would ultimately make for a much more robust system.

• Another addition we could make would be support for multiple speakers. We would have to abandon our current method of absolute value thresholding and implement a system that utilized a formant ratio […]

• […] settled on the window size and auto-regressive model we would use in order to increase the focus of our project. It is possible that more effective results could be obtained with a different window size or an optimized auto-regressive model.

• Another obvious improvement would be to boost the sampling rate of the input signal. This would result in significantly higher resolution and allow much more precise tracking of subtle nuances in the dynamic change of formants. It might also allow the beginning of consonant detection (or at least classification), as our research has told us that carefully tracking changes to vowel formant values just before and after a consonant sound reveals certain properties of the consonant.

• Even without the boost in frequency of the sampling rate, we could likely add detection for 'R', 'L', and 'S' consonant sounds. We found all three of these to have strong resonant frequencies ('R' and 'L' as discussed in our problems section; 'S' had defined formants at high frequencies).

2.5.2 References and Acknowledgements

Digital Bubble Bath21 - 1996 Elec 431 Group

Dr. Baraniuk

2.6 The Team

22

• Pranav Chitkara ([email protected]) - Coding, Poster, Connexions Modules

• Chris Forbis ([email protected]) - Coding, Poster, Connexions Modules

• Mark Yeh ([email protected]) - Coding, Poster, Connexions Modules

21http://www.owlnet.rice.edu/elec431/projects96/digitalbb/

22This content is available online at <http://cnx.org/content/m11739/1.3/>.


Chapter 3

Music Classification by Genre

3.1 Music Classification by Genre

1

• Project Summary (Section 3.2)

• System Diagram (Section 3.3)

• Bandwidth (Section 3.5)

• Frequency Cutoff (Section 3.6)

• Frequency Smoothness (Section 3.7)

• Beat Variation (Section 3.8)

• High Pass Filter (Section 3.10)

• Power Spectral Density2

• Total Power (Section 3.12)

• Neural Networks (Section 3.13)

• System Performance (Section 3.14)

• Overall Results3

• Chris Hunter (Section 3.16)

• Melodie Chu (Section 3.17)

• Mitali Banerjee (Section 3.18)

• Jordan Mayo (Section 3.19)


38 CHAPTER 3. MUSIC CLASSIFICATION BY GENRE

3.2 Project Summary: Music Classification by Genre

4

Widespread access to the Internet has popularized digital music. People download large collections of music files sorted into directory structures by artist or genre. For example, a student at Rice University may want to search a library of files stored on the computer of a student in Bremen, Germany for classical music. Language differences and foreign preferences for file naming would make it difficult for the Rice student to determine music genres. A collection of filters that classify music based on DSP analysis tools would allow users to search a collection of files and extract only those that have certain chosen characteristics.

We designed a classification system that analyzes the contents of a .wav music file in order to sort it into specific categories: classical, jazz, country, rap, punk, and techno. In order to classify music samples, we examine characteristics in both the time and frequency domains:

• bandwidth

• beat (tempo) variability

• high pass filtering

• number of FFT coefficients above threshold

• power spectral density

• smoothness in frequency domain

• total power

Then a neural network classifies each song based on its similarity to other songs in various genres. Previous classification projects have directly analyzed song clips in neural networks. However, we take a slightly different approach by providing the neural network with the previously listed DSP characteristics that represent the song. This method proves 84% accurate, having the most difficulty classifying techno music.

4This content is available online at <http://cnx.org/content/m11661/1.3/>.


3.3 Music Classification by Genre: System Diagram

5

Music Classification by Genre System Diagram

Figure 3.1: Music Matcher, a collection of scripts and functions, takes a .wav file input, digitally processes it, and creates an output vector characteristic of the sample. A neural network is trained with 20 songs in each genre. Then it analyzes the new song vectors for patterns and predicts an output classification genre.

Music Matcher takes a .wav file, analyzes it, and outputs a music genre. Our system breaks up a .wav file into twenty 0.5-second windows. From here, the DSP functions are called for each of the twenty windows. Each of these twenty windows is analyzed by seven DSP functions:

• Bandwidth

• Power Spectral Density

• Total Power (L-2 norm / L-infinity norm)

• Spectrogram Smoothness

The values returned from each of these functions are averaged over all twenty windows to give an average value for each song as well as a standard deviation, which tells us how these qualities change over time. That way, our DSP vector has some measure of how each of the functions changed with time.

First, the neural network is trained with 120 songs, 20 of each genre. After we train the neural network, we give it songs it has never seen, and the output of the system is the classification of genre that the neural network determines.
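The windowing and averaging step described above can be sketched as follows. This is a minimal illustration in Python, not the project's own code: the function names, the toy sample rate, and the `peak_amplitude` stand-in feature are all hypothetical, and we assume the twenty windows are consecutive and non-overlapping.

```python
import statistics

def split_into_windows(samples, sample_rate, window_seconds=0.5, num_windows=20):
    """Return up to num_windows consecutive, non-overlapping windows."""
    window_len = int(sample_rate * window_seconds)
    windows = []
    for k in range(num_windows):
        start = k * window_len
        window = samples[start:start + window_len]
        if len(window) < window_len:
            break  # not enough samples left for a full window
        windows.append(window)
    return windows

def summarize_feature(windows, feature):
    """Apply feature() to every window; return (mean, standard deviation)."""
    values = [feature(w) for w in windows]
    return statistics.mean(values), statistics.pstdev(values)

def peak_amplitude(window):
    # Hypothetical stand-in for one of the seven DSP functions.
    return max(abs(x) for x in window)

# Toy signal: 10 seconds at 8 samples per second, growing louder over time.
sample_rate = 8
samples = [(-1) ** n * (n // 4 + 1) for n in range(80)]

windows = split_into_windows(samples, sample_rate)
mean_peak, std_peak = summarize_feature(windows, peak_amplitude)
print(len(windows), mean_peak, std_peak)
```

Each DSP function would contribute one such (mean, standard deviation) pair to the vector handed to the neural network.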

3.4 Introduction to Digital Signal Processing

6

Not only do we have analog signals (signals that are real- or complex-valued functions of a continuous variable such as time or space), we can define digital ones as well. Digital signals are sequences, functions defined only for the integers. We thus use the notation s(n) to denote a discrete-time one-dimensional signal such as a digital music recording and s(m, n) for a discrete-"time" two-dimensional signal like a photo taken with a digital camera. Sequences are fundamentally different from continuous-time signals. For example, continuity has no meaning for sequences.

Despite such fundamental differences, the theory underlying digital signal processing mirrors that for analog signals: Fourier transforms, linear filtering, and linear systems parallel what previous chapters described. These similarities make it easy to understand the definitions and why we need them, but the similarities should not be construed as "analog wannabes." We will discover that digital signal processing is not an approximation to analog processing. We must explicitly worry about the fidelity of converting analog signals into digital ones. The music stored on CDs, the speech sent over digital cellular telephones, and the video carried by digital television all evidence that analog signals can be accurately converted to digital ones and back again.

The key reason why digital signal processing systems have a technological advantage today is the computer: computations, like the Fourier transform, can be performed quickly enough to be calculated as the signal is produced,7 and programmability means that the signal processing system can be easily changed. This flexibility has obvious appeal, and has been widely accepted in the marketplace. Programmability means that we can perform signal processing operations impossible with analog systems (circuits). We will also discover that digital systems enjoy an algorithmic advantage that contributes to rapid processing speeds: Computations can be restructured in non-obvious ways to speed the processing. This flexibility comes at a price, a consequence of how computers work. How do computers perform signal processing?

3.5 Music Classification by Genre: Bandwidth

8

Bandwidth refers to how widely the signal's energy is spread across the spectrum and which frequencies are present. If a signal is composed of many high frequencies, the bandwidth will be large. However, if the signal is composed of mostly low frequencies, the bandwidth will be small. After taking the shifted FFT of windows of the music vector, we find the last frequency component above a certain cutoff threshold, which we take as the bandwidth of the signal. Because classical music is composed of harmonic instruments, its bandwidth will be smaller and it will have fewer frequency components. However, hard music like punk or rap has lots of non-sinusoidal drumbeats, which create more frequency components, so its bandwidth will be larger.
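A minimal sketch of this measurement, under stated assumptions: a direct DFT over the non-negative frequency bins stands in for the shifted FFT, and the threshold is taken as a fraction of the largest magnitude. The names `bandwidth_hz` and `relative_threshold` and the value 0.1 are hypothetical, since the threshold actually used is not given here.

```python
import cmath
import math

def dft_magnitudes(window):
    """Magnitudes of the DFT for bins 0..N/2 (non-negative frequencies)."""
    n = len(window)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(window))
        mags.append(abs(acc))
    return mags

def bandwidth_hz(window, sample_rate, relative_threshold=0.1):
    """Frequency of the last bin whose magnitude reaches the threshold."""
    mags = dft_magnitudes(window)
    cutoff = relative_threshold * max(mags)
    last_bin = max(k for k, m in enumerate(mags) if m >= cutoff)
    return last_bin * sample_rate / len(window)

# Toy signals: one second sampled at 64 Hz.
sample_rate = 64
narrow = [math.sin(2 * math.pi * 4 * i / sample_rate) for i in range(64)]
wide = [math.sin(2 * math.pi * 4 * i / sample_rate)
        + 0.5 * math.sin(2 * math.pi * 10 * i / sample_rate)
        for i in range(64)]

print(bandwidth_hz(narrow, sample_rate), bandwidth_hz(wide, sample_rate))
```

Adding a higher-frequency component raises the reported bandwidth, which is the behavior the paragraph above relies on.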

6This content is available online at <http://cnx.org/content/m10781/2.3/>.

7Taking a systems viewpoint for the moment, a system that produces its output as rapidly as the input arises is said to be a real-time system. All analog systems operate in real time; digital ones that depend on a computer to perform system computations may or may not work in real time. Clearly, we need real-time signal processing systems. Only recently have computers become fast enough to meet real-time requirements while performing non-trivial signal processing.

8This content is available online at <http://cnx.org/content/m11672/1.3/>.


Figure 3.2: Bandwidth for classical music.


Figure 3.3: Bandwidth for jazz music.

3.5.1 Results

The bandwidth of songs in the frequency domain is very good at distinguishing jazz. However, the bandwidths of punk, techno, and country all hover around the same value. Rap has most of its power in low frequencies, and those coefficients will be large. Therefore, the bandwidth will be small because the frequency components are concentrated in the low frequencies. Jazz, however, has high and low frequencies, so there could be a large frequency component in the high frequencies, increasing the bandwidth.

We also give the neural network a measure of how bandwidth changes over the time period of a song. The standard deviations are good at telling jazz and country apart from the other genres, but no one genre stands out.


Figure 3.4: Overall, bandwidth is a good detector for jazz and rap, but poorer in distinguishing between classical, punk, techno, and country, which all have about the same bandwidth.


Figure 3.5: Variation of bandwidth across time.

3.6 Music Classification by Genre: Frequency Cutoff

9

Like most of our other filters, the frequency cutoff had mixed results that varied with the genre in question. Some samples it is readily able to identify, while others it finds quite difficult to pinpoint directly. For instance, if you fed the filter a sample of classical and a sample of techno, it would have no problem telling you the difference between them. This is because techno has a majority of its energy concentrated at only a few frequencies while classical has its power spread more evenly over a wider band. On the other hand, if you were to input samples of punk and country, the filter might tell you that The Ramones sound like Hank Williams. Looking at these results, though, is not the whole story. A more telling relationship emerges when the standard deviations of these outputs are analyzed. It becomes difficult to isolate any one genre, but the standard deviations do separate the genres into two main categories:

1. Classical, Punk and Country

2. Techno, Jazz, and Rap

Group one consists of the genres that retained only 40-50 coefficients above the threshold, while the genres of group two consistently preserved at least 90 coefficients per sample. This wide gap between them should paint a fairly clear picture of the differences between genres with respect to their cutoff frequencies. This alone isn't very helpful, but when used in conjunction with other filters, this could prove to be the first step in a very powerful tool to help classify music.

9This content is available online at <http://cnx.org/content/m11684/1.1/>.

Figure 3.6
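Counting the coefficients that survive a threshold can be sketched as below. This is only an illustration: the function name, the relative threshold of 0.25, and the two toy magnitude vectors are hypothetical and are not taken from the project's data.

```python
def count_above_threshold(magnitudes, relative_threshold=0.25):
    """Number of spectral magnitudes above a fraction of the largest one."""
    if not magnitudes:
        return 0
    cutoff = relative_threshold * max(magnitudes)
    return sum(1 for m in magnitudes if m > cutoff)

# Toy spectra: energy in a few bins versus energy spread over many bins.
peaky = [0.1, 8.0, 0.2, 0.1, 7.5, 0.1, 0.3, 0.1]
spread = [3.0, 4.0, 3.5, 2.8, 3.9, 3.1, 2.9, 3.3]

print(count_above_threshold(peaky), count_above_threshold(spread))
```

A spectrum dominated by a few strong bins keeps few coefficients, while a spectrum with evenly spread power keeps many.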

3.7 Music Classification by Genre: Frequency Smoothness

10

A spectrogram is a tool that belongs to a set of tools called time-frequency representations. Music, on a CD, is a time-vector. Performing an FFT of this time-vector would give us its frequency content. However, a single FFT would lose all time information since it gives us the frequency content of the time-vector as a whole. We need something like an instantaneous frequency response so we have both frequency and time information. A spectrogram essentially breaks a signal up into many different time-vectors and performs FFTs of each. These FFTs are then placed as columns in the spectrogram. In the end, we have a time-frequency representation of our music.

10This content is available online at <http://cnx.org/content/m11671/1.2/>.


Figure 3.7


Figure 3.8

These are spectrograms of a techno song and a classical song. freqsmooth.m quantifies the differences seen in these spectrograms. To do this, freqsmooth calculates the variance in the indices of the max values of each column. In other words, a song with a clear, loud melody will show small variance in these indices while a song with a harder-to-identify melody will show a large variance.
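The idea can be sketched in Python as follows. This is not a port of freqsmooth.m, whose source is not shown here; it assumes non-overlapping frames, a direct DFT over non-negative frequencies, and the population variance, and the names `spectrogram` and `peak_index_variance` are hypothetical.

```python
import cmath
import math
import statistics

def dft_magnitudes(frame):
    """Magnitudes of the DFT for bins 0..N/2."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def spectrogram(samples, frame_len):
    """One column of DFT magnitudes per non-overlapping frame."""
    return [dft_magnitudes(samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def peak_index_variance(samples, frame_len):
    """Variance of the index of the largest magnitude in each column."""
    columns = spectrogram(samples, frame_len)
    peaks = [max(range(len(col)), key=col.__getitem__) for col in columns]
    return statistics.pvariance(peaks)

def tone(bin_index, frame_len):
    """One frame of a sinusoid that falls exactly on a DFT bin."""
    return [math.sin(2 * math.pi * bin_index * i / frame_len)
            for i in range(frame_len)]

frame_len = 32
steady = tone(5, frame_len) * 4  # the same dominant bin in every frame
wandering = (tone(3, frame_len) + tone(12, frame_len)
             + tone(6, frame_len) + tone(9, frame_len))

print(peak_index_variance(steady, frame_len),
      peak_index_variance(wandering, frame_len))
```

The steady tone gives zero variance because its dominant bin never moves, while the wandering tone gives a large variance.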

3.7.1 Results

While freqsmooth does give a different value for each genre, it also gives a radically different value for songs within a given genre. In other words, it does not give a good representation of a genre as a whole. Given the plus and minus standard deviation bars, each genre overlaps heavily.


Figure 3.9

3.8 Music Classification by Genre: Beat Detection

11

Beat detection emphasizes the sudden impulses of sound in the song and then finds the fundamental period at which these impulses appear. It convolves a signal with itself and finds frequency peaks. Then it measures the distance between these frequency peaks. This is done by breaking the signal into frequency bands, extracting the envelope of these frequency-banded signals, differentiating them to emphasize sudden changes in sound, and running the signals through a filter to choose the highest energy result as the tempo. Variation in tempo, found by detecting beat in different windows of the song, helps determine musical genres.

The filter can only separate rap from all other genres effectively, because rap has the steadiest backbeat, consistent across the genre! Classical and jazz have too much variability, which makes sense, considering that each piece is often long and divided into sections.
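A simplified sketch of this pipeline is given below. It is an illustration under loud assumptions, not the project's beat detector: the split into frequency bands is omitted, the envelope is a short moving average of the rectified signal, and the final filter is replaced by an autocorrelation that picks the lag with the highest energy. All names and the 50 Hz rate are hypothetical.

```python
def onset_strength(samples, smooth=4):
    """Envelope (rectify + moving average), then keep only the rises."""
    rectified = [abs(x) for x in samples]
    envelope = []
    for i in range(len(rectified)):
        lo = max(0, i - smooth + 1)
        envelope.append(sum(rectified[lo:i + 1]) / (i + 1 - lo))
    # Differentiate and half-wave rectify to emphasize sudden increases.
    return [max(0.0, envelope[i] - envelope[i - 1])
            for i in range(1, len(envelope))]

def beat_period(samples, min_lag, max_lag):
    """Lag (in samples) at which the onset signal best matches itself."""
    onsets = onset_strength(samples)

    def energy(lag):
        return sum(onsets[i] * onsets[i - lag]
                   for i in range(lag, len(onsets)))

    return max(range(min_lag, max_lag + 1), key=energy)

# Toy signal: a click every 25 samples, at an assumed rate of 50 Hz.
rate = 50
clicks = [1.0 if n % 25 == 0 else 0.0 for n in range(200)]

period = beat_period(clicks, 10, 100)
tempo_bpm = 60 * rate / period
print(period, tempo_bpm)
```

Running the same estimate on different windows of a song and comparing the results would give the tempo variation used as a feature.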

11This content is available online at <http://cnx.org/content/m11685/1.2/>.


Figure 3.10


Figure 3.11

3.9 Ideal Filters

12

There are four fundamental filters:

• Lowpass: blocks high frequencies, allowing low frequencies through

• Highpass: blocks low frequencies, allowing high frequencies through

• Bandpass: blocks all frequencies except those within a certain range

• Bandstop: blocks only the frequencies within a certain range, allowing all others to pass through
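The four ideal responses can be written as gain functions of frequency. The sketch below is only illustrative: the function names and cutoff values are hypothetical, and whether the band edges themselves pass is an arbitrary choice here.

```python
def lowpass(freq, cutoff):
    """Gain 1 at or below the cutoff, 0 above it."""
    return 1.0 if abs(freq) <= cutoff else 0.0

def highpass(freq, cutoff):
    """Gain 1 at or above the cutoff, 0 below it."""
    return 1.0 if abs(freq) >= cutoff else 0.0

def bandpass(freq, low, high):
    """Gain 1 inside [low, high], 0 elsewhere."""
    return 1.0 if low <= abs(freq) <= high else 0.0

def bandstop(freq, low, high):
    """Gain 0 inside [low, high], 1 elsewhere."""
    return 1.0 - bandpass(freq, low, high)

for f in (100, 1000, 5000):
    print(f, lowpass(f, 500), highpass(f, 500),
          bandpass(f, 500, 2000), bandstop(f, 500, 2000))
```

Multiplying a signal's spectrum by one of these gains, bin by bin, applies the corresponding ideal filter.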

12This content is available online at <http://cnx.org/content/m10103/2.7/>.


(a) (b)

(c) (d)

Figure 3.12: Ideal frequency domain representations of the four fundamental filters. (a) Lowpass (b) Highpass (c) Bandpass (d) Bandstop

Another term one may come across in the study of filters is an "allpass" filter. This is one that allows all frequencies through. The only meaningful effect an allpass filter can have is on the phase of the signal.

3.10 Music Classification by Genre: High Pass Filter

13

Like most of the filters run on our music samples, the high pass filter does a good job of identifying some genres, while it has difficulty with others. As one would expect, classical had the smallest error of any genre tested. This should be intuitive since it uses the lower frequency part of the spectrum. One can think of classical music as being very fluid with few sudden changes in frequency. Conversely, punk and jazz had the highest amount of error, which is a good indication of higher frequencies being utilized. Compared to classical music, these genres are much less fluid and often exhibit rapid changes in tempo. Somewhere between these two extremes are techno, rap, and country. The filter has an especially tough time telling the difference between the latter two. Who would have ever thought Garth Brooks and Tupac might get confused with one another? Overall, while the filter cannot explicitly identify the different genres, it does give the user a starting point to separate them into two main groups. This means that another tool must be used alongside the high pass filter in order to obtain an effective music matcher.

13This content is available online at <http://cnx.org/content/m11683/1.2/>.


Figure 3.13


Figure 3.14

3.11 Music Classification by Genre: Power Spectral Density

14

Our program essentially breaks the time-domain signal into windows and computes the norm squared of the FFT of each window. It then averages the magnitude squared of the FFT coefficients of each window and represents the result in decibels. We then have a vector of approximately length 100 that represents the power in the frequency domain. This is a measure of exactly what frequencies are present and at what magnitude.

Rather than using a single number to characterize the whole signal, our power spectral density program returns a vector representing more subtle changes in the spectrum. The decibel scale helps differentiate between genres even further, fanning out the differences between them.
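One way to read this procedure is sketched below. It assumes the average is taken across windows, bin by bin, and uses a direct DFT and a small floor to avoid taking the logarithm of zero; these choices, the names, and the toy window length are assumptions rather than the project's settings.

```python
import cmath
import math

def power_spectrum(window):
    """Squared DFT magnitudes for bins 0..N/2."""
    n = len(window)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(window))) ** 2
            for k in range(n // 2 + 1)]

def average_psd_db(samples, window_len, floor=1e-12):
    """Average the power spectra of all full windows, then convert to dB."""
    spectra = [power_spectrum(samples[i:i + window_len])
               for i in range(0, len(samples) - window_len + 1, window_len)]
    if not spectra:
        raise ValueError("signal is shorter than one window")
    n_bins = window_len // 2 + 1
    mean_power = [sum(s[k] for s in spectra) / len(spectra)
                  for k in range(n_bins)]
    return [10 * math.log10(max(p, floor)) for p in mean_power]

# Toy signal: three windows of a sinusoid that sits on bin 4.
window_len = 32
signal = [math.sin(2 * math.pi * 4 * i / window_len) for i in range(96)]

psd = average_psd_db(signal, window_len)
print(len(psd), psd.index(max(psd)), round(psd[4], 2))
```

The result is one decibel value per frequency bin, so the shape of the whole spectrum is kept rather than a single summary number.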

14This content is available online at <http://cnx.org/content/m11674/1.2/>.


Figure 3.15

3.11.1 Results

The power spectral density was great at showing patterns between genres. Rap has the most distinct pattern, with a sudden downward slope (red). Classical also had a distinctive pattern, with the smallest power at all frequencies. Jazz, punk, and country are all near each other, but at higher frequencies, begin to fan out.

Looking closely at the envelopes, techno spans the largest area, encapsulating almost all of jazz, punk, and country. This is one reason why techno could not be distinguished very well from those genres.


Figure 3.16

3.12 Music Classification by Genre: Total Power

15

The power in a signal is the norm squared of the frequency components of the signal. The vectors are first normalized to the maximum value in the vector so that we are not analyzing loudness; more precisely, we compute the L-2 norm divided by the L-infinity norm. It measures how many harmonics are present in the signal and how much of each harmonic. In our case, the music samples have a wide range of total power: classical piano has low power with few harmonics, whereas punk has high power. You can see from these two plots of the spectrum, on the same scale, that jazz has much smaller power. Jazz has fewer frequency components of smaller power.
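The normalized measure can be sketched in a few lines. The function name and the toy spectra are hypothetical; the computation is the L-2 norm divided by the L-infinity norm as described above.

```python
import math

def total_power(magnitudes):
    """L-2 norm of the spectrum divided by its L-infinity norm."""
    peak = max(abs(m) for m in magnitudes)
    if peak == 0:
        return 0.0  # silent input
    return math.sqrt(sum((m / peak) ** 2 for m in magnitudes))

single_harmonic = [0.0, 3.0, 0.0, 0.0]
four_harmonics = [2.0, 2.0, 2.0, 2.0]

print(total_power(single_harmonic), total_power(four_harmonics))
```

Scaling the whole spectrum leaves the value unchanged, so the measure responds to how many strong components are present and not to loudness.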

15This content is available online at <http://cnx.org/content/m11673/1.3/>.


Figure 3.17


Figure 3.18

3.12.1 Results

The total power of a signal changes radically between genres. Jazz has the lowest total power, while punk tops the list. Punk also has the lowest standard deviation; there should be very little confusion with the rest of the genres. Techno is the least discernible: its standard deviation encapsulates all the other genres. Looking closely at the graph, the spread of the standard deviations of classical and country does not encapsulate any other genres, so they should be easily identified.

The standard deviations of rap and techno are very distinct, whereas the others are all about the same value. Although the average total power of techno may not be a good indicator, the standard deviation should be able to pick out techno.


Figure 3.19

3.13 Neural Networks

16

At their core, neural networks are pattern recognition systems. They predict an output given a sequence of inputs and their corresponding classification. They are based on biological nervous systems, in which there are many inputs and numerous outputs to a single neuron. On the highest level, the neural network is a primitive learning machine that can be used to process data such as stock market quotes, DNA sequences, and in our case, music classification.

Neural networks are systems that take a lengthy input, process the data, and predict an output.

16This content is available online at <http://cnx.org/content/m11667/1.2/>.


Figure 3.20

The processing is done by multiple, weighted layers of nodes. Each node is connected to every node in the next layer, and each interface between nodes is a connecting fiber that carries a weight. Neural networks are given a vector of inputs, usually longer than the output. The first layer of nodes is the same length as the input. The nodes at each successive layer sum their inputs, weight the sum, and produce an output. The output of the final layer is the output of the system. In this manner, an output is predicted given an input.
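A forward pass through such a network can be sketched as follows. This is a generic fully connected network, not the project's configuration: the sigmoid activation, the bias terms, the layer sizes (4 inputs, 3 hidden nodes, 2 outputs), and all weight values are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    """Each node forms a weighted sum of all inputs and squashes it."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def network_forward(inputs, layers):
    """Feed the input through every (weights, biases) layer in order."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations

# Hypothetical network: 4 inputs -> 3 hidden nodes -> 2 outputs.
hidden = ([[0.2, -0.1, 0.4, 0.0],
           [-0.3, 0.5, 0.1, 0.2],
           [0.0, 0.1, -0.2, 0.3]], [0.0, 0.1, -0.1])
output = ([[0.5, -0.4, 0.3],
           [-0.2, 0.6, 0.1]], [0.0, 0.0])

prediction = network_forward([1.0, 0.5, -0.5, 2.0], [hidden, output])
print(prediction)
```

With all weights and biases set to zero every output is 0.5, which shows that the weights alone determine what the network predicts.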

The remaining question is how the weights are determined. The use of neural networks is twofold: you must first "train" the network by giving it inputs and their corresponding outputs, and then you may test the network by giving it inputs with no outputs. The training determines the weighting on the nodes. For example, we train the neural network by giving it the vectors of signal processing data (bandwidth, power spectral density, etc.) as well as the corresponding classification of music. Classical music is denoted as [1 0 0 0 0 0], jazz is denoted as [0 1 0 0 0 0], etc., as shifted delta functions.
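The target vectors can be generated as below. The positions of classical and jazz follow the text above; the order of the remaining four genres is an assumption (it copies the order in which the project summary lists them), and `encode` and `decode` are hypothetical helper names.

```python
# Classical and jazz positions come from the text; the rest are assumed.
GENRES = ["classical", "jazz", "country", "rap", "punk", "techno"]

def encode(genre):
    """Shifted delta function: 1 at the genre's position, 0 elsewhere."""
    vector = [0] * len(GENRES)
    vector[GENRES.index(genre)] = 1
    return vector

def decode(outputs):
    """Pick the genre whose output node has the largest value."""
    best = max(range(len(outputs)), key=outputs.__getitem__)
    return GENRES[best]

print(encode("classical"), encode("jazz"))
print(decode([0.10, 0.05, 0.02, 0.70, 0.10, 0.03]))
```

Reading the prediction back with the largest output also gives a natural confidence value: the closer the winning output is to 1, the surer the network.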

There are many methods to train neural networks, but the one we use is called backpropagation. The neural network takes the input and feeds it through the system, evaluating the output. It then changes the weights in order to get a more accurate output. It continues to run the inputs through the network multiple times until the error between its output and the output you gave it is below a defined tolerance level.
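The train-until-tolerance loop can be illustrated with the smallest possible case, a single sigmoid unit trained by gradient descent on the squared error. This is a deliberate simplification and not the project's network: with no hidden layer there is nothing to propagate backward, so only the output-layer weight update is shown. The learning rate, tolerance, epoch limit, and the logical OR training set are all hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(inputs, weights, bias):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train(samples, learning_rate=1.0, tolerance=0.02, max_epochs=20000):
    """Repeat full passes over the data until the error is below tolerance."""
    weights, bias = [0.0, 0.0], 0.0
    history = []
    for _ in range(max_epochs):
        error, grad_w, grad_b = 0.0, [0.0, 0.0], 0.0
        for inputs, target in samples:
            output = predict(inputs, weights, bias)
            diff = target - output
            error += 0.5 * diff * diff
            # Error signal scaled by the slope of the sigmoid.
            delta = diff * output * (1.0 - output)
            for i, x in enumerate(inputs):
                grad_w[i] += delta * x
            grad_b += delta
        history.append(error)
        if error < tolerance:
            break
        weights = [w + learning_rate * g for w, g in zip(weights, grad_w)]
        bias += learning_rate * grad_b
    return weights, bias, history

# Toy training set: logical OR of two inputs.
OR_SAMPLES = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias, history = train(OR_SAMPLES)
print(len(history), history[0], history[-1])
```

The `history` list plays the role of the error-versus-iteration curve: it starts high and falls as the weights are adjusted.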

[…] learning rate after each iteration, changing it to remain relatively constant. For instance, if the learning curve is too steep, and the network is learning too quickly, it decreases its learning rate, and vice versa. This is a graph of the error (learning rate) versus time:

Figure 3.21: Error in the neural network decreases with each successive iteration.


3.14 Music Classification by Genre: System Performance

17

Figure 3.22: Performance of the neural network improved with successive inputs of training vectors. It begins to recognize characteristics of each music genre!

17This content is available online at <http://cnx.org/content/m11690/1.3/>.


Figure 3.23: When tested with the training vectors, the system is 87.5% accurate. Higher accuracy would imply that the system had memorized the training set and would be unable to generalize when given new inputs. Lighter background stripes indicate greater certainty in identification, while increasingly darker hues denote greater uncertainty. Horizontal black bars indicate the actual genre, and the stems indicate the predicted genre.

These plots show how well the first three genres separate in the output of the network. Even the testing vectors are separated by a high degree of confidence.


Figure 3.24: Spatial separation, weighted by sureness level, of classical (red), jazz (blue), and rap (green) in the training vectors.


Figure 3.25: Spatial separation, weighted by sureness level, of punk (red), techno (blue), and country (green) in the training vectors.


Figure 3.26: Though worthy of the Museum of Modern Art, this depicts the output of the neural network for each of the six genres.


Figure 3.27: Provided songs that the network has never seen, it performs perfectly and with high confidence for rap, while classification of techno is comparatively poor. However, the system is aware of this: error coincides with lack of confidence. Lighter background stripes indicate greater certainty in identification, while increasingly darker hues denote greater uncertainty. Horizontal black bars indicate the actual genre, and the stems indicate the predicted genre.


Figure 3.28: Spatial separation, weighted by sureness level, of classical (red), jazz (blue), and rap (green) in the training vectors.


Figure 3.29: Spatial separation, weighted by sureness level, of punk (red), techno (blue), and country (green) in the training vectors.

Our system successfully determined the genre of the vast majority of the test songs. Not only did the system choose a genre, it quantified its output with a level of sureness. When the system was in error, there was a corresponding uncertainty. The genre that gave our system the most difficulty was techno. The output of the DSP functions for techno had a very high standard deviation, making it hard for the neural network to distinguish its pattern from those of the other genres. Given an unknown song, there is an 84% chance that the system can determine the genre successfully.


3.15 Back propagation mathematics

18

3.15.1 Error definition

The back propagation method is an example of the wide class of training methods that use the information contained in the gradient of the error function. The independent variables in this minimization are the weights of the neural network, and the error to be minimized is the root-mean-square error.

Let us consider the training set composed of L ordered pairs of the following form:

$$\left\{ \left(x^{(1)}, d^{(1)}\right), \left(x^{(2)}, d^{(2)}\right), \ldots, \left(x^{(L)}, d^{(L)}\right) \right\}$$

Furthermore, let us define the total error $E$ generated on the outputs of the neural network after presenting the entire training set as:

$$E = \sum_{l=1}^{L} E^{(l)}$$

where:

$$E^{(l)} = \sum_{m=1}^{M} E_m^{(l)} = \frac{1}{2} \sum_{m=1}^{M} \left(d_m^{(l)} - y_m^{(l)}\right)^2$$

As already mentioned, the independent variables in the minimization of the error $E$ are the weights $w_{ij}$. Since the number of weights is large even for relatively small networks, in real applications the training of a neural network is the minimization of a scalar field over a vector space with hundreds or (more often) thousands of dimensions. One of the minimization techniques for such a problem is the steepest descent method.
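The error definition can be checked numerically with a short sketch. It assumes that d holds the desired outputs and y the outputs the network actually produced, with M entries per training pair; the function names and the two-pattern toy data are hypothetical.

```python
def pattern_error(desired, actual):
    """E^(l): half the sum of squared differences over the M outputs."""
    return 0.5 * sum((d - y) ** 2 for d, y in zip(desired, actual))

def total_error(desired_set, output_set):
    """E: sum of the per-pattern errors over all L training pairs."""
    return sum(pattern_error(d, y) for d, y in zip(desired_set, output_set))

# Toy case with L = 2 training pairs and M = 2 outputs each.
desired_set = [[1.0, 0.0], [0.0, 1.0]]
output_set = [[0.5, 0.5], [0.0, 1.0]]

print(pattern_error(desired_set[0], output_set[0]),
      total_error(desired_set, output_set))
```

The second pattern is reproduced exactly and contributes nothing, so the whole error comes from the first pattern.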


18This content is available online at <http://cnx.org/content/m11120/2.1/>.


3.16 Chris Hunter

19

Figure 3.30: Chris in a drysuit.

Hi everyone, I am an Electrical Engineering major and a member of the Class of 2006 at Rice University. This bio is an informal section about my interests and my life. If you would like a resume, please feel free to email me at [email protected]

I was born in Tampa Bay, FL, but I moved to Austin, TX soon enough that I don't even remember Florida. Austin, in my humble opinion, is one of the most gorgeous places in the country. The weather is great and the people are friendly. If you ever feel like visiting the city, email me and I can show you around.

19This content is available online at <http://cnx.org/content/m11663/1.4/>.


I am a person of many interests. Let's see, I have had classical piano training for about 11 or 12 years. While I do not continue formal training, I have branched out to music composition. If you would like to hear a few of my works, stop by http://www.mp3.com/chrishunter. On the more active side of my hobbies, I enjoy power-kiting (large pull-you-off-the-ground kites) and wakeskating/wakeboarding. The latter is my most recent passion, but I'm off to a quick start. In case you are curious, wakeskating is like skateboarding on the water (i.e. you are not bound to the board like you are on a wakeboard). Thankfully, Austin is one of the top places for water sports thanks to our ample natural lakes and mild climate. I am by no means a great wakeskater or wakeboarder, but here are a few pictures of my adventures:

Figure 3.31: Wakeboarding on Lake Travis.


Figure 3.32: Cold water requires a wetsuit.


Figure 3.33: Colder water requires a drysuit.


Figure 3.34: Let's not talk about it.
