
Audio Feature Extraction

2.1 Audio Representations

To analyze music stored as a recorded audio signal, we need to devise representations that roughly correspond to how we perceive sound through the auditory system. At a fundamental level, such audio representations will help determine when things happen in time and how fast they repeat (frequency).

Therefore, the foundation of any audio analysis algorithm is a representation that is structured around time and frequency. In audio signal processing, an important property of a time-frequency transform is invertibility, which means that the original audio signal, or a very close approximation, can be reconstructed from the values of the transform. As the goal of audio feature extraction is analysis, a lot of information typically needs to be discarded, and therefore perfect reconstruction is not as important.

Before discussing time and frequency representations, we briefly discuss how audio signals are represented digitally. Sound is created when air molecules are set into motion by some kind of vibration. The resulting changes in air pressure can be represented as a continuous signal over time. In order to represent this continuous process in a finite amount of memory, the continuous signal is sampled at regular periodic intervals. The resulting sequence of samples, which still has continuous values, is then converted to a sequence of discretized samples through the process of quantization. For example, CD-quality audio has a sampling rate of 44,100 Hz and a dynamic range of 16 bits. This means each second of sound is represented as 44,100 samples equally spaced in time, and each one of those samples is represented by 16 bits. One fundamental result in signal processing is the Nyquist-Shannon sampling theorem [59], which states that if a function $x(t)$ contains no frequencies higher than $B$ Hertz, it can be completely reconstructed from a series of points spaced $1/(2B)$ seconds apart. What this means is that if the highest frequency we are interested in is $B$ Hertz, then we need to sample the signal at $2B$ Hertz or higher. As the data rates of audio signals are very high, this has important implications. For example, telephone-quality speech is typically sampled at 8,000 Hz, whereas CD audio is sampled at 44,100 Hz.
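As a concrete illustration of sampling and quantization, the following sketch generates one second of a 440 Hz sine wave at the CD sampling rate and quantizes it to 16 bits (the signal frequency and amplitude are arbitrary example values):

```python
import numpy as np

# One second of a 440 Hz sine sampled at the CD rate of 44,100 Hz.
fs = 44100                                 # sampling rate in Hz
t = np.arange(fs) / fs                     # 44,100 sampling instants, 1/fs seconds apart
x = 0.8 * np.sin(2 * np.pi * 440 * t)      # continuous-valued samples in [-0.8, 0.8]

# Quantization: map the continuous values to 16-bit integers, as on an audio CD.
x_16bit = np.round(x * 32767).astype(np.int16)
print(x_16bit[:5])                         # first few quantized samples
```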

The short-time Fourier transform (STFT) is arguably the most common time-frequency representation and has been widely used in many domains in addition to music processing. Other audio feature representations, such as the Mel-frequency cepstral coefficients (MFCCs) and chroma, are based on the STFT. An important factor in the wide use of the STFT is the high speed with which it can be computed, for suitable transform sizes, using the fast Fourier transform (FFT) algorithm.

2.1.1 The Short-Time Fourier Transform

The fundamental idea behind the short-time Fourier transform (STFT), as well as many other time-frequency representations, is to express a signal as a linear combination of basic elementary signals that can be more easily understood and manipulated. The resulting representation contains information about how the energy of the signal is distributed in both time and frequency. The STFT is essentially a discrete Fourier transform (DFT) adapted to provide localization in time. The DFT has its origins in the Fourier series, in which any complicated continuous periodic function can be written as an infinite discrete sum of sine and cosine signals. Similarly, the DFT can be viewed as representing any finite, discrete signal (properties required for processing by a computer) by a finite, discrete sum of discretized sine and cosine signals. Figure 2.1(a) shows such a simple sinusoidal signal.

Figure 2.1

Example of the discrete Fourier transform: (a) basis function, (b) time-domain waveform, (c) windowed waveform, (d) windowed sinusoid. Using the DFT, a short sequence of time-domain samples such as the one shown in (b) is expressed as a linear combination of simple sinusoidal signals such as the one shown in (a). Windowing applies a bell-shaped curve to reduce artifacts due to discontinuities when the waveform is periodically repeated, as shown in (c) and (d).

It is possible to calculate the DFT of an entire audio clip and show how the energy of the signal is distributed among different frequencies. However, such an analysis would provide no information about when these frequencies start and stop in time. The idea behind the STFT is to process small segments of the audio clip at a time and compute the DFT of each segment. The output of the DFT is called a spectrum. The resulting sequence of spectra contains information about time as well as frequency. The process of obtaining a small segment from a long audio signal can be viewed as a multiplication of the original audio signal with a signal that has the value 1 during the time period of interest and the value 0 outside it. Such a signal is called a rectangular window.
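The segment-by-segment computation just described can be sketched in a few lines; the frame size, hop size, and the use of NumPy are illustrative assumptions, not a prescription:

```python
import numpy as np

def stft(x, frame_size=512, hop_size=256, window=None):
    """Minimal STFT sketch: slide a window along x and take the DFT
    (via the FFT) of each segment, yielding one spectrum per frame."""
    if window is None:
        window = np.ones(frame_size)       # rectangular window (see discussion below)
    frames = []
    for start in range(0, len(x) - frame_size + 1, hop_size):
        segment = x[start:start + frame_size] * window
        frames.append(np.fft.fft(segment))
    return np.array(frames)                # shape: (num_frames, frame_size)
```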

Any type of Fourier analysis assumes infinite periodic signals, so processing finite signals is done by effectively repeating them to form a periodic signal.

If the finite signal analyzed has been obtained by rectangular windowing, then there will be a large discontinuity at the point where the end of the signal is connected to its start in the process of periodic repetition. This discontinuity will introduce significant energy at all frequencies, distorting the analysis. Another way to view this is that the sharp slope of the rectangular window causes additional frequencies across the entire spectrum. This distortion of the spectrum is called spectral leakage. To reduce the effects of spectral leakage, instead of a rectangular window, a nonnegative smooth "bell-shaped" curve is used. There are several variants, named after the people who proposed them, with slightly different characteristics; examples include the Hann, Hamming, and Blackman windows. Figures 2.1(c) and 2.1(d) show the effect of windowing on a time-domain waveform and a single sinusoid, respectively.

Figure 2.2 shows the spectra of a mixture of two sinusoids with frequencies of 1500 Hz and 3000 Hz, sampled at 22050 Hz and weighted with rectangular, Hamming, and Hann windows. As can be seen, windowing makes the spectrum less spread out and closer to the ideal theoretical result (two single peaks at the corresponding frequencies).
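The setup of Figure 2.2 can be reproduced with a short sketch; the signal parameters follow the figure, while the frame length is an assumed value:

```python
import numpy as np

fs, n = 22050, 1024
t = np.arange(n) / fs
x = np.sin(2 * np.pi * 1500 * t) + np.sin(2 * np.pi * 3000 * t)

windows = {
    "rectangular": np.ones(n),
    "hamming": np.hamming(n),
    "hann": np.hanning(n),
}
for name, w in windows.items():
    spectrum = np.abs(np.fft.rfft(x * w))   # magnitude spectrum of the windowed mixture
    print(name, "peak bin near", round(np.argmax(spectrum) * fs / n), "Hz")
```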

Formally, the DFT is defined as:

\[
X[k] = \sum_{t=0}^{N-1} x[t]\, e^{-j 2\pi k t / N}, \qquad k = 0, \ldots, N-1 \tag{2.1}
\]

where $X[k]$ is the complex number corresponding to frequency bin $k$, or equivalently to a frequency of $k F_s / N$, where $F_s$ is the sampling frequency and $N$ is the number of frequency bins as well as the number of time-domain samples $x[t]$.
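Equation (2.1) translates directly into code. A naive sketch, checked against NumPy's FFT:

```python
import numpy as np

def dft(x):
    """Direct implementation of Eq. (2.1): N multiplications per bin, O(N^2) overall."""
    N = len(x)
    t = np.arange(N)
    return np.array([np.sum(x * np.exp(-2j * np.pi * k * t / N)) for k in range(N)])

x = np.random.randn(64)
assert np.allclose(dft(x), np.fft.fft(x))   # agrees with the fast implementation
```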

The notation above is based on Euler’s relation:

\[
e^{j\theta} = \cos\theta + j\sin\theta \tag{2.2}
\]

Figure 2.2

Effect of windowing on the magnitude spectrum of a mixture of two sinusoids.

Therefore, one can think of the representation of $x[t]$ as a weighted sum (with weights $X[k]$) of sinusoidal signals with frequencies $k F_s / N$, magnitudes:

\[
|X[k]| = \sqrt{\mathrm{Re}(X[k])^2 + \mathrm{Im}(X[k])^2} \tag{2.3}
\]

and phases:

\[
\phi[k] = -\arctan\!\left(\frac{\mathrm{Im}(X[k])}{\mathrm{Re}(X[k])}\right) \tag{2.4}
\]

The obvious implementation of the DFT requires $N$ multiplications for each frequency bin, resulting in a computational complexity of $O(N^2)$. It turns out that it is possible to calculate the DFT much faster, with complexity $O(N \log N)$, using the fast Fourier transform (FFT) when the size $N$ is a power of 2. For example, for 512 points (a very common choice in audio signal processing) the FFT is approximately 57 times faster than the direct DFT ($512^2 / (512 \log_2 512) = 512/9 \approx 56.9$).
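In NumPy these quantities are obtained directly from the complex FFT output; note that np.angle returns $+\arctan2(\mathrm{Im}, \mathrm{Re})$, so the sign is flipped below to match the convention of Eq. (2.4):

```python
import numpy as np

x = np.random.randn(512)
X = np.fft.fft(x)

magnitude = np.abs(X)    # Eq. (2.3): sqrt(Re(X)^2 + Im(X)^2)
phase = -np.angle(X)     # Eq. (2.4); sign flipped relative to np.angle's convention
```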

The identity of a sound is mostly determined by the magnitude spectrum, although the phase plays an important role, especially near transient parts of the signal such as percussion hits. Therefore, the majority of audio feature extraction methods for analyzing music consider only the magnitude spectrum.

The human ear has a remarkably large dynamic range. The ratio of the sound intensity of the quietest sound that the ear can hear to that of the loudest sound that can cause permanent damage exceeds a trillion ($10^{12}$). Psychophysical experiments have determined that humans perceive intensities approximately on a logarithmic scale in order to cope with this large dynamic range.

Figure 2.3

Two magnitude spectra in dB corresponding to 20-msec excerpts from music clips.

The decibel is commonly used in acoustics to quantify sound levels relative to a 0 dB reference (defined as the sound pressure level of the softest sound a person with average hearing can detect). The difference in decibels between two sounds with powers $P_1$ and $P_2$ is:

\[
10 \log_{10}\frac{P_2}{P_1} \tag{2.5}
\]

For example, if the second sound has twice the power, the difference is:

\[
10 \log_{10}\frac{P_2}{P_1} = 10 \log_{10} 2 \approx 3\ \mathrm{dB} \tag{2.6}
\]

The base-10 logarithm of 1 trillion is 12, resulting in a dynamic range of 120 dB.

Digital samples measure levels of sound pressure. The power in a sound wave is proportional to the square of the pressure. Therefore, the difference between two sounds with pressures $p_1$ and $p_2$ is:

\[
10 \log_{10}\frac{P_2}{P_1} = 10 \log_{10}\frac{p_2^2}{p_1^2} = 20 \log_{10}\frac{p_2}{p_1} \tag{2.7}
\]

The spectrum in dB can then be defined as:

\[
|X[k]|_{\mathrm{dB}} = 20 \log_{10}(|X[k]| + \epsilon) \tag{2.8}
\]

where $\epsilon$ is a very small number used to avoid taking the logarithm of 0. Figure 2.3 shows two magnitude spectra in dB corresponding to 20-millisecond (msec) frames from two different pieces of music.
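A sketch of Eq. (2.8); the value of $\epsilon$ is an arbitrary small constant:

```python
import numpy as np

def magnitude_spectrum_db(x, eps=1e-10):
    """Magnitude spectrum in dB as in Eq. (2.8)."""
    X = np.fft.rfft(x)                      # keep only the non-negative frequencies
    return 20 * np.log10(np.abs(X) + eps)   # eps avoids log10(0)

frame = np.random.randn(441)                # e.g., a 20-msec frame at 22,050 Hz
print(magnitude_spectrum_db(frame)[:5])
```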

2.1.2 Filterbanks, Wavelets, and Other Time-Frequency Representations

In signal processing, a filterbank refers to any system that separates the input signal into several subbands using an array (bank) of filters. Typically, each subband corresponds to a subset of frequencies. There is no requirement that the subband frequency ranges be non-overlapping or that the transformation be invertible. The STFT, wavelets, and many other types of signal decompositions can all be viewed as filterbanks with specific constraints and characteristics.

A common use of filterbanks in audio processing is as approximations to how the human auditory system processes sound. For example, experiments in psychophysics have shown that the sensitivity of the human ear to different frequencies is not uniform along the frequency axis. A common approximation is the Mel filterbank, in which filters are uniformly spaced below 1 kHz and logarithmically spaced above 1 kHz. In contrast, the DFT can be thought of as a filterbank with very narrow-band filters that are linearly spaced between 0 Hz and $F_s/2$, where $F_s$ is the sampling frequency.
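A widely used formula for the Mel scale (one of several variants in the literature) makes this spacing concrete; the number of filters and the frequency range below are arbitrary example values:

```python
import numpy as np

def hz_to_mel(f):
    # A common Mel-scale formula; variants with different constants exist.
    return 2595 * np.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

# Ten center frequencies equally spaced on the Mel scale between 0 and 8 kHz:
centers_hz = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(8000), 10))
print(np.round(centers_hz))   # nearly linear spacing at first, then logarithmic
```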

Music, and audio signals in general, change over time; therefore, we are interested in their frequency content locally in time. The STFT addresses this by windowing a section of the signal and then taking its DFT. The time and frequency resolution of the STFT is fixed, based on the size of the window and the sampling rate. Increasing the size of the window makes the estimation of frequencies more precise (higher frequency resolution) but makes the detection of when they take place (time resolution) less accurate. This time-frequency trade-off is fundamental to any type of time-frequency analysis.
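The trade-off can be made concrete: a window of $N$ samples at sampling rate $F_s$ gives DFT bins spaced $F_s/N$ Hz apart but spans $N/F_s$ seconds, so one resolution improves exactly as the other degrades. A small sketch with two example window sizes:

```python
fs = 44100                        # example sampling rate in Hz
for n in (512, 4096):             # two example window sizes
    freq_res = fs / n             # spacing between DFT bins in Hz
    time_span = 1000 * n / fs     # window duration in milliseconds
    print(f"N={n}: bins {freq_res:.1f} Hz apart, window {time_span:.1f} ms long")
```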

Wavelet analysis uses variable time-frequency resolution, so that low frequencies are detected precisely in frequency (high frequency resolution) but are not placed very accurately in time (low time resolution), while high frequencies are detected less precisely in frequency (low frequency resolution) but are placed more accurately in time (high time resolution). The most common, dyadic form of the discrete wavelet transform can be viewed as a filterbank in which each filter has half the frequency range of the filter with the closest higher center frequency. Essentially, this means that each filter spans an octave in frequency and has half/twice the bandwidth of its adjacent filters. As an example, the filters in a dyadic discrete wavelet transform for a sampling rate of 10,000 Hz would have the following passbands: 2500–5000 Hz, 1250–2500 Hz, 625–1250 Hz, and so on. Each successive band (from high center frequency to low) can be represented by half the number of samples of the previous one, according to the Nyquist-Shannon sampling theorem. That way, the discrete wavelet transform has the same number of coefficients as the original discrete signal (similar to the DFT).
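The dyadic band structure can be computed directly; a sketch assuming the 10,000 Hz sampling rate from the example above:

```python
fs = 10000                 # example sampling rate in Hz
upper = fs / 2             # Nyquist frequency: 5000 Hz
for level in range(4):     # a few octave bands, from high to low
    lower = upper / 2      # each band spans one octave
    print(f"level {level}: {lower:.0f}-{upper:.0f} Hz")
    upper = lower          # the next band needs half as many samples
```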

Both the DFT and the DWT are mathematical and signal-processing tools that do not take into account the characteristics of the human auditory system. There are also alternative representations, typically termed auditory models, that take this information into account. They are informed by experiments in psychoacoustics and by our increasing knowledge of the biomechanics of the human ear, and they differ in the amount of detail and the accuracy of their modeling [61]. The more efficient versions are constructed on top of the DFT, but the more elaborate models perform direct filtering followed by stages simulating the inner hair cell mechanism of the ear. In general, they tend to be computationally heavier, and for some tasks, such as automatic music genre classification, they have so far not shown any advantages compared to more direct approaches. However, they exhibit many of the perceptual characteristics of the human ear, so it is likely that they will be used especially in tasks that require rich, structured representations of audio signals [50].
