
EXTRACTING THE KEY FROM MUSIC

8.2 Musical pitch and key

8.3.1 Chroma spectrum

As described in Section 2, a chroma represents one of the twelve note positions within an octave. By mapping the musical pitches across different octaves in a spectral representation to their respective chromas using the rules of A440 equal temperament (see Equation 8.1), we arrive at a chroma spectrum. In this way, the chroma spectrum summarizes the harmonic content of a music sample as a compact 12-element feature vector.
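The pitch-to-chroma mapping can be illustrated with a small sketch. Equation 8.1 is not reproduced in this section, so the code below assumes the standard equal-temperament relation (twelve semitones per octave, A4 = 440 Hz). The zero-based chroma index with C = 0 and the function names are choices made for this illustration only.

```python
import math

A4_HZ = 440.0  # reference tuning (A440); assumed equal temperament
CHROMA_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def chroma_index(freq_hz: float) -> int:
    """Return the chroma (0 = C, ..., 11 = B) of the nearest equal-tempered pitch."""
    # Distance from A4 in semitones, rounded to the nearest pitch.
    semitones_from_a4 = round(12.0 * math.log2(freq_hz / A4_HZ))
    # With C as index 0, the chroma A has index 9.
    return (semitones_from_a4 + 9) % 12


def chroma_name(freq_hz: float) -> str:
    """Return the note name of the chroma, ignoring the octave."""
    return CHROMA_NAMES[chroma_index(freq_hz)]


if __name__ == "__main__":
    for f in (32.7, 261.6, 440.0, 1046.5):
        print(f, chroma_name(f))
```

Pitches that are one or more octaves apart, such as 32.7 Hz and 1046.5 Hz, fall on the same chroma, which is what allows the spectrum to be folded into twelve elements.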

General model. A chroma $c_i$, where $i = 1, \ldots, 12$, represents a set of pitches $\{p_{ik} \mid k = 1, \ldots, K\}$, where $K$ denotes the number of octaves, that have the same position $i$ within an octave but differ in octave number $k$. To arrive at a likelihood score for a single chroma, we have to collect the likelihood scores for all pitches sharing the same chroma in the music signal. The pitches can originate from any music instrument, from a melody, a chord, or musical background. We extend an existing spectral model that copes with the correlation in spectral energy between the pitches of a leading soloist and musical background [Shalev-Shwartz et al., 2002].

Let $O = o_1 \ldots o_T$ be a sequence of $T$ vectors. Each vector is a spectral representation of the music signal over a short time frame. The problem can be formulated as finding the likelihood $P(o_t; c_i)$ of 'hearing' a chroma $c_i$ in vector $o_t$.

We denote the spectral distribution of the music signal at frequency $f$ as $S(f)$. Part of it consists of the spectral content of the chroma $c_i$ to be found, denoted as $C(f)$. As an ideal simplification, we model a single tone of a music instrument as a harmonic series. Its spectrum contains high energy bursts at integral multiples $n p_{ik}$, where the integer $n$ is the index of the harmonic.

The spectral content of $c_i$ can then be modelled as a combination of harmonic series for all pitches $p_{ik}$ sharing $c_i$. In other words,

$$C(f) = \sum_{k=1}^{K} \sum_{n=1}^{N} G_{ik}(n)\, \delta(n p_{ik} - f), \tag{8.2}$$

where $N$ denotes the number of harmonics, $G_{ik}(n)$ is the amplitude gain for the $n$-th harmonic of pitch $p_{ik}$, and $\delta(\cdot)$ is Dirac's delta function, taken here in its discrete unit-impulse form, which is 1 at the origin and zero elsewhere. Note that we ignore the fact that the harmonics of different pitches may coincide.
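As an illustration of Equation 8.2, the following sketch builds a discrete version of $C(f)$ on a uniform frequency grid. The grid (bin width `bin_hz`), the nearest-bin rounding, and the function name are assumptions made for this example. Unlike the model, which ignores coinciding harmonics, the sketch simply adds the gains when two harmonics land in the same bin.

```python
def harmonic_template(pitches_hz, gains, num_bins, bin_hz):
    """Discrete sketch of Equation 8.2.

    pitches_hz : the K pitch frequencies p_ik that share one chroma
    gains      : gains[k][n - 1] is the gain G_ik(n) of harmonic n of pitch k
    num_bins   : number of frequency bins in the output
    bin_hz     : width of one frequency bin in Hz (assumed uniform grid)
    """
    spectrum = [0.0] * num_bins
    for k, pitch in enumerate(pitches_hz):
        for n, gain in enumerate(gains[k], start=1):
            # The delta function becomes "put the gain in the nearest bin".
            bin_index = round(n * pitch / bin_hz)
            if 0 <= bin_index < num_bins:
                spectrum[bin_index] += gain
    return spectrum


if __name__ == "__main__":
    # Two pitches one octave apart, two harmonics each (made-up gains).
    print(harmonic_template([100.0, 200.0], [[1.0, 0.5], [0.8, 0.4]], 6, 100.0))
```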

Another part of $S(f)$ consists of the spectral content of the musical background. We assume that the spectral energy of this background affects the entire spectrum $S(f)$, though it matches the energy of the pitches to be found. Its spectral energy at frequency $f$ is denoted as $\eta(f)$. The spectral content of the music signal at frequency $f$ is then modelled as

$$S(f) = \sum_{k=1}^{K} \sum_{n=1}^{N} G_{ik}(n)\, \bigl(\eta(f) + \delta(n p_{ik} - f)\bigr). \tag{8.3}$$

With some rearrangement, the background energy level at frequency $f$, $\eta(f)$, is

$$\eta(f) = \frac{S(f) - \sum_{k=1}^{K} \sum_{n=1}^{N} G_{ik}(n)\, \delta(n p_{ik} - f)}{\sum_{k=1}^{K} \sum_{n=1}^{N} G_{ik}(n)}. \tag{8.4}$$

The characteristics of the musical background, $\eta(f)$, are unknown. A simple assumption is to model it as a random variable from a zero-mean multivariate Gaussian process with statistical independence at all frequencies $f$ and equal variance $\nu$. Then, the joint conditional probability density function (pdf), given the unknown variables, is given by

$$f(\eta \mid \nu) = \frac{1}{(2\pi\nu)^{L/2}} \exp\!\left(-\frac{\|\eta\|_2^2}{2\nu}\right), \tag{8.5}$$

where $L$ denotes the spectrum resolution (or, the number of independent observations) and $\|\cdot\|_2$ is the $\ell_2$-norm.

The maximum likelihood (ML) estimator [Eliason, 1993] for the unknown variables can be obtained by maximizing the log-likelihood function with respect to the unknown variables:

$$(\hat{\nu}, \hat{G}_{ik}(n))_{\mathrm{ML}} = \arg\max_{\nu,\, G_{ik}(n)} \mathcal{L}_\eta(\nu, G_{ik}(n)), \tag{8.6}$$

where

$$\mathcal{L}_\eta(\nu, G_{ik}(n)) = -\frac{L}{2} \log(2\pi\nu) - \frac{\|\eta\|^2}{2\nu}. \tag{8.7}$$

By maximizing the log-likelihood function for the unknown background variance $\nu$, one obtains

$$\hat{\nu} = \frac{\|\eta\|^2}{L}. \tag{8.8}$$
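The step leading to Equation 8.8 can be spelled out as follows. Starting from the log of the Gaussian density in Equation 8.5, the derivative with respect to $\nu$ is set to zero. The calligraphic symbol $\mathcal{L}_\eta$ for the log-likelihood is a notational choice made here to keep it apart from the spectrum resolution $L$.

```latex
\begin{align*}
\mathcal{L}_\eta(\nu)
  &= -\frac{L}{2}\log(2\pi\nu) - \frac{\|\eta\|^2}{2\nu}, \\
\frac{\partial \mathcal{L}_\eta}{\partial \nu}
  &= -\frac{L}{2\nu} + \frac{\|\eta\|^2}{2\nu^2} = 0
  \quad\Longrightarrow\quad
  \hat{\nu} = \frac{\|\eta\|^2}{L}.
\end{align*}
```

The second derivative at $\hat{\nu}$ equals $-L/(2\hat{\nu}^2)$, which is negative, so the stationary point is indeed a maximum.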

A sensible choice for the unknown harmonic amplitudes $G_{ik}(n)$ is to let them correspond to the spectral peaks $S(n p_{ik})$ at the harmonics of the pitch frequency $p_{ik}$. This is sensible because we can expect the musical background level $\eta(f)$ to be relatively small compared to the energy at the harmonics of $p_{ik}$. By substituting the estimates for $G_{ik}(n)$ and $\nu$ and using Equation 8.4, the log-likelihood score becomes

Steffen Pauws


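$$\mathcal{L}_\eta(\hat{\nu}, \hat{G}_{ik}(n)) = -\frac{L}{2}\left(\log(2\pi) + \log(\|\eta\|^2) - \log(L) + 1\right) = c + \frac{L}{2} \log\frac{\|C\|^2}{\|N\|^2}, \tag{8.9}$$

where $c$ is a constant, $\|C\|^2$ denotes the energy of the spectrum from Equation 8.2 in which the gains $G_{ik}(n)$ are substituted by their estimates, and $\|N\|^2$ denotes the energy in the musical background spectrum $N(f) = S(f) - C(f)$.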

We consider the musical background only as a random part of the model. Consequently, the likelihood as provided by Equation 8.9 also constitutes the likelihood for finding the chroma in the spectrum. In other words,

$$P(o_t; c_i) \propto \frac{\|C\|^2}{\|N\|^2}. \tag{8.10}$$
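A minimal sketch of the ratio in Equation 8.10 on a discrete spectrum is given below. It follows the suggestion in the text to take the gain estimates as the spectral values at the harmonics, so that the chroma part is the spectrum at the harmonic bins and the background part is everything else. The uniform frequency grid, the nearest-bin rounding, and the function name are assumptions of this example.

```python
def chroma_score(spectrum, pitches_hz, num_harmonics, bin_hz):
    """Sketch of Equation 8.10: harmonic energy over background energy.

    spectrum      : amplitude values S(f) on a uniform grid (bin b is b * bin_hz)
    pitches_hz    : pitch frequencies p_ik that share the chroma of interest
    num_harmonics : number of harmonics N considered per pitch
    """
    harmonic_bins = set()  # a set, so coinciding harmonics count once
    for pitch in pitches_hz:
        for n in range(1, num_harmonics + 1):
            b = round(n * pitch / bin_hz)
            if 0 <= b < len(spectrum):
                harmonic_bins.add(b)

    # C(f): spectrum at the harmonic bins; N(f) = S(f) - C(f): all other bins.
    energy_c = sum(spectrum[b] ** 2 for b in harmonic_bins)
    energy_n = sum(v ** 2 for b, v in enumerate(spectrum) if b not in harmonic_bins)
    if energy_n == 0.0:
        return float("inf")  # no background energy at all
    return energy_c / energy_n


if __name__ == "__main__":
    toy = [0.1, 1.0, 0.1, 0.5, 0.1]  # made-up spectrum, 100 Hz per bin
    print(chroma_score(toy, [100.0], 3, 100.0))
    print(chroma_score(toy, [200.0], 2, 100.0))
```

In the toy spectrum, the candidate at 100 Hz scores far higher than the one at 200 Hz because most of the energy sits on its harmonics.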

The right-hand expression of Equation 8.10 represents the calculation of the element of the chroma spectrum vector that corresponds to chroma $c_i$. The chroma spectrum of an observation sequence $O = o_1 \ldots o_T$ with multiple non-overlapping observation vectors is defined as the mean vector of the individual chroma spectra of all $o_t$.

Implementation. To identify the musical pitches more easily, the subharmonic sum spectrum [Hermes, 1988] is used as a spectral representation in Equation 8.10. All harmonics are resolved (or, folded) to their fundamental by harmonic compression (i.e., by multiplying the frequency scale by an integral factor). In other words,

$$H(f) = \sum_{n=1}^{N} h^{n-1}\, W(n f)\, A(n f), \tag{8.11}$$

where $N$ is the number of harmonics, $h \le 1$ is a factor controlling the contribution of each harmonic to its fundamental, $W(\cdot)$ is an auditory sensitivity filter, and $A(\cdot)$ is the amplitude spectrum.
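The summation in Equation 8.11 can be sketched on a discrete amplitude spectrum as shown below. This is an illustration on a linear frequency grid in which bin `b` stands for frequency `b` times the bin width. The function name and the default of a flat weighting when no `weight` list is supplied are assumptions of the example.

```python
def subharmonic_sum(amplitude, num_harmonics=15, h=0.75, weight=None):
    """Sketch of Equation 8.11: H(f) = sum_n h^(n-1) * W(n f) * A(n f).

    amplitude : A(f) sampled on a uniform grid starting at 0 Hz
    weight    : optional W(f) on the same grid; 1 everywhere if omitted
    """
    num_bins = len(amplitude)
    result = [0.0] * num_bins
    for b in range(num_bins):
        total = 0.0
        for n in range(1, num_harmonics + 1):
            src = n * b  # bin of the n-th harmonic of candidate fundamental b
            if src >= num_bins:
                break  # harmonic lies above the available spectrum
            w = 1.0 if weight is None else weight[src]
            total += (h ** (n - 1)) * w * amplitude[src]
        result[b] = total
    return result


if __name__ == "__main__":
    # Made-up spectrum with a fundamental at bin 2 and harmonics at bins 4 and 6.
    spectrum = [0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
    summed = subharmonic_sum(spectrum)
    print(summed.index(max(summed)), max(summed))
```

Folding the harmonics onto bin 2 makes it the clear maximum, even though bins 4 and 6 carry the same amplitude in the input.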

In the calculation of Equation 8.11, the following properties are implemented to reduce computing time and increase frequency resolution.

1. The input music signal is partitioned into non-overlapping time frames of 100 milliseconds.

2. The signal is low-pass filtered and downsampled to cut off spectral content above 5 kHz for performing harmonic compression over 6 octaves. It is assumed that harmonics above 5 kHz do not contribute significantly to the pitches below 5 kHz, though we might miss instruments tuned very high. A 1024-point FFT with a Hamming window is used to compute the amplitude spectrum.

3. Spectral components (i.e., the peaks) are enhanced to cancel out spurious peaks that do not contribute to pitches.

4. Only a limited number of harmonically compressed spectra are added. We use $N = 15$. Spectral components at higher frequencies contribute less to pitch than spectral components at lower frequencies. We use $h = 0.75$.

5. Harmonic compression on a linear frequency scale is implemented as a harmonic shift in the logarithmic frequency scale by using $s = \log_2 f$. Instead of Equation 8.11, we use

$$H(s) = \sum_{n=1}^{N} h^{n-1}\, W(s + \log_2 n)\, A(s + \log_2 n). \tag{8.12}$$

To achieve a higher frequency resolution, interpolation is used in the logarithmic frequency scale. In total, 171 (approximately 1024/6) points per octave are interpolated over 6 octaves by a cubic spline method. If we used the original frequency resolution provided by the FFT, we would get a frequency resolution of 9.77 Hz (10,000/1024). The lowest octave that we consider, A0-A1, has a frequency range of 27.5 Hz, so this resolution would be far too coarse: a single FFT bin would cover about 35% of this octave.

6. A weighting function is used to model the human auditory sensitivity for frequencies below 1250 Hz. We use a raised arc-tangent function.
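The log-frequency formulation of Equation 8.12 can be sketched as follows. This is not the implementation described above but a simplified illustration: linear interpolation between FFT bins is swapped in for the cubic spline, the weighting $W(\cdot)$ is set to 1 instead of the raised arc-tangent function, and the grid starts at A0 (27.5 Hz). The function name and all parameter names are assumptions of the example.

```python
import math


def log_frequency_shs(amplitude, bin_hz, f_min=27.5, octaves=6,
                      points_per_octave=171, num_harmonics=15, h=0.75):
    """Sketch of Equation 8.12 with W(.) = 1 and linear interpolation.

    amplitude : amplitude spectrum on a uniform grid (bin b is b * bin_hz)
    Returns a list of (frequency_hz, H) pairs on a logarithmic frequency grid.
    """
    def amp_at(freq_hz):
        # Linear interpolation between neighbouring FFT bins
        # (a stand-in for the cubic spline mentioned in the text).
        pos = freq_hz / bin_hz
        lo = int(math.floor(pos))
        if lo < 0 or lo + 1 >= len(amplitude):
            return 0.0  # outside the available spectrum
        frac = pos - lo
        return (1.0 - frac) * amplitude[lo] + frac * amplitude[lo + 1]

    s_min = math.log2(f_min)
    result = []
    for j in range(octaves * points_per_octave):
        s = s_min + j / points_per_octave  # position on the log2 frequency axis
        total = 0.0
        for n in range(1, num_harmonics + 1):
            # Shifting by log2(n) on the log axis equals compressing by n
            # on the linear axis.
            total += (h ** (n - 1)) * amp_at(2.0 ** (s + math.log2(n)))
        result.append((2.0 ** s, total))
    return result


if __name__ == "__main__":
    # Made-up spectrum: 10 Hz bins, peaks at 220, 440, 660 and 880 Hz.
    spectrum = [0.0] * 512
    for b in (22, 44, 66, 88):
        spectrum[b] = 1.0
    best_freq, best_value = max(log_frequency_shs(spectrum, 10.0), key=lambda p: p[1])
    print(round(best_freq, 1), round(best_value, 3))
```

With 171 points per octave, neighbouring grid points are about 0.4% apart in frequency, which is the same relative resolution in every octave, including the lowest one.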

The chroma spectrum for each frame is computed by locating and remapping the spectral regions in the harmonically compressed spectrum that correspond to each chroma in A440 equal temperament. For the chroma C, this comes down to the spectral regions centered around the pitch frequencies for C1 (32.7 Hz), C2 (65.4 Hz), C3 (130.8 Hz), C4 (261.6 Hz), C5 (523.3 Hz) and C6 (1046.5 Hz). The width of each spectral region is taken as half a semitone around this center to reduce the effects of 'slightly mistuned' pitches. The amplitudes in all spectral regions are combined to form one chroma region.
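The region centers quoted for the chroma C can be reproduced with a short sketch. The code assumes A440 equal temperament with scientific pitch notation (C4 is middle C) and a zero-based chroma index with C = 0. It also reads the "half a semitone" width as a total width, that is, a quarter of a semitone on either side of the center. That reading is an assumption and can be changed through `half_width_semitones`.

```python
A4_HZ = 440.0  # reference tuning (A440)


def chroma_regions(chroma, first_octave=1, last_octave=6, half_width_semitones=0.25):
    """Return (low_hz, center_hz, high_hz) per octave for one chroma (0 = C, ..., 11 = B)."""
    regions = []
    for octave in range(first_octave, last_octave + 1):
        # Semitone distance from A4 (chroma 9 in octave 4).
        semitones = (octave - 4) * 12 + (chroma - 9)
        center = A4_HZ * 2.0 ** (semitones / 12.0)
        low = center * 2.0 ** (-half_width_semitones / 12.0)
        high = center * 2.0 ** (half_width_semitones / 12.0)
        regions.append((low, center, high))
    return regions


if __name__ == "__main__":
    for low, center, high in chroma_regions(0):
        print(f"{low:8.2f} {center:8.2f} {high:8.2f}")
```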

The chroma spectrum elements are computed using Equation 8.10. Adding and averaging the chroma spectra over all frames results in a chroma spectrum for the complete music sample (see Figures 8.1 and 8.2 for an example).
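The framing and averaging steps can be sketched as follows. The helper names are hypothetical, and the per-frame chroma vectors are assumed to be computed elsewhere. Dropping a trailing partial frame is a choice made for this example, not something the text specifies.

```python
def split_frames(samples, sample_rate, frame_ms=100):
    """Cut a signal into non-overlapping frames; a trailing partial frame is dropped."""
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def mean_chroma_spectrum(frame_chromas):
    """Average a list of 12-element chroma vectors, one per frame."""
    if not frame_chromas:
        raise ValueError("need at least one frame")
    num_frames = len(frame_chromas)
    return [sum(frame[i] for frame in frame_chromas) / num_frames for i in range(12)]


if __name__ == "__main__":
    frames = split_frames(list(range(25)), sample_rate=100)
    print(len(frames), len(frames[0]))
    print(mean_chroma_spectrum([[1.0] * 12, [3.0] * 12]))
```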