
EXTRACTING THE KEY FROM MUSIC

8.2 Musical pitch and key

8.3.1 Chroma spectrum

As described in Section 2, a chroma represents one of the twelve note positions within an octave. By mapping the musical pitches across different octaves in a spectral representation to their respective chromas using the rules of A440 equal temperament (see Equation 8.1), we arrive at a chroma spectrum. In this way, the chroma spectrum summarizes the harmonic content of a music sample as a compact 12-element feature vector.
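The pitch-to-chroma mapping can be illustrated with a minimal sketch. Equation 8.1 is not reproduced in this section, so the code assumes the standard A440 equal-temperament relation (twelve equal semitones per octave, A4 = 440 Hz). The function name `chroma_index` and the 0-based indexing (0 = C) are choices made for this illustration only; the text indexes chromas from 1 to 12.

```python
import math

A4_HZ = 440.0  # reference tuning (A440)
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def chroma_index(freq_hz: float) -> int:
    """Return 0..11 (0 = C) for the equal-tempered pitch nearest to freq_hz."""
    # Signed distance from A4 in semitones, rounded to the nearest pitch.
    semitones_from_a4 = round(12.0 * math.log2(freq_hz / A4_HZ))
    # A lies 9 semitones above C, so shift before wrapping into one octave.
    return (semitones_from_a4 + 9) % 12


if __name__ == "__main__":
    for f in (32.7, 261.6, 440.0, 880.0):
        print(f, NOTE_NAMES[chroma_index(f)])
```

Pitches that differ only by a whole number of octaves (for example 440 Hz and 880 Hz) fall on the same index, which is the property the chroma spectrum relies on.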

General model. A chroma $c_i$, where $i = 1, \ldots, 12$, represents a set of pitches $\{p_{ik} \mid k = 1, \ldots, K\}$, where $K$ denotes the number of octaves, that have the same position $i$ within a given octave but differ in octave number $k$. To arrive at a likelihood score for a single chroma, we have to collect the likelihood scores for all pitches sharing the same chroma in the music signal. The pitches can originate from any musical instrument, from a melody, a chord, or the musical background. We extend an existing spectral model that copes with the correlation in spectral energy between the pitches of a leading soloist and the musical background [Shalev-Shwartz et al., 2002].

Let $O = o_1 \ldots o_T$ be a sequence of $T$ vectors. Each vector represents a spectrum representation of the music signal over a short time frame. The problem can be formulated as finding the likelihood $P(o_t; c_i)$ of 'hearing' a chroma $c_i$ in vector $o_t$.

We denote the spectral distribution of the music signal at frequency $f$ as $S(f)$. Part of it consists of the spectral content of the chroma $c_i$ to be found, denoted as $C(f)$. As an ideal simplification, we model a single tone of a musical instrument as a harmonic series. Its spectrum contains high energy bursts at integral multiples $n p_{ik}$, where the integer $n$ is the index of the harmonic.

The spectral content of $c_i$ can then be modelled as a combination of harmonic series for all pitches $p_{ik}$ sharing $c_i$. In other words,

$$C(f) = \sum_{k=1}^{K} \sum_{n=1}^{N} G_{ik}(n)\,\delta(n p_{ik} - f), \tag{8.2}$$

where $N$ denotes the number of harmonics, $G_{ik}(n)$ is the amplitude gain for the $n$-th harmonic of pitch $p_{ik}$, and $\delta(\cdot)$ is Dirac's delta function, treated here as a unit impulse that is 1 at the origin and zero elsewhere. Note that we ignore the fact that the harmonics of different pitches may coincide.
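As an illustration of Equation 8.2, the following sketch places the gains $G_{ik}(n)$ on a discrete frequency grid. The grid itself (`bin_hz`, `n_bins`), the rounding of each harmonic to its nearest bin, and the function name `harmonic_comb` are assumptions introduced for this example and are not part of the model as stated.

```python
def harmonic_comb(pitches_hz, gains, n_harmonics, bin_hz, n_bins):
    """Discretized Eq. 8.2: add gain G_ik(n) to the bin nearest to n * p_ik.

    pitches_hz : the pitches p_ik of one chroma, one per octave k
    gains      : gains[k][n - 1] is the gain of harmonic n of pitch k
    """
    spectrum = [0.0] * n_bins
    for k, pitch in enumerate(pitches_hz):
        for n in range(1, n_harmonics + 1):
            b = round(n * pitch / bin_hz)  # nearest bin (assumption)
            if b < n_bins:
                # Coinciding harmonics of different pitches simply add up here.
                spectrum[b] += gains[k][n - 1]
    return spectrum


if __name__ == "__main__":
    # Two octave-related pitches, two harmonics each, on a 100 Hz grid.
    print(harmonic_comb([100.0, 200.0], [[1.0, 0.5], [1.0, 0.5]], 2, 100.0, 6))
```

In the example, the second harmonic of the lower pitch and the fundamental of the upper pitch land in the same bin, which is the coincidence the model chooses to ignore.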

Another part of $S(f)$ consists of the spectral content of the musical background. We assume that the spectral energy of this background affects the entire spectrum $S(f)$, though it matches the energy of the pitches to be found. Its spectral energy at frequency $f$ is denoted as $\eta(f)$. The spectral content of the music signal at frequency $f$ is then modelled as

$$S(f) = \sum_{k=1}^{K} \sum_{n=1}^{N} G_{ik}(n)\,\bigl(\eta(f) + \delta(n p_{ik} - f)\bigr). \tag{8.3}$$

With some rearrangement, the background energy level at frequency $f$, $\eta(f)$, is

$$\eta(f) = \frac{S(f) - \sum_{k=1}^{K} \sum_{n=1}^{N} G_{ik}(n)\,\delta(n p_{ik} - f)}{\sum_{k=1}^{K} \sum_{n=1}^{N} G_{ik}(n)}. \tag{8.4}$$

The characteristics of the musical background, $\eta(f)$, are unknown. A simple assumption is to model it as a random variable from a zero-mean multivariate Gaussian process with statistical independence at all frequencies $f$ and equal variance $\nu$. Then, the joint conditional probability density function (pdf), given the unknown variables, is given by

$$f(\eta \mid \nu) = \frac{1}{(2\pi\nu)^{L/2}}\, e^{-\frac{\|\eta\|^2}{2\nu}}, \tag{8.5}$$

where $L$ denotes the spectrum resolution (or, the number of independent observations) and $\|\cdot\|^2$ is the squared $\ell_2$-norm.

The maximum likelihood (ML) estimator [Eliason, 1993] for the unknown variables can be obtained by maximizing the log-likelihood function with respect to the unknown variables:

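$$(\hat{\nu}, \hat{G}_{ik}(n))_{\mathrm{ML}} = \arg\max_{\nu,\, G_{ik}(n)} L_\eta(\nu, G_{ik}(n)), \tag{8.6}$$

where

$$L_\eta(\nu, G_{ik}(n)) = -\frac{L}{2}\log(2\pi\nu) - \frac{\|\eta\|^2}{2\nu}. \tag{8.7}$$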

By maximizing the log-likelihood function for the unknown background variance $\nu$, one obtains

$$\hat{\nu} = \frac{\|\eta\|^2}{L}. \tag{8.8}$$
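The step from Equation 8.7 to Equation 8.8 is not spelled out. A short worked derivation, assuming $\|\eta\|^2$ is held fixed while differentiating with respect to $\nu$, may help; the second line shows the value the log-likelihood takes at the estimate:

```latex
\begin{align*}
\frac{\partial L_\eta}{\partial \nu}
  &= -\frac{L}{2\nu} + \frac{\|\eta\|^2}{2\nu^2} = 0
  \quad\Longrightarrow\quad
  \hat{\nu} = \frac{\|\eta\|^2}{L}, \\
L_\eta(\hat{\nu}, \cdot)
  &= -\frac{L}{2}\log\!\left(\frac{2\pi\|\eta\|^2}{L}\right) - \frac{L}{2}
   = -\frac{L}{2}\left(\log(2\pi) + \log\|\eta\|^2 - \log L + 1\right).
\end{align*}
```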

A sensible choice for the unknown harmonic amplitudes $G_{ik}(n)$ is to let them correspond to the spectral peaks $S(n p_{ik})$ at the harmonics of the pitch frequency $p_{ik}$. This is sensible because we can expect the musical background level $\eta(f)$ to be relatively small compared to the energy at the harmonics of $p_{ik}$. By substituting the estimates for $G_{ik}(n)$ and $\nu$ and using Equation 8.4, the log-likelihood score becomes

Steffen Pauws


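$$\begin{aligned}
L_\eta(\hat{\nu}, \hat{G}_{ik}(n)) &= -\frac{L}{2}\left(\log(2\pi) + \log(\|\eta\|^2) - \log(L) + 1\right) \\
&= c + \frac{L}{2}\log\frac{\|C\|^2}{\|N\|^2},
\end{aligned} \tag{8.9}$$

where $\|C\|^2$ denotes the energy of the spectrum from Equation 8.2 in which the gains $G_{ik}(n)$ are substituted by their estimates, and $\|N\|^2$ denotes the energy in the musical background spectrum $N(f) = S(f) - C(f)$.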

We consider the musical background only as a random part of the model. Consequently, the likelihood as provided by Equation 8.9 also constitutes the likelihood for finding the chroma in the spectrum. In other words,

$$P(o_t; c_i) \propto \frac{\|C\|^2}{\|N\|^2}. \tag{8.10}$$

The right-hand expression of Equation 8.10 represents the calculation of an element of the chroma spectrum vector corresponding to chroma $c_i$. The chroma spectrum of an observation sequence $O = o_1 \ldots o_T$ with multiple non-overlapping observation vectors is defined as the mean vector of all individual chroma spectra for each $o_t$.
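The energy ratio of Equation 8.10 and the averaging over frames can be sketched as follows. The sketch assumes a discrete spectrum stored as a list of bin values and a precomputed list of bin indices for the harmonics of one chroma. Both `chroma_score` and `mean_chroma_spectrum` are hypothetical helper names, and setting the gains to the spectral values at the harmonic bins follows the choice described in the text.

```python
def chroma_score(spectrum, harmonic_bins):
    """Energy ratio ||C||^2 / ||N||^2 of Eq. 8.10 for one chroma in one frame."""
    bins = set(harmonic_bins)
    # With the gains set to the spectral values at the harmonic bins,
    # C(f) equals S(f) on those bins and is zero elsewhere,
    # so N(f) = S(f) - C(f) is S(f) on all remaining bins.
    energy_c = sum(spectrum[b] ** 2 for b in bins)
    energy_n = sum(v ** 2 for b, v in enumerate(spectrum) if b not in bins)
    return energy_c / energy_n if energy_n > 0 else float("inf")


def mean_chroma_spectrum(frame_chromas):
    """Element-wise mean of per-frame 12-element chroma vectors."""
    n_frames = len(frame_chromas)
    return [sum(frame[i] for frame in frame_chromas) / n_frames for i in range(12)]


if __name__ == "__main__":
    print(chroma_score([0.0, 3.0, 1.0, 4.0, 1.0], [1, 3]))
    print(mean_chroma_spectrum([[1.0] * 12, [3.0] * 12]))
```

Returning infinity when the background energy is zero is a design choice of this sketch; a practical implementation might add a small floor to the denominator instead.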

Implementation. To identify the musical pitches more easily, the subharmonic sum spectrum [Hermes, 1988] is used as the spectral representation in Equation 8.10. All harmonics are resolved (or, folded) to their fundamental by harmonic compression (i.e., by multiplying the frequency scale by an integral factor). In other words,

$$H(f) = \sum_{n=1}^{N} h^{n-1}\, W(nf)\, A(nf), \tag{8.11}$$

where $N$ is the number of harmonics, $h \le 1$ is a factor controlling the contribution of each harmonic to its fundamental, $W(\cdot)$ is an auditory sensitivity filter, and $A(\cdot)$ is the amplitude spectrum.
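A minimal sketch of Equation 8.11 on a linear grid of FFT bins is given below. It assumes that bin `b` corresponds to frequency `b` times the bin spacing, so that the $n$-th harmonic of bin `b` is bin `n * b`. The function name `subharmonic_sum` and the flat default weighting are assumptions for this illustration; the default values of `n_harmonics` and `h` are the ones quoted in the implementation notes.

```python
def subharmonic_sum(amplitude, n_harmonics=15, h=0.75, weight=None):
    """H[b] = sum over n = 1..N of h**(n-1) * W[n*b] * A[n*b] (Eq. 8.11)."""
    n_bins = len(amplitude)
    if weight is None:
        weight = [1.0] * n_bins  # flat W: an assumption for this sketch
    out = [0.0] * n_bins
    for b in range(1, n_bins):  # skip the DC bin
        for n in range(1, n_harmonics + 1):
            idx = n * b
            if idx >= n_bins:
                break  # harmonic lies beyond the available spectrum
            out[b] += h ** (n - 1) * weight[idx] * amplitude[idx]
    return out


if __name__ == "__main__":
    # A tone with fundamental at bin 3 and harmonics at bins 6, 9 and 12.
    amp = [0.0] * 13
    for b in (3, 6, 9, 12):
        amp[b] = 1.0
    summed = subharmonic_sum(amp)
    print(summed.index(max(summed)), summed[3])
```

In the example the summed spectrum peaks at the fundamental bin, because only there do all four harmonics line up.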

In the calculation of Equation 8.11, the following properties are implemented to reduce computing time and increase frequency resolution.

1. The input music signal is partitioned into non-overlapping time frames of 100 milliseconds.

2. The signal is low-pass filtered and downsampled to cut off spectral content above 5 kHz for performing harmonic compression over 6 octaves. It is assumed that harmonics above 5 kHz do not contribute significantly to the pitches below 5 kHz, though we might miss highly tuned instruments. A Hamming-windowed 1024-point FFT is used to compute the amplitude spectrum.

3. Spectral components (i.e., the peaks) are enhanced to cancel out spurious peaks that do not contribute to pitches.

4. Only a limited number of harmonically compressed spectra are added. We use $N = 15$. Spectral components at higher frequencies contribute less to pitch than spectral components at lower frequencies. We use $h = 0.75$.

5. Harmonic compression on a linear frequency scale is implemented as a harmonic shift on the logarithmic frequency scale by using $s = \log_2 f$. Instead of Equation 8.11, we use

$$H(s) = \sum_{n=1}^{N} h^{n-1}\, W(s + \log_2 n)\, A(s + \log_2 n). \tag{8.12}$$

To achieve a higher frequency resolution, interpolation is used on the logarithmic frequency scale. In total, 171 (1024/6) points per octave are interpolated over 6 octaves by a cubic spline method. If we used the original frequency resolution provided by the FFT, we would get a frequency resolution of 9.77 Hz (10,000/1024). The lowest octave A0-A1 that we consider has a frequency range of 27.5 Hz, so this resolution would be far too low, as it covers about 35% of this octave.

6. A weighting function is used to model the human auditory sensitivity for frequencies below 1250 Hz. We use a raised arc-tangent function.
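The log-frequency variant in Equation 8.12 can be sketched as a resampling step followed by a shift-and-add. Several simplifications are assumed here and are not taken from the text: linear interpolation stands in for the cubic spline, the weighting $W$ is set to 1, the grid starts at A0 (27.5 Hz), and each shift $\log_2 n$ is rounded to the nearest grid point. The function names are hypothetical.

```python
import math


def log_spectrum(amplitude, bin_hz, f_min=27.5, n_octaves=6, points_per_octave=171):
    """Resample a linear-frequency amplitude spectrum onto a log2 grid.

    Linear interpolation is used here in place of the cubic spline.
    """
    out = []
    for j in range(n_octaves * points_per_octave):
        f = f_min * 2.0 ** (j / points_per_octave)
        pos = f / bin_hz  # fractional FFT bin position
        lo = int(pos)
        if lo + 1 >= len(amplitude):
            out.append(0.0)  # beyond the available spectrum
            continue
        frac = pos - lo
        out.append((1.0 - frac) * amplitude[lo] + frac * amplitude[lo + 1])
    return out


def shift_and_add(log_amp, n_harmonics=15, h=0.75, points_per_octave=171):
    """Eq. 8.12 with W = 1: H[j] = sum over n of h**(n-1) * A[j + shift_n]."""
    size = len(log_amp)
    out = [0.0] * size
    for n in range(1, n_harmonics + 1):
        # A frequency ratio of n is a constant offset of log2(n) octaves.
        shift = round(points_per_octave * math.log2(n))
        w = h ** (n - 1)
        for j in range(size - shift):
            out[j] += w * log_amp[j + shift]
    return out


if __name__ == "__main__":
    bin_hz = 10000.0 / 1024.0  # about 9.77 Hz, as computed in the text
    print(round(bin_hz, 2), round(100.0 * bin_hz / 27.5))  # resolution, % of A0-A1
    grid = log_spectrum([float(i) for i in range(513)], bin_hz)
    print(len(grid))
```

On the logarithmic grid every harmonic number maps to one fixed offset, so the summation reduces to adding shifted copies of the same array.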

The chroma spectrum for each frame is computed by locating and remapping the spectral regions in the harmonically compressed spectrum that correspond to each chroma in A440 equal temperament. For the chroma C, this comes down to the spectral regions centered around the pitch frequencies for C1 (32.7 Hz), C2 (65.4 Hz), C3 (130.8 Hz), C4 (261.6 Hz), C5 (523.3 Hz) and C6 (1046.5 Hz). The width of each spectral region is taken as half a semitone around this center to reduce the effects of 'slightly mistuned' pitches. The amplitudes in all spectral regions are combined to form one chroma region.
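The center frequencies listed above can be reproduced with a short sketch. It assumes the standard A440 equal-temperament relation and reads "half a semitone around this center" as a quarter of a semitone on either side; that reading, the function name `chroma_regions`, and the 0-based chroma index (0 = C) are assumptions of this example.

```python
A4_HZ = 440.0  # reference tuning (A440)


def chroma_regions(chroma, octaves=range(1, 7), half_width_semitones=0.25):
    """Return (low, center, high) in Hz for one chroma (0 = C) in each octave."""
    regions = []
    for octave in octaves:
        # Semitone number on the common convention C4 = 60, A4 = 69.
        note = 12 * (octave + 1) + chroma
        center = A4_HZ * 2.0 ** ((note - 69) / 12.0)
        low = center * 2.0 ** (-half_width_semitones / 12.0)
        high = center * 2.0 ** (half_width_semitones / 12.0)
        regions.append((low, center, high))
    return regions


if __name__ == "__main__":
    for low, center, high in chroma_regions(0):
        print(f"{low:8.1f} {center:8.1f} {high:8.1f}")
```

The amplitudes falling between `low` and `high` in each octave would then be combined into the single region for that chroma.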

The chroma spectrum elements are computed using Equation 8.10. Adding and averaging the chroma spectra over all frames results in a chroma spectrum for the complete music sample (see Figures 8.1 and 8.2 for an example).

From the document Intelligent Algorithms in Ambient and Biomedical Computing (pages 136-139)