
DEEP-RHYTHM FOR TEMPO ESTIMATION AND RHYTHM PATTERN RECOGNITION



HAL Id: hal-02457638

https://hal.archives-ouvertes.fr/hal-02457638

Submitted on 28 Jan 2020


To cite this version: Hadrien Foroughmand, Geoffroy Peeters. Deep-Rhythm for Tempo Estimation and Rhythm Pattern Recognition. 20th International Society for Music Information Retrieval Conference (ISMIR), Nov 2019, Delft, Netherlands. hal-02457638


Hadrien Foroughmand

IRCAM Lab - CNRS - Sorbonne Université

hadrien.foroughmand@ircam.fr

Geoffroy Peeters

LTCI - Télécom Paris - Institut Polytechnique de Paris

geoffroy.peeters@telecom-paris.fr

ABSTRACT

It has been shown that the harmonic series at the tempo frequency of the onset-strength-function of an audio signal accurately describes its rhythm pattern and can be used to perform tempo or rhythm pattern estimation. Recently, in the case of multi-pitch estimation, the depth of the input layer of a convolutional network has been used to represent the harmonic series of pitch candidates. We use a similar idea here to represent the harmonic series of tempo candidates. We propose the Harmonic-Constant-Q-Modulation, which represents, using a 4D-tensor, the harmonic series of modulation frequencies (considered as tempo frequencies) in several acoustic frequency bands over time. This representation is used as input to a convolutional network which is trained to estimate tempo or rhythm pattern classes. Using a large number of datasets, we evaluate the performance of our approach and compare it with previous approaches. We show that it slightly increases Accuracy1 for tempo estimation but not the average mean-Recall for rhythm pattern recognition.

1. INTRODUCTION

Tempo is one of the most important perceptual elements of music. Today numerous applications rely on tempo information (recommendation, playlist generation, synchronization, DJ-ing, audio or audio/video editing, beat-synchronous analysis). It is therefore crucial to develop algorithms to correctly estimate it. The automatic estimation of tempo from an audio signal was one of the first research topics addressed in Music Information Retrieval (MIR) [11]. Twenty-five years later, it is still a very active research subject in MIR. This is due to the fact that tempo estimation is still an unsolved problem (outside the prototypical cases of pop or techno music) and that recent deep-learning approaches [3, 37] bring new perspectives to it.

Tempo is usually defined as (and annotated as) the rate at which people tap their foot or their hand when listening to a music piece. Several people can therefore perceive different tempi for the same piece of music. This is due to the hierarchy of the metrical structure in music (to deal with this ambiguity, the research community has proposed to consider octave errors as correct) and due to the fact that, without the cultural knowledge of the rhythm pattern(s) being played, it can be difficult to perceive "the" tempo (or even "a" tempo). This last point, of course, opens the door to data-driven approaches, which can learn the specificities of the patterns. In this work, we do not deal with the inherent ambiguity of tempo and consider the values provided by annotated datasets as ground-truth. The method we propose here belongs to the data-driven systems in the sense that we learn from the data. It also considers both the tempo and the rhythm pattern in interaction by adequately modeling the audio content. The tempo of a track can of course vary along time, but in this work we focus on the estimation of constant (global) tempi and rhythm patterns.

In the following section, we summarize works related to tempo and rhythm pattern estimation from audio. We refer the reader to [12, 32, 44] for more detailed overviews.

1.1 Related works

Tempo estimation. Early MIR systems encoded domain knowledge (audio, auditory perception and musical knowledge) by hand-crafting signal processing and statistical models (hidden Markov models, dynamic Bayesian networks). Data were at most used to manually tune some parameters (such as filter frequencies or transition probabilities). Early techniques for beat tracking and/or tempo estimation belong to this category. Their overall flowchart is a multi-band separation combined with an onset strength function whose periodicity is measured. For example, Scheirer [36] proposes the use of band-pass filters combined with resonating comb filters and peak picking; Klapuri [18] uses a resonating comb filter bank driven by band-wise accent signals, the main extension being the tracking of multiple metrical levels; Gainza and Coyle [8] propose a hybrid multiband decomposition where the periodicities of onset functions are tracked in several frequency bands using autocorrelation and then weighted.

Since the pioneering work of [11], many audio datasets have been annotated with tempo. This has encouraged researchers to develop data-driven systems based on machine learning and deep learning techniques. Such machine-learning models include K-Nearest-Neighbors (KNN) [40], Gaussian Mixture Models (GMM) [33, 43], Support Vector Machines (SVM) [4, 9, 34], bags of classifiers [20], Random Forests [38] or, more recently, deep learning models. The first use of deep learning for tempo estimation was proposed by Böck et al. [3], who use a deep Recurrent Neural Network (bi-LSTM) to predict the position of the beats inside the signal. This output is then used as the input of a bank of resonating comb filters to detect the periodicity and thus the tempo. This technique still achieves the best results today in terms of Accuracy2. Recently, Schreiber and Müller [37] proposed a "single-step approach" for tempo estimation using deep convolutional networks. The network design is inspired by the flowchart of handcrafted systems: the first layer is supposed to mimic the extraction of an onset-strength-function. Their system uses Mel-spectrograms as input, and the network is trained to classify the tempo of an audio excerpt into 256 tempo classes (from 30 to 285 BPM); it shows very good results in terms of Class-Accuracy and Accuracy1.

Rhythm pattern recognition. While tempo and rhythm pattern are closely interleaved, the recognition of rhythm patterns has received much less attention. This is probably due to the difficulty of creating datasets annotated in such rhythm patterns (defining the similarity between patterns, outside the trivial identity case, remains a difficult task). To create such a dataset, one may consider the equivalence between a rhythm pattern and the related dance (such as Tango): Ballroom [12], Extended-Ballroom [24] and Greek-dances [15]. Systems to recognize rhythm patterns from the audio are all very different: Foote [7] defines a beat spectrum computed with a similarity matrix of MFCCs, Tzanetakis [42] defines a beat histogram computed from an autocorrelation function, Peeters [32] proposes a harmonic analysis of the rhythm pattern, Holzapfel [15] proposes the use of the scale transform (which allows one to obtain a tempo-invariant representation), and Marchand [23, 25] extends the latter by combining it with the modulation spectrum and adding correlation coefficients between frequency bands. For a recognition task on the Extended-Ballroom and Greek-dances, Marchand can be considered as the state-of-the-art.

1.2 Paper proposal and organization

In this paper we present a new audio representation, the Harmonic-Constant-Q-Modulation (HCQM), to be used as input to a convolutional network for global tempo and rhythm pattern estimation. We name it Deep Rhythm since it uses the depth of the input and a deep network. The paper is organized as follows. In §2, we describe the motivation for (§2.1) and the computation details of (§2.2) the HCQM. We then describe the architecture of the convolutional neural network we used (§2.3) and the training process (§2.4). In §3, we evaluate our proposal for the tasks of global tempo estimation (§3.1) and rhythm pattern recognition (§3.2) and discuss the results.

2. PROPOSED METHOD

2.1 Motivations

From Fourier series, it is known that any periodic signal $x(t)$ with period $T_0$ (of fundamental frequency $f_0 = 1/T_0$) can be represented as a weighted sum of sinusoidal components whose frequencies are the harmonics of $f_0$: $\hat{x}_{f_0,a}(t) = \sum_{h=1}^{H} a_h \sin(2\pi h f_0 t + \phi_h)$.

For the voiced part of speech or for pitched musical instruments, this leads to the so-called "harmonic sinusoidal model" [26, 39] that can be used for audio coding or transformation. This model can also be used to estimate the pitch of a signal [21]: estimating the $f_0$ such that $\hat{x}_{f_0,a}(t) \simeq x(t)$. The values $a_h$ can be estimated by sampling the magnitude of the DFT at the corresponding frequencies: $a_{h,f_0} = |X(h f_0)|$. The vector $a_{f_0} = \{a_{1,f_0}, \dots, a_{H,f_0}\}$ represents the spectral envelope of the signal and is closely related to the timbre of the audio signal, hence to the instrument playing. For this reason, these values are often used for instrument classification [29].
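As a concrete illustration of this sampling step, the following minimal NumPy sketch estimates the amplitudes $a_{h,f_0}$ of a candidate $f_0$ by reading the DFT magnitude at the nearest bins to the harmonics $h f_0$; the function name, the nearest-bin sampling and the toy signal are our own illustrative assumptions, not the authors' code.

```python
import numpy as np

def harmonic_amplitudes(x, sr, f0, harmonics=(1, 2, 3, 4, 5)):
    """Sample the DFT magnitude of x at the harmonics h*f0 (nearest bin)."""
    spectrum = np.abs(np.fft.rfft(x))              # |X(f)| for f in [0, sr/2]
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    bins = [np.argmin(np.abs(freqs - h * f0)) for h in harmonics]
    return spectrum[bins]                          # vector a_{f0} = {a_{h,f0}}

# Example: a sawtooth-like tone at 440 Hz has energy at all harmonics of 440 Hz.
sr = 22050
t = np.arange(0, 1.0, 1.0 / sr)
x = sum(np.sin(2 * np.pi * h * 440 * t) / h for h in range(1, 6))
print(harmonic_amplitudes(x, sr, f0=440.0))
```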

For audio musical rhythm, Peeters [30, 31, 32] proposes to apply such a harmonic analysis to an onset-strength-function. The period $T_0$ is then defined as the duration of a beat. $a_{1,f_0}$ then represents the DFT magnitude at the 4th-note level, $a_{2,f_0}$ at the 8th-note level and $a_{3,f_0}$ at the 8th-note-triplet level, while $a_{\frac{1}{2},f_0}$ represents the binary grouping of the beats and $a_{\frac{1}{3},f_0}$ the ternary one. Peeters considers that the vector $a$ is representative of the specific rhythm and that therefore $a_{f_0}$ represents a specific rhythm played at a specific tempo $f_0$. He proposes the following harmonic series: $h \in \{\frac{1}{4}, \frac{1}{3}, \frac{1}{2}, \frac{2}{3}, \frac{3}{4}, 1, 1.25, 1.33, \dots, 8\}$. Using this, he shows in [32] that, given the tempo $f_0$, the vector $a_{f_0}$ can be used to classify the different rhythm patterns; in [30], that given manually-fixed prototype vectors $a$, it is possible to estimate the tempo $f_0$ (looking for the $f$ such that $a_f \simeq a$); and in [31], that the prototype vectors $a$ can be learned (using simple machine-learning) to achieve the best tempo estimation $f_0$.

The method we propose in this paper is in the continuation of this last work (learning the values $a$ to estimate the tempo or the rhythm class), but we adapt it to the deep learning formalism recently proposed by Bittner et al. [2], where the depth of the input to a convolutional network is used to represent the harmonic series $a_f$. In [2], a constant-Q-transform (time $\tau$ and log-frequency $f$) is expanded to a third dimension which represents the harmonic series $a_f$ of each $f$ (with $h \in \{\frac{1}{2}, 1, 2, 3, 4, 5\}$). When $f = f_0$, $a_f$ will represent the specific harmonic series of the musical instrument (plus an extra value at the $\frac{1}{2}f$ position used to avoid octave errors). When $f \neq f_0$, $a_f$ will represent (almost) random values. In [2], the goal is to estimate the parameters of a filter such that, when multiplied with this third dimension $a_f$, it provides very different values when $f = f_0$ and when $f \neq f_0$. This filter is then convolved over all log-frequencies $f$ and times $\tau$ to estimate the $f_0$'s. The filter is trained using annotated data. In [2], there are actually several such filters; they constitute the first layer of a convolutional network. In practice, in [2], the $a_{h,f}$ are not obtained as $|X(hf)|$ but by stacking in depth several CQTs, each starting at a different minimal frequency $h f_{min}$. The representation is denoted Harmonic Constant-Q Transform (HCQT): $X_{hcqt}(f, \tau, h)$.
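To illustrate the HCQT idea of [2] (stacking in depth several CQTs whose minimal frequencies are the harmonics $h f_{min}$), here is a minimal librosa sketch; the parameter values (fmin = 32.7 Hz, 6 octaves of 60 bins each, hop of 256) are plausible settings for a pitch HCQT and should be read as assumptions rather than the exact configuration of [2].

```python
import numpy as np
import librosa

def hcqt(y, sr, fmin=32.7, harmonics=(0.5, 1, 2, 3, 4, 5),
         n_octaves=6, bins_per_octave=60, hop_length=256):
    """Stack CQTs whose minimal frequencies are h * fmin (HCQT of [2])."""
    n_bins = n_octaves * bins_per_octave
    layers = []
    for h in harmonics:
        C = np.abs(librosa.cqt(y, sr=sr, hop_length=hop_length,
                               fmin=h * fmin, n_bins=n_bins,
                               bins_per_octave=bins_per_octave))
        layers.append(C)
    # Resulting shape: (log-frequency f, time tau, harmonic depth h)
    return np.stack(layers, axis=-1)

y = librosa.tone(440.0, sr=22050, duration=10.0)   # toy input signal
X_hcqt = hcqt(y, sr=22050)
print(X_hcqt.shape)
```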

Figure 1. Flowchart of the computation of the Harmonic-Constant-Q-Modulation (HCQM): raw audio $x(t)$ → STFT $X(f, \tau)$ → STFT in frequency bands $X(b, \tau)$ → Onset Strength Function $X_o(b, \tau)$ → HCQT → HCQM $X_{hcqm}(\phi, \tau', b, h)$. See text for details.

2.2 Input representation: the HCQM

As mentioned, our goal here is to adapt the harmonic representation of rhythm proposed in [30, 31, 32] to the deep learning formalism proposed in [2]. For this, the HCQT proposed by [2] is not applied to the audio signal, but to a set of Onset-Strength-Functions (OSF) which represent the rhythm content in several acoustic frequency bands. The OSFs are low-pass signals whose temporal evolution is related to the tempo and the rhythm pattern.

We denote our representation by Harmonic-Constant-Q-Modulation (HCQM). Like the Modulation Spectrum (MS) [1], it represents, using a time/frequency ($\tau'/\phi$) representation, the energy evolution (a low-pass signal) within each frequency band $b$ of a first time/frequency ($\tau/f$) representation. However, while the MS uses two interleaved Short-Time-Fourier-Transforms (STFTs) for this, we use a Constant-Q transform for the second time/frequency representation (in order to obtain a better spectral resolution). Finally, as proposed by [2], we add one extra dimension to represent the content at the harmonics of each frequency $\phi$. We denote it by $X_{hcqm}(\phi, \tau', b, h)$, where $\phi$ are the modulation frequencies (which correspond to the tempo frequencies), $\tau'$ are the times of the CQT frames, $b$ are the acoustic frequency bands and $h$ the harmonic numbers.

Computation. In Figure 1, we indicate the computation flowchart of the HCQM. Given an audio signal $x(t)$, we first compute its STFT, denoted by $X(f, \tau)$. The acoustic frequencies $f$ of the STFT are grouped into logarithmically-spaced acoustic-frequency-bands $b \in [1, B = 8]$. We denote this by $X(b, \tau)$. The goal of this is to reduce the dimensionality while preserving the information of the spectral location of the rhythm events (kick patterns tend to be in low frequencies while hi-hat patterns are in high frequencies). For each band $b$, we then compute an Onset-Strength-Function over time $\tau$, denoted by $X_o(b, \tau)$.

For a specific $b$, we now consider the signal $s_b(\tau) = X_o(b, \tau)$ and perform the analysis of its periodicities over time $\tau$. One possibility would be to compute a time-frequency representation $S_b(\phi, \tau')$ over tempo-frequencies $\phi$ and time frames $\tau'$ and then sample $S_b(\phi, \tau')$ at the positions $h\phi$ with $h \in \{\frac{1}{2}, 1, 2, 3, 4, 5\}$ to obtain $S_b(\phi, \tau', h)$. This is the idea we used in [32]. However, in the present work, we use the idea proposed by [2]: we compute a set of CQTs¹, each one with a different starting frequency $h\phi_{min}$. We set $\phi_{min} = 32.7$. Each of these CQTs gives us $S_{b,h}(\phi, \tau')$ for one value of $h$. Stacking them over $h$ therefore provides us with $S_b(\phi, \tau', h)$. The idea proposed by [2] therefore allows us to mimic the sampling at the $h\phi$ while providing the correct window length to achieve a correct spectral resolution. We finally stack the $S_b(\phi, \tau', h)$ over $b$ to obtain the 4D-tensor $X_{hcqm}(\phi, \tau', b, h)$. The computation parameters (sample rate, hop size, window length and FFT size) are set such that the CQT modulation frequency $\phi$ represents the tempo value in BPM.

¹ For the STFT and CQT we used the librosa library [27].
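The following sketch makes the chain (STFT, log-spaced band grouping, per-band onset strength, periodicity analysis at the tempo candidates and their multiples $h\phi$) concrete. For the last step it implements the simpler variant mentioned above, i.e. directly sampling a periodicity spectrum at $h\phi$, rather than the stacked-CQT computation actually used for the HCQM; the band edges, hop size, tempo grid and onset-strength definition are illustrative assumptions.

```python
import numpy as np
import librosa

BPM_GRID = np.arange(30, 286)          # tempo candidates phi, in BPM (as in the paper)
HARMONICS = (0.5, 1, 2, 3, 4, 5)

def hcqm_like(y, sr=22050, n_fft=2048, hop=256, n_bands=8):
    """Per-band onset strength, then periodicity magnitude at h * phi for each tempo candidate."""
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    # Group the STFT bins into log-spaced acoustic frequency bands b.
    edges = np.logspace(np.log10(40.0), np.log10(sr / 2), n_bands + 1)
    bands = np.stack([S[(freqs >= lo) & (freqs < hi)].sum(axis=0)
                      for lo, hi in zip(edges[:-1], edges[1:])])
    # Simple per-band onset strength: half-wave rectified difference of log energy.
    osf = np.maximum(0.0, np.diff(np.log1p(bands), axis=1))   # shape (band b, frame tau)
    osf = osf - osf.mean(axis=1, keepdims=True)
    t = np.arange(osf.shape[1]) * hop / sr                    # frame times in seconds
    out = np.empty((len(BPM_GRID), n_bands, len(HARMONICS)))
    for i, bpm in enumerate(BPM_GRID):
        for k, h in enumerate(HARMONICS):
            basis = np.exp(-2j * np.pi * (h * bpm / 60.0) * t)  # exponential at h * phi
            out[i, :, k] = np.abs(osf @ basis)                  # periodicity magnitude per band
    return out   # shape (tempo phi, band b, harmonic h); one global time frame here

# 120 BPM click track (one click every 0.5 s).
y = librosa.clicks(times=np.arange(0, 10, 0.5), sr=22050)
X = hcqm_like(y)
score = X.sum(axis=(1, 2))
print(X.shape, "top tempo candidates (BPM):", BPM_GRID[np.argsort(score)[-3:]])
# 120 BPM and its integer multiples typically rank among the top candidates.
```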

Illustration. For ease of visualisation (it is difficult to visualize a 4D-tensor), we illustrate the HCQM $X_{hcqm}(\phi, \tau', b, h)$ for a given $\tau'$ (it is then a 3D-tensor). Figure 2 [Left] represents $X_{hcqm}(\phi, b, h)$ for a real audio signal with a tempo of 120 bpm. Each sub-figure represents $X_{hcqm}(\phi, b, h)$ for a different value of $h \in \{\frac{1}{2}, 1, 2, 3, 4, 5\}$. The y-axis and x-axis are the tempo frequency $\phi$ and the acoustic frequency band $b$. The dashed rectangle super-imposed on the sub-figures indicates the slice of values $X_{hcqm}(\phi = 120\ \text{bpm}, b, h)$ which corresponds to the ground-truth tempo. It is this specific pattern over $b$ and $h$ that we want to learn using the filters $W$ of the first layer of our convolutional network.

2.3 Architecture of the Convolutional Neural Network

The architecture of our network is both inspired by the one of [2] (since we perform convolutions over an input spectral representation and use its depth) and by the one of [37] (since we perform a classification). However, it differs in the definition of the input and output.

Input. In [2], the input is the 3D-tensor $X_{cqt}(f, \tau, h)$ and the convolution is done over $f$ and $\tau$ (with filters of depth $H$). In our case, the input could be the 4D-tensor $X_{hcqm}(\phi, \tau', b, h)$ and the convolution could be done over $\phi$, $\tau'$ and $b$ (with filters of depth $H$). However, to simplify the computation, we reduce $X_{hcqm}(\phi, \tau', b, h)$ to a sequence over $\tau'$ of 3D-tensors $X_{hcqm}(\phi, b, h)$². The convolution is then done over $\phi$ and $b$ with filters of depth $H$. Our goal is to learn filters $W$, narrow in $\phi$ and large in $b$, which represent the specific shape of the harmonic content of a rhythm pattern. We do the convolution over $b$ because the same rhythm pattern can be played with instruments transposed in acoustic frequencies.

² Future works will concentrate on performing the convolution directly over the 4D-tensor.

Figure 2. [Left] Example of the HCQM for a real audio signal with tempo 120 bpm, shown for $h \in \{0.5, 1, 2, 3, 4, 5\}$. [Right] Architecture of our CNN: input HCQM slice of shape 240 × 8 × 6, convolutional layers 128×(4,6), 64×(4,6), 64×(4,6), 32×(4,6) and 8×(120,6), Flatten (15360), FC 256, and a softmax over C classes.

Output. The output of the network proposed by [2] is a 2D representation which represents a saliency map of the harmonic content over time. In our case, the outputs are either the $C = 256$ classes of tempo (as in [37], we consider the tempo estimation problem as a classification problem into 256 tempo classes) or the $C = 13$ (for Extended-Ballroom) or $C = 6$ (for Greek-dances) classes of rhythm pattern. To do so, we added at the end of the network proposed by [2] two dense layers, the last one with $C$ units and a softmax activation.

Architecture. In Figure 2 [Right], we indicate the architecture of our network. The input is a 3D-tensor $X_{hcqm}(\phi, b, h)$ for a given time $\tau'$. The first layer is a set of 128 convolutional filters of shape $(\phi = 4, b = 6)$ (with depth $H$). As mentioned, the convolution is done over $\phi$ and $b$. The shape of these filters has been chosen such that they are narrow in tempo frequency $\phi$ (to precisely estimate the tempo) but cover multiple acoustic frequency bands $b$ (because the information relative to the tempo and rhythm covers several bands). As illustrated in Figure 2 [Left], the goal of the filters is to identify the pattern over $b$ and $h$ specific to $\phi =$ tempo frequency.

The first layer is followed by two convolutional layers of 64 filters of shape (4, 6), one layer of 32 filters of shape (4, 6) and one layer of 8 filters of shape (120, 6) (the latter allows the network to track the relationships between the modulation frequencies $\phi$). The output of the last convolutional layer is then flattened and followed by a dropout with $p = 0.5$ (to avoid over-fitting [41]), a fully-connected layer of 256 units and a last fully-connected layer of $C$ units with a softmax activation to perform the classification into $C$ classes. The softmax activation vector is denoted by $y(\tau')$. The loss function to be minimized is the categorical cross-entropy. All layers are preceded by a batch normalization [16]. We used Rectified Linear Units (ReLU) [28] for all convolutional layers, and Exponential Linear Units (ELU) [5] for the first fully-connected layer.
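The architecture just described can be sketched in Keras as follows; the 'same' padding, the exact placement of batch normalization, and the 240-bin tempo axis are our reading of the text and of Figure 2 [Right], so this is an approximate sketch rather than the authors' exact implementation.

```python
from tensorflow.keras import layers, models

def build_deep_rhythm(n_phi=240, n_bands=8, n_harmonics=6, n_classes=256):
    """CNN over an HCQM slice X_hcqm(phi, b, h): convolution over (phi, b), depth h."""
    inputs = layers.Input(shape=(n_phi, n_bands, n_harmonics))
    x = inputs
    # Convolutional stack: 128, 64, 64 and 32 filters of shape (4, 6), then 8 of shape (120, 6).
    for n_filters, kernel in [(128, (4, 6)), (64, (4, 6)), (64, (4, 6)),
                              (32, (4, 6)), (8, (120, 6))]:
        x = layers.BatchNormalization()(x)                 # batch normalization precedes each layer
        x = layers.Conv2D(n_filters, kernel, padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dropout(0.5)(x)                             # dropout p = 0.5
    x = layers.BatchNormalization()(x)
    x = layers.Dense(256, activation="elu")(x)             # first fully-connected layer, ELU
    x = layers.BatchNormalization()(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_deep_rhythm(n_classes=256)   # 256 tempo classes; use 13 or 6 for rhythm patterns
model.summary()
```

With 'same' padding, the output of the last convolutional layer has shape 240 × 8 × 8, so the flattened vector has 15360 values, which matches the label shown in Figure 2 [Right].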

2.4 Training

The inputs of our network are the 3D-tensor HCQM slices $X_{hcqm}(\phi, b, h)$ computed for all times $\tau'$ of the music track. On the other side, the datasets we use for our experiments only provide global tempi³ or global rhythm classes as ground-truths. We therefore have several HCQMs for a given track which are all associated with the same ground-truth (multiple instance learning).

³ It should be noted that this does not always correspond to the reality of the track content, since some tracks have a tempo varying over time (tempo drift), a silent introduction or a break in the middle.

To fix the network hyper-parameters, we split the training set into a train part (90%) and a validation part (10%). We used the ADAM [17] optimizer to find the parameters of the network, with a constant learning rate of 0.001, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$. We used mini-batches of 256 HCQMs with shuffling and a maximum of 20 epochs with early-stopping.
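These optimization settings translate directly into a Keras training call. In the sketch below, only the optimizer, loss, batch size and number of epochs come from the text; the model builder is reused from the previous sketch, the training arrays are random placeholders, and the early-stopping patience is an assumption (the paper only states that early-stopping is used).

```python
import numpy as np
import tensorflow as tf

# Placeholders: X_* are HCQM slices, y_* are one-hot tempo classes (hypothetical data).
X_train = np.random.rand(1024, 240, 8, 6).astype("float32")
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 256, 1024), 256)
X_val = np.random.rand(128, 240, 8, 6).astype("float32")
y_val = tf.keras.utils.to_categorical(np.random.randint(0, 256, 128), 256)

model = build_deep_rhythm(n_classes=256)   # from the previous sketch

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001,
                                     beta_1=0.9, beta_2=0.999, epsilon=1e-8)
model.compile(optimizer=optimizer, loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          batch_size=256, epochs=20, shuffle=True,
          callbacks=[tf.keras.callbacks.EarlyStopping(patience=3,
                                                      restore_best_weights=True)])
```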

2.5 Aggregating decisions over time

Our network provides an estimation of the tempo or rhythm class at each time $\tau'$. To obtain a global value, we aggregate the softmax activation vectors $y(\tau')$ over time by choosing the maximum of the vector $\bar{y}$ computed as the average over $\tau'$ of the $y(\tau')$.
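A minimal sketch of this aggregation step (averaging the per-frame softmax vectors $y(\tau')$ and taking the argmax of the average); variable and function names are illustrative.

```python
import numpy as np

def aggregate_over_time(softmax_frames):
    """softmax_frames: array of shape (n_frames, n_classes) with per-frame softmax vectors y(tau')."""
    y_bar = softmax_frames.mean(axis=0)     # average over tau'
    return int(np.argmax(y_bar))            # global class = argmax of the averaged vector

# Example: three frames voting over 4 classes.
frames = np.array([[0.1, 0.6, 0.2, 0.1],
                   [0.2, 0.5, 0.2, 0.1],
                   [0.3, 0.3, 0.3, 0.1]])
print(aggregate_over_time(frames))   # -> 1
```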

3. EVALUATION OF THE SYSTEM

We evaluate our proposed system for two tasks: global tempo estimation (§3.1) and rhythm pattern recognition (§3.2). We used the same system (same input representation and same network architecture) for the two tasks. However, considering that the class definitions are different, we performed two independent trainings⁴.

⁴ Future works will consider training a single network for the two tasks or applying transfer learning from one task to the other.

3.1 Global tempo estimation

Training and testing sets. To be able to compare our results with previous works, we use the same paradigm (cross-dataset validation⁵) and the same datasets as [37], which we consider here as the state-of-the-art.

⁵ Cross-dataset validation uses separate datasets for training and testing, not only splitting a single dataset into a train part and a test part.

The training set is the union of:

• LMD tempo: a subset of the Lakh MIDI dataset [35] annotated with tempo by [38]; it contains 3611 items of 30 s excerpts of 10 different genres;

• MTG tempo: a subset of the GiantSteps MTG key dataset [6] annotated using a tapping method by [37]; it contains 1159 items of 2 min excerpts of electronic dance music;

• Extended Ballroom [24]: it contains 3826 items of 13 genres (it should be noted that we removed from the Ballroom test-set the items existing in the Extended Ballroom training-set).

The total size of the training set is 8596 items. It covers multiple musical genres to favor generalization.

The test-sets are also the same as in [37] (see [38] for their details): ACM-Mirum [33] (1410 items), ISMIR04 [12] (464 items), Ballroom [12] (698 items), Hainsworth [13] (222 items), GTzan-Rhythm [22] (1000 items), SMC [14] (217 items) and Giantsteps Tempo [19] (664 items). We also added RWC-popular [10] (100 items) for comparison with [3]. As in [37], Combined denotes the union of all test-sets (except RWC-popular).

Evaluation protocol. We measure the performances using the following metrics (a sketch of their computation is given after the list):

• Class-Accuracy: it measures the ability of our system to predict the correct tempo class (in our system we have 256 tempo classes ranging from 30 to 285 bpm);

• Accuracy1: it measures whether the estimated tempo is within ±4% of the ground-truth tempo;

• Accuracy2: the same as Accuracy1 but considering octave errors as correct.
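A minimal NumPy sketch of these three metrics as we read their definitions; the 1-BPM class bins for Class-Accuracy and the set of factors accepted as octave errors for Accuracy2 (2, 3, 1/2, 1/3) are assumptions based on common MIR practice, not statements from the paper.

```python
import numpy as np

def class_accuracy(true_bpm, est_bpm, bpm_min=30, bpm_max=285):
    """Fraction of items whose estimate falls in the same 1-BPM class bin (assumed binning)."""
    t = np.clip(np.round(np.asarray(true_bpm, float)), bpm_min, bpm_max)
    e = np.clip(np.round(np.asarray(est_bpm, float)), bpm_min, bpm_max)
    return float(np.mean(t == e))

def accuracy1(true_bpm, est_bpm, tol=0.04):
    """Estimate within +/- 4% of the ground-truth tempo."""
    t, e = np.asarray(true_bpm, float), np.asarray(est_bpm, float)
    return float(np.mean(np.abs(e - t) <= tol * t))

def accuracy2(true_bpm, est_bpm, tol=0.04, factors=(1/3, 1/2, 1, 2, 3)):
    """Like Accuracy1 but also accepting 'octave' errors (assumed factor set)."""
    t, e = np.asarray(true_bpm, float), np.asarray(est_bpm, float)
    ok = np.zeros(len(t), dtype=bool)
    for k in factors:
        ok |= np.abs(e - k * t) <= tol * k * t
    return float(np.mean(ok))

t, e = np.array([120.0, 100.0, 90.0]), np.array([121.0, 200.0, 70.0])
print(accuracy1(t, e), accuracy2(t, e))   # -> 0.333..., 0.666...
```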

Results and discussions. The results are indicated in Tables 1, 2 and 3.

Validation of B and H. We first show that the separation of the signal into multiple acoustic-frequency-bands $b \in [1, B = 8]$ and the use of the harmonic depth $H$ is beneficial. For this, we compare the results obtained using:
- $B = 8$ and $h \in \{\frac{1}{2}, 1, 2, 3, 4, 5\}$ (column "new");
- $B = 8$ but $h \in \{1\}$ (column "h=1");
- $B = 1$ and $h \in \{\frac{1}{2}, 1, 2, 3, 4, 5\}$ (column "b=1").
Results for the "h=1" and "b=1" variants are only indicated for the "Combined" test-set. For all metrics (Class-Accuracy, Accuracy1 and Accuracy2), the best results are obtained using $B = 8$ and $h \in \{\frac{1}{2}, 1, 2, 3, 4, 5\}$.

Comparison with the state-of-the-art. We now compare our results with the state-of-the-art represented by the three following systems: sch1 denotes the results published in [38], sch2 those in [37] and böck those in [3].

According to these tables, we see that our method allows an improvement for two test-sets: Ballroom in terms of Class-Accuracy (73.8), and Ballroom and Giantsteps in terms of Accuracy1 (92.6 and 83.6) and Accuracy2 (98.7 and 97.9). This can be explained by the fact that the musical genres of these two test-sets are represented in our training set (by Extended-Ballroom and MTG tempo respectively). It should be noted, however, that there is no intersection between the training and test sets. For the ISMIR04 and GTzan test-sets, our results are very close to (but lower than) the ones of sch1. For the RWC-popular test-set, our results (73.0 and 98.0) largely outperform the ones of böck in terms of Accuracy1 and Accuracy2. Finally, if we consider the Combined test-set, our method slightly outperforms the other ones in terms of Accuracy1 (74.4).

The worst results of our method were obtained for the SMC data-set. This can be explained by the fact that the SMC data-set contains rhythm patterns very different from the ones represented in our training-sets.

While our results for Accuracy2 (92.0) are of course higher than our results for Accuracy1 (74.4), they are still lower than the ones of böck (93.6). One reason for this is that our network (like the ones of sch1 and sch2) is trained to discriminate BPM classes; therefore it does not know anything about octave equivalences.

3.2 Rhythm pattern recognition

Training and testing sets. For the rhythm pattern recognition, it is not possible to perform cross-dataset validation (as we did above) because the definition of the rhythm pattern classes is specific to each dataset. Therefore, as in previous works [25], we perform a 10-fold cross-validation using the following datasets:

• Extended Ballroom [24]: it contains 4180 samples of C=13 rhythm classes: 'Foxtrot', 'Salsa', 'Viennesewaltz', 'Tango', 'Rumba', 'Wcswing', 'Quickstep', 'Chacha', 'Slowwaltz', 'Pasodoble', 'Jive', 'Samba', 'Waltz';

• Greek-dances [15]: it contains 180 samples of C=6 rhythm classes: 'Kalamatianos', 'Kontilies', 'Maleviziotis', 'Pentozalis', 'Sousta' and 'Kritikos Syrtos'.

Evaluation protocol. As in previous works [25], we measure the performances using the average mean-Recall $\bar{R}$.⁶ For a given fold $f$, the mean-Recall $R_f$ is the mean over the classes $c$ of the class-recall $R_{f,c}$. The average mean-Recall $\bar{R}$ is then the average over $f$ of the mean-Recall $R_f$. We also indicate the standard deviation over $f$ of the mean-Recall $R_f$ (a sketch of this metric is given below).

⁶ The mean-Recall is not sensitive to the distribution of the classes. [...]
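A small sketch of the average mean-Recall over folds, using scikit-learn's macro-averaged recall as the per-fold mean-Recall; the data layout and function name are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import recall_score

def average_mean_recall(folds):
    """folds: list of (y_true, y_pred) pairs, one per cross-validation fold."""
    per_fold = [recall_score(y_true, y_pred, average="macro")   # mean over the classes
                for y_true, y_pred in folds]
    return float(np.mean(per_fold)), float(np.std(per_fold))

# Toy example with two folds and three classes.
folds = [([0, 0, 1, 2], [0, 1, 1, 2]),
         ([0, 1, 1, 2], [0, 1, 2, 2])]
print(average_mean_recall(folds))
```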

Results and discussions. The results are indicated in Table 4. We compare our results with the ones of [25], denoted as march, considered here representative of the current state-of-the-art.

Our results (76.4 and 68.3) are largely below the ones of march (94.9 and 77.2). This can be explained by the fact that the "scale and shift invariant time/frequency" representation proposed in [25] takes into account the inter-relationships between the frequency bands of the rhythmic events, while our HCQM does not.

To better understand these results, we indicate in Figure 3 [Top] the confusion matrix for the Extended Ballroom. The diagonal represents the Recall $R_c$ of each class $c$. We see that our system is actually suitable to detect the majority of the classes: $R_c \geq 88\%$ for 9 classes out of 13. Chacha, Waltz and Wcswing make $\bar{R}$ completely drop. Waltz is actually estimated 97% of the time as Pasodoble.

Datasets     sch1  sch2  böck  new   h=1   b=1
ACMMirum     38.3  40.6  29.4  38.2   -     -
ISMIR04      37.7  34.1  27.2  29.4   -     -
Ballroom     46.8  67.9  33.8  73.8   -     -
Hainsworth   43.7  43.2  33.8  23.0   -     -
GTzan        38.8  36.9  32.2  27.6   -     -
SMC          14.3  12.4  17.1   7.8   -     -
Giantsteps   53.5  59.8  37.2  25.5   -     -
RWCpop        X     X     X    66.0   -     -
Combined     40.9  44.8  31.2  36.8  29.8  32.4

Table 1. Class-Accuracy (X, -: not reported).

Datasets     sch1  sch2  böck  new   h=1   b=1
ACMMirum     72.3  79.5  74.0  73.3   -     -
ISMIR04      63.4  60.6  55.0  61.2   -     -
Ballroom     64.6  92.0  84.0  92.6   -     -
Hainsworth   65.8  77.0  80.6  73.4   -     -
GTzan        71.0  69.4  69.7  69.7   -     -
SMC          31.8  33.6  44.7  30.9   -     -
Giantsteps   63.1  73.0  58.9  83.6   -     -
RWCpop        X     X    60.0  73.0   -     -
Combined     66.5  74.2  69.5  74.4  64.2  67.9

Table 2. Accuracy1.

Datasets     sch1  sch2  böck  new   h=1   b=1
ACMMirum     97.3  97.4  97.7  96.5   -     -
ISMIR04      92.2  92.2  95.0  87.1   -     -
Ballroom     97.0  98.4  98.7  98.7   -     -
Hainsworth   85.6  84.2  89.2  82.9   -     -
GTzan        93.3  92.6  95.0  89.1   -     -
SMC          55.3  50.2  67.3  50.7   -     -
Giantsteps   88.7  89.3  86.4  97.9   -     -
RWCpop        X     X    95.0  98.0   -     -
Combined     92.2  92.1  93.6  92.0  82.0  88.4

Table 3. Accuracy2.

Figure 3. Confusion matrix (true vs. predicted rhythm class) for the Extended Ballroom. [Top] using $h \in \{0.5, 1, 2, 3, 4, 5\}$. [Bottom] using $h \in \{\frac{1}{4}, \frac{1}{3}, \frac{1}{2}, \frac{2}{3}, \frac{3}{4}, 1, 1.25, 1.33, \dots, 8\}$.

This can be explained by the fact that our current harmonic series $h \in \{\frac{1}{2}, 1, 2, 3, 4, 5\}$ does not represent anything about the 3/4 meter specific to Waltz, which would be represented by $h = \frac{1}{3}$. To verify our assumption, we redo the experiment using this time exactly the same harmonic series as proposed in [32]: $h \in \{\frac{1}{4}, \frac{1}{3}, \frac{1}{2}, \frac{2}{3}, \frac{3}{4}, 1, 1.25, 1.33, \dots, 8\}$. The corresponding confusion matrix is indicated in Figure 3 [Bottom], where Waltz is now perfectly recognized (97%); however, Slowwaltz is now recognized as Waltz in 97% of the cases, which makes the average mean-Recall actually decrease to 74.6% (even though the system is better). The low results for Wcswing can be explained by the (too) small number of training examples (23).

Datasets            march  new
Extended Ballroom   94.9   76.4 (33.1)
Greek-dances        77.2   68.3 (27.5)

Table 4. Average (std) Recall $\bar{R}$ for rhythm classification.

4. CONCLUSION

In this paper we have presented a new approach for global tempo and rhythm pattern classification. We have proposed the Harmonic-Constant-Q-Modulation (HCQM) representation, a 4D-tensor which represents the harmonic series at candidate tempo frequencies of a multi-band Onset-Strength-Function. This HCQM is then used as input to a deep convolutional network. The filters of the first layer of this network are supposed to learn the specific characteristics of the various rhythm patterns. We have evaluated our approach for two classification tasks: global tempo and rhythm pattern classification.

For tempo estimation, our method outperforms previous approaches (in terms of Accuracy1 and Accuracy2) for the Ballroom (ballroom music) and Giantsteps tempo (electronic music) test-sets. Both test-sets represent music genres with a strong focus on rhythm. It seems therefore that our approach works better when rhythm patterns are clearly defined. Our method also performs slightly better (in terms of Accuracy1) for the Combined test-set.

For rhythm classification, our method does not work as well as the state of the art [25]. However, the confusion matrices indicate that our recognition is above 90% for the majority of the classes of the Extended Ballroom. Moreover, we have shown that adapting the harmonic series $h$ can help improve the performances.

Among future works, we would like to study the use of the HCQM 4D tensors directly, to study other harmonic series, and to study the joint training of (or transfer learning between) tempo and rhythm pattern classification.


Acknowledgement: This work was partly supported by European Union’s Horizon 2020 research and innovation program under grant agreement No 761634 (Future Pulse project).

5. REFERENCES

[1] Les Atlas and Shihab A Shamma. Joint acoustic and modulation frequency. Advances in Signal Processing, EURASIP Journal on, 2003:668–675, 2003.

[2] Rachel M Bittner, Brian McFee, Justin Salamon, Peter Li, and Juan Pablo Bello. Deep salience representations for f0 estimation in polyphonic music. In Proc. of ISMIR (International Society for Music Information Retrieval), Suzhou, China, 2017.

[3] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proc. of ISMIR (International Society for Music Information Retrieval), Malaga, Spain, 2015.

[4] Ching-Wei Chen, Markus Cremer, Kyogu Lee, Peter DiMaria, and Ho-Hsiang Wu. Improving perceived tempo estimation by statistical modeling of higher-level musical descriptors. In Audio Engineering Society Convention 126. Audio Engineering Society, 2009.

[5] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.

[6] Angel Faraldo, Sergi Jorda, and Perfecto Herrera. A multi-profile method for key estimation in edm. In AES International Conference on Semantic Audio, Ilmenau, Germany, 2017. Audio Engineering Society.

[7] Jonathan Foote, Matthew L Cooper, and Unjung Nam. Audio retrieval by rhythmic similarity. In Proc. of ISMIR (International Society for Music Information Retrieval), Paris, France, 2002.

[8] Mikel Gainza and Eugene Coyle. Tempo detection using a hybrid multiband approach. Audio, Speech and Language Processing, IEEE Transactions on, 19(1):57–68, 2011.

[9] Aggelos Gkiokas, Vassilios Katsouros, and George Carayannis. Reducing tempo octave errors by periodicity vector coding and SVM learning. In Proc. of ISMIR (International Society for Music Information Retrieval), Porto, Portugal, 2012.

[10] Masataka Goto, Hiroki Hashiguchi, Takuichi Nishimura, and Ryuichi Oka. RWC music database: Popular, classical and jazz music databases. In Proc. of ISMIR (International Society for Music Information Retrieval), Paris, France, 2002.

[11] Masataka Goto and Yoichi Muraoka. A beat tracking system for acoustic signals of music. In Proceedings of the second ACM international conference on Multimedia, San Francisco, California, USA, 1994.

[12] Fabien Gouyon, Anssi Klapuri, Simon Dixon, Miguel Alonso, George Tzanetakis, Christian Uhle, and Pedro Cano. An experimental comparison of audio tempo induction algorithms. Audio, Speech and Language Processing, IEEE Transactions on, 14(5):1832–1844, 2006.

[13] Stephen Webley Hainsworth. Techniques for the Automated Analysis of Musical Audio. PhD thesis, University of Cambridge, UK, September 2004.

[14] Andre Holzapfel, Matthew EP Davies, José R Zapata, João Lobato Oliveira, and Fabien Gouyon. Selective sampling for beat tracking evaluation. Audio, Speech and Language Processing, IEEE Transactions on, 20(9):2539–2548, 2012.

[15] André Holzapfel and Yannis Stylianou. Scale transform in rhythmic similarity of music. Audio, Speech and Language Processing, IEEE Transactions on, 19(1):176–185, 2011.

[16] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[17] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[18] Anssi P Klapuri, Antti J Eronen, and Jaakko T Astola. Analysis of the meter of acoustic musical signals. Audio, Speech and Language Processing, IEEE Transactions on, 14(1):342–355, 2006.

[19] Peter Knees, Angel Faraldo, Perfecto Herrera, Richard Vogl, Sebastian Böck, Florian Hörschläger, and Mickael Le Goff. Two data sets for tempo estimation and key detection in electronic dance music annotated from user corrections. In Proc. of ISMIR (International Society for Music Information Retrieval), Malaga, Spain, 2015.

[20] Mark Levy. Improving perceptual tempo estimation with crowd-sourced annotations. In Proc. of ISMIR (International Society for Music Information Retrieval), Miami, Florida, USA, 2011.

[21] Robert C Maher and James W Beauchamp. Fundamental frequency estimation of musical signals using a two-way mismatch procedure. JASA (Journal of the Acoustical Society of America), 95(4):2254–2263, 1994.

[22] Ugo Marchand, Quentin Fresnel, and Geoffroy Peeters. GTZAN-Rhythm: Extending the GTZAN test-set with beat, downbeat and swing annotations. In Late-Breaking/Demo Session of ISMIR (International Society for Music Information Retrieval), Malaga, Spain, October 2015.

[23] Ugo Marchand and Geoffroy Peeters. The modulation scale spectrum and its application to rhythm-content description. In Proc. of DAFx (International Conference on Digital Audio Effects), Erlangen, Germany, 2014.

[24] Ugo Marchand and Geoffroy Peeters. The extended ballroom dataset. In Late-Breaking/Demo Session of ISMIR (International Society for Music Information Retrieval), New York, USA, August 2016.

[25] Ugo Marchand and Geoffroy Peeters. Scale and shift invariant time/frequency representation using auditory statistics: Application to rhythm description. In 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2016.

[26] R McAuley and T Quatieri. Speech analysis/synthesis based on a sinusoidal representation. Acoustics, Speech and Signal Processing, IEEE Transactions on, 34:744–754, 1986.

[27] Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in python. In Proceedings of the 14th python in science conference, Austin, Texas, 2015.

[28] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proc. of ICML (International Conference on Machine Learning), Haifa, Israel, 2010.

[29] Geoffroy Peeters. A large set of audio features for sound description (similarity and classification) in the CUIDADO project. CUIDADO project report, Ircam, 2004.

[30] Geoffroy Peeters. Template-based estimation of time-varying tempo. Advances in Signal Processing, EURASIP Journal on, 2007(1):067215, 2006.

[31] Geoffroy Peeters. Template-based estimation of tempo: using unsupervised or supervised learning to create better spectral templates. In Proc. of DAFx (International Conference on Digital Audio Effects), pages 209–212, Graz, Austria, September 2010.

[32] Geoffroy Peeters. Spectral and temporal periodicity representations of rhythm for the automatic classification of music audio signal. Audio, Speech and Language Processing, IEEE Transactions on, 19(5):1242–1252, 2011.

[33] Geoffroy Peeters and Joachim Flocon-Cholet. Perceptual tempo estimation using GMM-regression. In Proceedings of the second international ACM workshop on Music information retrieval with user-centered and multimodal strategies, Nara, Japan, 2012.

[34] Graham Percival and George Tzanetakis. Streamlined tempo estimation based on autocorrelation and cross-correlation with pulses. Audio, Speech, and Language Processing, IEEE/ACM Transactions on, 22(12):1765–1776, 2014.

[35] Colin Raffel. Learning-based methods for comparing sequences, with applications to audio-to-midi alignment and matching. PhD thesis, Columbia University, 2016.

[36] Eric D Scheirer. Tempo and beat analysis of acoustic musical signals. JASA (Journal of the Acoustical Society of America), 103(1):588–601, 1998.

[37] Hendrik Schreiber and Meinard Müller. A single-step approach to musical tempo estimation using a convolutional neural network. In Proc. of ISMIR (International Society for Music Information Retrieval), Paris, France, 2018.

[38] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning. In Proc. of ISMIR (International Society for Music Information Retrieval), Suzhou, China, 2017.

[39] Xavier Serra and Julius Smith. Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition. Computer Music Journal, 14(4):12–24, 1990.

[40] Klaus Seyerlehner, Gerhard Widmer, and Dominik Schnitzer. From rhythm patterns to perceived tempo. In Proc. of ISMIR (International Society for Music Information Retrieval), Vienna, Austria, 2007.

[41] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[42] George Tzanetakis and Perry Cook. Musical genre classification of audio signals. Speech and Audio Processing, IEEE Transactions on, 10(5):293–302, 2002.

[43] Linxing Xiao, Aibo Tian, Wen Li, and Jie Zhou. Using statistic model to capture the association between timbre and perceived tempo. In Proc. of ISMIR (International Society for Music Information Retrieval), Philadelphia, PA, USA, 2008.

[44] Jose Zapata and Emilia Gómez. Comparative evaluation and combination of audio tempo estimation approaches. In AES 42nd Conference on Semantic Audio, Ilmenau, Germany, 2011.

