HAL Id: hal-01186433
https://hal.archives-ouvertes.fr/hal-01186433
Submitted on 28 Aug 2015
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of
sci-entific research documents, whether they are
pub-lished or not. The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est
destinée au dépôt et à la diffusion de documents
scientifiques de niveau recherche, publiés ou non,
émanant des établissements d’enseignement et de
recherche français ou étrangers, des laboratoires
publics ou privés.
Overview of the 2015 Workshop on Speech, Language
and Audio in Multimedia
Guillaume Gravier, Gareth J. F. Jones, Martha Larson, Roeland Ordelman
To cite this version:
Guillaume Gravier, Gareth J. F. Jones, Martha Larson, Roeland Ordelman. Overview of the 2015
Workshop on Speech, Language and Audio in Multimedia. ACM International Conference on
Multi-media, 2015, Brisbane, Australia. �10.1145/2733373.2806414�. �hal-01186433�
Overview of the 2015 Workshop on Speech, Language and
Audio in Multimedia
Guillaume Gravier
CNRS IRISA & Inria Rennes
guig@irisa.fr
Gareth J.F. Jones
Dublin City University School of Computing
gareth.jones@computing.dcu.ie
Martha Larson
Delft Univ. of Technology Multimedia Computing Group
m.a.larson@tudelft.nl
Roeland Ordelman
Univ. of Twente & Netherlands Institute for Sound and Vision
roeland.ordelman@utwente.nl
ABSTRACT
The Workshop on Speech, Language and Audio in Multime-dia (SLAM) positions itself at at the crossroad of multiple scientific fields—music and audio processing, speech process-ing, natural language processing and multimedia—to dis-cuss and stimulate research results, projects, datasets and benchmarks initiatives where audio, speech and language are applied to multimedia data. While the first two edi-tions were collocated with major speech events, SLAM’15 is deeply rooted in the multimedia community, opening up to computer vision and multimodal fusion. To this end, the workshop emphasizes video hyperlinking as an showcase where computer vision meets speech and language. Such techniques provide a powerful illustration of how multime-dia technologies incorporating speech, language and audio can make multimedia content collections better accessible, and thereby more useful, to users.
Categories and Subject Descriptors
H.3.1 [Information Systems]: Information Storage and Retrieval—Content Analysis and Indexing
Keywords
Multimedia, Audio, Speech, Language
1.
INTRODUCTION
The workshop on Speech, Language and Audio in Multi-media (SLAM) aims at bringing together researchers work-ing in the fields of speech, language and audio processwork-ing with application to the analysis of, indexing of and access to multimedia data. In the context of SLAM, we define multimedia data as content captured, and often also edited, to create a message intended for humans, often combining
MM’15October 26-30, 2015, Brisbane, Australia .
multiple modalities. Typical data containing audio or lan-guage and fitting this definition are broadcast videos, video lectures, music and clips, or social media data. Note that in most cases, speech, language and audio are important carriers of semantics in multimedia content. In particular, language is of utmost importance for understanding the na-ture of the message. Yet these sources of information are often overlooked, or remain poorly exploited by researchers working to develop new multimedia technologies.
SLAM is by nature interdisciplinary, existing at the in-tersection of multiple scientific communities, including the research area of multimedia. Apart from traditional work-shop objectives of sharing scientific results and visions, one of the goal of SLAM is to establish and make visible a cross-domain scientific community. To this end, SLAM gathers players from the fields of speech and audio processing, of natural language processing and of multimedia to share re-cent research results, discuss ongoing and future projects, explore potential areas for interdisciplinary collaboration or sharing or ideas, and develop new benchmarking initiatives of mutual interest to multimedia and language researchers.
SLAM’15 is the third edition of the workshop. The work-shop series was initiated in 2013 by the Intl. Speech Com-munication Association (ISCA) Special Interest Group on Speech and Language in Multimedia1, with support from the IEEE Special Interest Group on Audio and Speech Pro-cessing in Multimedia2. Previous editions [1, 6] were collo-cated with Interspeech, the premier international conference on speech communication yearly organized by ISCA. How-ever, the long-term goal is to establish SLAM as a regular workshop, collocating in alternance with major multimedia conferences and major speech and language conferences, as a bridge between these domains.
Organizing SLAM’15 as an ACM Multimedia event is thus in logical continuation from the preceding editions. Previ-ous efforts to highlight the importance of SLAM-related top-ics at ACM Multimedia include the workshop on Searching Spontaneously Conversational Speech [5] and the workshop on Audio and Multimedia Methods for Large-Scale Video Analysis [4]. SLAM transcends these efforts because it em-phasizes language and audio in addition to speech, goes be-yond the application of search and the large-scale issues, and 1http://slim-sig.irisa.fr
2
also puts an explicit focus on work in which speech, language or audio is one or multiple modalities.
In order to emphasize its connection with the multimedia community, SLAM’15 features a special focus on video hy-perlinking, a multimedia task that was recently introduced in international benchmarks, where computer vision and im-age processing meets speech, languim-age and audio. Details on the video hyperlinking session are given in Sec. 4.
2.
WORKSHOP SCOPE AND GOALS
Multimedia data are available in enormous volumes in a wide variety of formats and qualities, from professional con-tent to user-generated ones: Lectures, meetings, interviews, debates, conversational broadcast, podcasts, social videos on the Web, etc. A large number of those sources include audio, speech and language, which are considered as key modalities for structuring and understanding, e.g., speaker turns, topic segmentation, entities and keywords. Analyz-ing audio, speech and language in multimedia data raises specific challenges that arise both from the nature of the data (e.g., semantics, variety, variability, multimodality) and from the typical use-cases considered in multimedia appli-cations. Typical challenges specific to the nature of multi-media data include robustness in face of high variability in quality, efficiency to handle very large amounts of data, se-mantics shared across modalities, or potentially high error rates in transcription affecting spoken language processing. An important set of challenges also arises from the applica-tion scenarios in which users make use of multimedia. User needs evolve rapidly, often during the experience of inter-acting with multimedia data, and users have a difficult time expressing their own needs concretely, or formulating them explicitly in a way that can be directly exploited by a sys-tem. All of these factors may also make it difficult to create the large and representative data sets needed in order to develop solutions to multimedia challenges.
Worldwide, several national and international research pro-jects are focusing on audio analysis of multimedia data. Sim-ilarly, various benchmark initiatives have devoted effort to offering tasks related to multimodal multimedia challenges (e.g., TRECVid, CLEF, MediaEval). The aim of the work-shop is to support these efforts by drawing attention to their importance, and to the variety of interdisciplinary solutions that are needed in order to address current challenges in multimedia.
3.
TOPICS
The workshop embraces a broad range of contributions including research work, project descriptions, evaluation ini-tiatives, demonstrations and applications. The common de-nominator is that each emphasizes the added value of speech and/or language and/or audio for multimedia technology.
This year the workshop includes papers that address a wide variety of topics. Examples include: the exploitation of the structure or music, the challenge processing of spoken clinical reports, in a way that combines speech recognition and human checks, the use of multimodal data in training systems that are capable of detecting spoken words, and fi-nally, how audio information can be integrated into an over-all visual object retrieval system. A large portion of the workshop is devoted to the showcase of the task of video hyperlinking, which we turn to next.
4.
VIDEO HYPERLINKING CASE STUDY
A key feature of SLAM’15 is the focus on video hyper-linking, with a dedicated session and round-table. Video hyperlinking is the task of organizing a collection of videos with (hyper)links from segments of interest, also know as anchors, to relevant segments within the collection. The video hyperlinking task was recently introduced in the Me-diaEval international benchmarking initiative [3, 2], moving to TRECVid in 2015.
Interestingly in the context of SLAM, accurate video hy-perlinking approaches require the integration of multiple modalities, in particular image and language, where lan-guage sources are typically automatic transcripts, subtitles or program synopsis. The multimodal nature of the video hyperlinking task makes it an emblematic case study where the speech and language modalities are perfectly comple-mented by audio and vision. The workshop program fea-tures a number of contributions where audio and natural language processing are used for video hyperlinking, possi-bly in conjunction with images. A panel discussion focused on discussing the past, present and future of hyperlinking will conclude the workshop. The panel will aim at an un-derstanding of which approaches are most promising and how they can be evaluated. The goal is to shape research directions at the crossroad of the scientific communities in-volved in SLAM and nurturing future implementations of video hyperlinking benchmarks.
5.
ACKNOWLEDGEMENTS
We would like to thank the authors, presenters, and re-viewers, whose efforts made the workshop possible.
6.
REFERENCES
[1] F. Bechet and G. Gravier, editors. Proceedings of the 1st IEEE/ISCA International Workshop on Speech, Language and Audio in Multimedia, 2013.
[2] M. Eskevich, R. Aly, D. N. Racca, R. Ordelman, S. Chen, and G. J. F. Jones. The Search and Hyperlinking task at MediaEval 2014. In Working Notes Proc. of the MediaEval Workshop, 2014. [3] M. Eskevich, G. J. Jones, R. Aly, R. J. Ordelman,
S. Chen, D. Nadeem, C. Guinaudeau, G. Gravier, P. S´ebillot, T. de Nies, P. Debevere, R. Van de Walle, P. Galuscakova, P. Pecina, and M. Larson. Multimedia information seeking through search and hyperlinking. In Proc. ACM Intl. Conf. on Multimedia Retrieval, pages 287–294, 2013.
[4] G. Friedland, D. Ellis, and F. Metze, editors.
Proceedings of the 2012 ACM International Workshop on Audio and Multimedia Methods for Large-scale Video Analysis, 2012.
[5] M. Larson, R. Ordelman, F. de Jong, J. K¨ohler and W. Kraaij, editors. Proceedings of the 3rd Workshop on Searching Spontaneous Conversational Speech, 2009. [6] T.-P. Tan, G. Gravier, and K. M. Siti, editors.
Proceedings of the 2nd IEEE/ISCA International Workshop on Speech, Language and Audio in Multimedia, 2014.