Paired supervised learning and unsupervised pretraining of CNN-architecture for violence detection in videos*

Abel Díaz Berenguer¹, Meshia Cédric Oveneke¹, Mitchel Alioscha-Perez¹, and Hichem Sahli¹,²

1 Vrije Universiteit Brussel (VUB), Department of Electronics and Informatics (ETRO), VUB-NPU Joint Audio-Visual Signal Processing (AVSP) Research Lab, Pleinlaan 2, 1050 Brussels, Belgium

2 Interuniversity Microelectronics Centre (IMEC), Kapeldreef 75, 3001 Heverlee, Belgium

{aberengu,mcovenek,maperezg,hsahli}@etrovub.be

Abstract. Recognizing violence in crowded scenes is a major challenge for automatic video surveillance, and there is a growing need for intelligent surveillance systems to strengthen public safety. In this paper we propose an effective approach to recognize violence in crowded videos based on a shallow Convolutional Neural Network (CNN) that is pretrained using an unsupervised layer-wise learning strategy. Afterwards, the pretrained parameters are fine-tuned to extract intermediate frame representations, which are subsequently aggregated via NetVLAD to obtain video representations used to recognize violence in footage. Experimental evaluation shows that our proposal yields very competitive results compared to those reported in the state of the art.

Keywords: Convolutional Neural Networks · feature representations · spatio-temporal aggregation · violence recognition.

1 Introduction

Automatic video analysis for violence recognition has a broad array of applications in different domains. Over the years, several methods have been proposed to recognize violence in videos. However, current methods depend either on hand-crafted techniques that hinder feature generalization, or on data-driven approaches based on deep models pretrained using non-crowded images, which limits their performance when dealing with crowded scenes. Our core contribution is a training strategy that pairs supervised learning with unsupervised pretraining of a CNN architecture, along with temporal pooling aggregation to attain a synthesized sequence representation for violence detection in crowded video footage.

* This work has been partly supported by the INNOVIRIS project “ADVISE – Anomaly Detection in Video Security Footage”, and the VUB-IRMO Joint PhD grant.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 Proposed approach

We propose a data-driven approach that extracts CNN features from each frame with a shallow architecture. These features are locally aggregated to encode discriminative video-level representations, which a classifier then uses to recognize violence in videos. Our framework consists of four components: the Intermediate representation, the Spatio-temporal aggregation, the Context Gating and the Recognition. The video sequence is the input to the Intermediate representation, which captures frame-level descriptors using a CNN. This CNN model is initially pretrained with an unsupervised representation learning strategy [2] to discover nuances directly from crowded scenes. Afterwards, we adopt a generalization of NetVLAD [3] to aggregate the frame-level CNN information while neglecting redundant features. The Spatio-temporal aggregation yields a synthesized representation of the video footage, whose components are subsequently recalibrated with the Context Gating. Finally, the sequence representation is fed into a classifier to recognize whether the input video has violent content.
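As an illustration of the last three components, the following is a minimal NumPy sketch of NetVLAD-style aggregation with optional "ghost" clusters (in the spirit of [3]), followed by a sigmoid context gate and a logistic classifier. It is not the authors' implementation: all function names, dimensions, and the randomly initialized parameters are assumptions made for the example, and random vectors stand in for the frame-level CNN descriptors.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(v, axis=-1, eps=1e-12):
    return v / (np.linalg.norm(v, axis=axis, keepdims=True) + eps)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def netvlad_aggregate(frames, centers, assign_w, assign_b):
    """Aggregate T frame descriptors (T, D) into one video vector (K * D,).

    centers: (K, D) cluster centers.
    assign_w, assign_b: soft-assignment parameters of shape (D, K + G) and
    (K + G,), where the last G columns are ghost clusters that absorb
    uninformative frames and are dropped before aggregation.
    """
    k = centers.shape[0]
    assign = softmax(frames @ assign_w + assign_b, axis=1)  # (T, K + G)
    assign = assign[:, :k]                                   # discard ghost clusters
    # Sum of soft-assigned residuals: sum_t a[t, k] * (x_t - c_k)
    vlad = assign.T @ frames - assign.sum(axis=0)[:, None] * centers  # (K, D)
    vlad = l2_normalize(vlad, axis=1)           # intra-normalization per cluster
    return l2_normalize(vlad.reshape(-1), axis=0)


def context_gating(x, gate_w, gate_b):
    """Recalibrate each component of x with a learned gate in (0, 1)."""
    return sigmoid(x @ gate_w + gate_b) * x


# Hypothetical sizes: 16 frames, 32-dim descriptors, 4 clusters, 1 ghost cluster.
T, D, K, G = 16, 32, 4, 1
rng = np.random.default_rng(0)
frames = rng.normal(size=(T, D))               # stand-in for CNN frame features
centers = rng.normal(size=(K, D))
assign_w = rng.normal(size=(D, K + G)) * 0.1
assign_b = np.zeros(K + G)
gate_w = rng.normal(size=(K * D, K * D)) * 0.01
gate_b = np.zeros(K * D)
clf_w = rng.normal(size=K * D) * 0.1
clf_b = 0.0

video_repr = netvlad_aggregate(frames, centers, assign_w, assign_b)
gated = context_gating(video_repr, gate_w, gate_b)
prob = sigmoid(gated @ clf_w + clf_b)          # untrained "violence" score
print(video_repr.shape, gated.shape, float(prob))
```

In a trained system, the cluster centers, assignment weights, gate and classifier would be learned jointly with the fine-tuned CNN; here they are random, so the printed score carries no meaning.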

3 Experimental results

To evaluate the performance of our proposal we carried out an experimental evaluation on the Violent Flows [1] dataset. The proposed approach, using an unsupervised pretrained CNN and NetVLAD, outperforms the results reported in state-of-the-art studies and achieves an average accuracy of 95.94% ± 2.43 on this challenging dataset.
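A figure of the form "mean ± deviation" is commonly obtained by averaging the accuracy over several evaluation splits. The sketch below shows that arithmetic under the assumption that the reported value is a mean and standard deviation over folds; the text does not state the protocol, and the per-fold values here are made-up placeholders, not the paper's results.

```python
import numpy as np

# Hypothetical per-fold accuracies in percent (placeholders, NOT the paper's numbers).
fold_accuracies = np.array([94.0, 96.0, 98.0, 92.0, 95.0])

mean_acc = fold_accuracies.mean()
# Population standard deviation (ddof=0); whether the paper uses the
# population or sample estimate is an assumption.
std_acc = fold_accuracies.std(ddof=0)

print(f"{mean_acc:.2f}% ± {std_acc:.2f}")
```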

4 Conclusions

In this paper, we have presented an approach to recognize violent crowded scenes. Our proposal is based on frame-level CNN features to attain an intermediate spatial representation, followed by an aggregation technique to obtain a representation of the footage content. Experiments on a publicly available dataset showed that automatic crowded-scene analysis can benefit from data-driven, domain-specific representations, and that very deep CNN structures are not required to achieve state-of-the-art performance.

References

1. Hassner, T., Itcher, Y., Kliper-Gross, O.: Violent flows: Real-time detection of violent crowd behavior. In: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. pp. 1–6 (June 2012). https://doi.org/10.1109/CVPRW.2012.6239348

2. Oveneke, M.C., Aliosha-Perez, M., Zhao, Y., Jiang, D., Sahli, H.: Efficient convolutional auto-encoding via random convexification and frequency-domain minimization. In: NIPS 2016 International Workshop on Efficient Methods for Deep Neural Networks (EMDNN) (2016)

3. Zhong, Y., Arandjelovic, R., Zisserman, A.: Ghostvlad for set-based face recognition. CoRR abs/1810.09951 (2018), http://arxiv.org/abs/1810.09951
