• Aucun résultat trouvé

Cross-modal interaction in deep neural networks for audio-visual event classification and localization

N/A
N/A
Protected

Academic year: 2021

Partager "Cross-modal interaction in deep neural networks for audio-visual event classification and localization"

Copied!
243
0
0

Texte intégral

Loading

Figure

Figure 1.1. Model of a biological neural (a) and an artificial neuron, also called perceptron (b)
Figure 3.1. Localization of objects/actions. On one hand, the objects are outlined in boxes and associated with a class (a)
Table 3.3 summarizes the different audio-visual datasets.
Table 4.1. List of class with respective number of subclasses, possible positions in room 1 and 2.
+7

Références

Documents relatifs

The audio modality is therefore as precise as the visual modality for the perception of distances in virtual environments when rendered distances are between 1.5 m and 5

In the context of DOA estimation of audio sources, although the output space is highly structured, most neural network based systems rely on multi-label binary classification on

In Experiment 1, 4.5- and 6-month-old infants’ audio-visual matching ability of native (German) and non-native (French) fluent speech was assessed by presenting auditory and

• a novel training process is introduced, based on a single image (acquired at a reference pose), which includes the fast creation of a dataset using a simulator allowing for

Visual attention mechanisms have also been broadly used in a great number of active vision models to guide fovea shifts [20]. Typically, log-polar or foveated imaging techniques

In our system (see Figures 1 and 2), the global camera is static and controls the translating degrees of freedom of the robot effector to ensure its correct positioning while the

The computational cost is dependent of the input image: a complex image (in the meaning of filter used in the SNN) induced a large number of spikes and the simulation cost is high..

The computational models of visual attention, originally pro- posed as cognitive models of human attention, nowadays are being used as front-ends to numerous vision systems like