Location models for visual place recognition

In order to incorporate relative spatial information from geometric constraints into observation models, a number of methods have been investigated. For example, the work of [27] incorporates learned distributions of 3D distances between visual words into the generative model in order to increase robustness to perceptual aliasing. In [19], features are quantized in both descriptor and image space: visual features are considered in a pairwise fashion and additionally assigned a spatial word, which describes their relative positions in terms of quantized angles, distances, orientations, and scales. In recent years, graph comparison techniques have become popular in a wide array of recognition tasks, including place recognition. Applied to visual data, graphs of local features are built and used to represent and compare entities such as objects. The work of [36] uses graph matching techniques that allow geometric constraints and the local deformations which often occur in object recognition tasks to be included, by introducing a factorized form for the affinity matrix between two graphs. This approach explicitly solves for the node correspondences of object features. Alternatively, the works of [14] and [5] apply graph kernels to superpixels and point clouds in order to recognize and classify visual data without explicitly solving the node correspondence problem; instead, they provide a similarity metric between graphs by mapping them into a linear space. In the described approaches, graph comparison was applied to relatively small graphs of only tens of nodes because of computational complexity: the random walk and subtree kernels applied in [5, 14] scale at least as O(n^3) with respect to the number of nodes n [35]. Other types of graph kernels have since been proposed which strengthen node labels with additional structural information in order to reduce the relative kernel complexity [6, 29], opening the door to applications on larger graphs. For example, the Weisfeiler-Lehman (WL) graph kernels of [29] scale as O(m) with respect to the number of edges m. Further details regarding graph kernels are discussed in Section 3.2.2. With regard to visual place recognition, graph comparison has been applied in works such as [26, 32], which make use of landmark covisibility to compare locations based on visual word co-occurrence graphs and also scale with the number of edges. The work of [26] demonstrates how the defined similarity measures can be interpreted as simplified random-walk kernels.
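As a concrete illustration of why relabeling-based kernels scale with the number of edges, here is a minimal sketch of the Weisfeiler-Lehman idea: each pass compresses every node's label together with its neighbours' labels into a new label, and two graphs are compared through the dot product of their label histograms. The adjacency/label dictionaries and this particular similarity are illustrative assumptions, not the implementation of [29] or [26].

```python
from collections import Counter

def wl_iteration(adjacency, labels, compressed):
    """One WL pass: a node's new label is its old label plus the sorted
    multiset of its neighbours' labels, compressed to an integer id.
    Each pass touches every edge a constant number of times, hence O(m)."""
    new_labels = {}
    for node, neighbours in adjacency.items():
        signature = (labels[node], tuple(sorted(labels[n] for n in neighbours)))
        if signature not in compressed:
            compressed[signature] = len(compressed)
        new_labels[node] = compressed[signature]
    return new_labels

def wl_kernel(adj_a, labels_a, adj_b, labels_b, iterations=2):
    """Similarity = sum over passes of the dot product of label histograms.
    Initial labels (e.g. visual word ids) must share one vocabulary."""
    score = 0
    for _ in range(iterations + 1):
        hist_a, hist_b = Counter(labels_a.values()), Counter(labels_b.values())
        score += sum(hist_a[label] * hist_b[label] for label in hist_a)
        compressed = {}  # shared between graphs so new ids stay comparable
        labels_a = wl_iteration(adj_a, labels_a, compressed)
        labels_b = wl_iteration(adj_b, labels_b, compressed)
    return score
```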

The time course of visual influences in letter recognition

At the individual level, trials corresponding to uppercase and lowercase letters were averaged separately. At the group level, we relied on a robust estimator of central tendency, the trimmed mean, to assess the differences between upper- and lowercase letters. For each electrode/time frame pair (e, tf), taken independently, amplitudes collected on the group were sorted, and the lowest 20% and the highest 20% of the distribution were trimmed. For each (e, tf) pair, the remaining amplitudes were then averaged. Since it preserves the central part of the distribution, the trimmed mean has been proved to be a robust and useful measure of location (see Wilcox, 2005; Wilcox & Keselman, 2003). Moreover, the trimmed mean has proven its utility in recent electrophysiological studies because of the robustness of this measure to contamination by extreme values (Desjardins & Segalowitz, 2013; Rousselet, Husk, Bennett, & Sekuler, 2008). Inferential results were computed by relying on the Yuen procedure, a robust counterpart to the paired t test, with a threshold fixed at p < .05 (see Wilcox, 2005, 2012). Because statistical tests were performed for every (e, tf) pair, we corrected for multiple comparisons by using a bootstrap-T approach at the cluster level (with p < .01; see Maris & Oostenveld, 2007; Pernet et al., 2011; Rousselet, Gaspar, Wieczorek, & Pernet, 2011).
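A minimal sketch of the per-(electrode, time frame) 20% trimmed mean, assuming the group data sit in a subjects × electrodes × time-frames array; scipy's trim_mean cuts the requested proportion from each tail before averaging. The array shape and variable names are assumptions, not the authors' code.

```python
import numpy as np
from scipy import stats

# Assumed layout: difference (uppercase - lowercase) amplitudes,
# shape (n_subjects, n_electrodes, n_timeframes).
amplitudes = np.random.randn(20, 64, 500)

# For every (e, tf) pair, drop the lowest and highest 20% of subjects
# and average the remaining central 60% of the distribution.
robust_group_mean = stats.trim_mean(amplitudes, proportiontocut=0.2, axis=0)
print(robust_group_mean.shape)  # -> (64, 500)
```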

Resources and Methods for the Automatic Recognition of Place Names in Alsatian

Concerning location detection in historical data specifically, Borin et al. [2] proposed a knowledge- and rule-based approach which aims specifically at handling the variation in 19th century Swedish literature. Their best system reaches an F-measure of 86.4%. For the Arabic language, which also presents high variation, a knowledge- and rule-based approach is described in [19]; the presented system reached an F-measure of 85.9%. All these works consider only one class for the location type. The QUAERO typology, on which our work is based, was used on French old press data [8] and on Swiss old press data in French [6], which also contains a lot of variation. In the latter, the authors compared various systems, and for the location type the results ranged between 48% and 69% F-measure depending on the system tested, which is similar to what we obtained in this work. Most current models for NER treat it as a supervised sequential classification problem where each sentence is a sequence [4, 11, 15]. In order to categorize words, the model can rely on orthographic information, captured by character-based representations, and distributional information, captured by word embeddings. Recently, a method to represent such information, including character-level information, was proposed [1]. This approach is often considered robust against variation, and our hypothesis was that this model may be useful for our purpose.

Place Recognition via 3D Modeling for Personal Activity Lifelog Using Wearable Camera

Thresholding the distance between the camera and the predefined reference point delineates areas defined as: close (manipulation zone), intermediate (approaching) and far (seeing the place, but too far to do instrumental activities). Since the wearable camera is located on the shoulder, just above the arm associated with the dominant hand, we use the camera position directly to evaluate this distance, as we consider that the distance between the camera and a point in the environment is representative of the distance for manipulation. In the current state, only presence in the close zone is considered to trigger a location-based event, although information is available to define additional types of events. The frame-based classification is segmented into event intervals using connected components of consecutive same-class frames. Each frame is therefore associated with zero or one event. Each event corresponds to a temporal interval representing the arrival at, stay in, and exit from a specific place.
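A minimal sketch of turning the per-frame classification into events as connected components of consecutive identical labels; the label names and the rule that only the close zone triggers an event are assumptions based on the description above.

```python
from itertools import groupby

def frames_to_events(frame_labels, trigger_label="close"):
    """Return (first_frame, last_frame, label) intervals for runs of the
    trigger label; every frame thus belongs to zero or one event."""
    events, index = [], 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        if label == trigger_label:
            events.append((index, index + length - 1, label))
        index += length
    return events

print(frames_to_events(["far", "close", "close", "far", "close"]))
# -> [(1, 2, 'close'), (4, 4, 'close')]
```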

Generating Unsupervised Models for Online Long-Term Daily Living Activity Recognition

We use the precision = TP / (TP + FP) and recall = TP / (TP + FN) measures, where TP, FP and FN stand for True Positives, False Positives and False Negatives, respectively. We have compared our approach with the results of the supervised approach in [11], where videos are manually clipped. We also made a comparison with an online supervised approach that follows [11]: we train the classifier on clipped videos and perform the testing using a sliding window. There are more recent approaches, but they are not appropriate for our problem; for example, [12] is adapted to cope with camera motion, and since there is no camera motion in our experiments it does not fit our case. In the online approach, an SVM is trained using the action descriptors extracted from ground-truth intervals. For online testing, the descriptors of a test video are extracted in a sliding window of size W frames with a step size of T frames. At each sliding window interval, the action descriptors of the corresponding interval are extracted and classified using the SVM. The W and T parameters are found during learning. We have also tested different versions of our approach that use (i) only global motion features and (ii) only body motion features. We have randomly selected 3/5 of the videos in both datasets for learning the activity models using global and body motion information, as described in Section 3.4. The remaining videos are used for testing. The codebook size is set to 4000 visual words for all the methods.
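A minimal sketch of the online testing step and of the reported measures, under assumed names: per-frame descriptors are mean-pooled over each window of W frames (step T) and scored by a pre-trained SVM, and precision/recall follow the definitions above. This is only a sketch, not the authors' exact pipeline.

```python
import numpy as np
from sklearn.svm import SVC

def sliding_window_predict(clf: SVC, frame_descriptors: np.ndarray, W=60, T=10):
    """Classify each window [start, start+W) of an untrimmed video.
    frame_descriptors: (n_frames, dim) array; mean pooling is an assumption."""
    predictions = []
    for start in range(0, len(frame_descriptors) - W + 1, T):
        window = frame_descriptors[start:start + W].mean(axis=0, keepdims=True)
        predictions.append((start, start + W, clf.predict(window)[0]))
    return predictions

def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```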

Semantic Event Fusion of Different Visual Modality Concepts for Activity Recognition

Studies on video content retrieval have investigated ways to extend the standard low-level visual feature representations for actions [35] by aggregating other modalities commonly present in video recordings, such as audio and text [27] [29] [17]. In [17], the authors introduce a feature-level representation that models the joint patterns of audio and video features displayed by events. In [29], a multimodal (audio and video) event recognition system is presented, where base classifiers are learned from different subsets of low-level features and then combined with mid-level features, such as object detectors [21], for the recognition of complex events. These studies have shown that by decomposing complex event representation into smaller semantic segments, like actions and objects, inter-segment relations not attainable before can be captured to achieve higher event recognition rates. Nevertheless, these methods only recognize the most salient event in an entire video clip. The task targeted by this paper requires us to precisely segment variable-length spatiotemporal regions along the multimodal recording and accurately classify them into activities.

Probabilistic Place Recognition with Covisibility Maps

Impressive as these systems are, there is still room for improvement in terms of how locations are modeled. Abstraction from single-image location models has been addressed in the work of [2], [3], [11], [12]. Location models built from specific poses in the robot’s trajectory imply that the robot must revisit the same arbitrary pose in order to recognize any relevant loop closures. CAT-SLAM [2] moves towards a continuous representation, but requires local metric information. In [11] and [12], comparisons are made with sequences based on time, under the assumption that the speed remains fairly consistent. The work of [3] dynamically queries location models as cliques from a covisibility graph of landmarks, which are connected if seen together. These location models are then based on the underlying environmental features, rather than the discretization of the robot’s trajectory into individual images or sequences of images in time. This paper describes an appearance-based method in which dynamic virtual locations are retrieved as cliques from a covisibility graph of landmarks, and a Bayesian framework is then used to assess place recognition.
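A minimal sketch of the covisibility idea described above, with assumed data structures: landmarks seen in the same image are connected, and candidate virtual locations are the maximal cliques that share landmarks with the current observation. This stands in for, and is not, the probabilistic machinery of the paper.

```python
import networkx as nx

def build_covisibility_graph(image_to_landmarks):
    """Connect two landmarks whenever they are observed in the same image."""
    graph = nx.Graph()
    for landmarks in image_to_landmarks.values():
        for i, a in enumerate(landmarks):
            for b in landmarks[i + 1:]:
                graph.add_edge(a, b)
    return graph

def query_virtual_locations(graph, observed_landmarks, min_overlap=2):
    """Virtual locations = maximal cliques sharing enough landmarks with the query."""
    observed = set(observed_landmarks)
    return [set(clique) for clique in nx.find_cliques(graph)
            if len(observed & set(clique)) >= min_overlap]
```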

Refining visual activity recognition with semantic reasoning

…according to the scenario. The "opening door" scenario provides the best results: as we can see in Figure 6c, in most cases the robot identifies the activity properly, but it is still confused in one out of two cases. With only a 6.40% successful recognition rate, the robot seems to have difficulty recognizing the "remote controlling" activity. However, as Figure 6b shows, on average the proper activity is recognized in third place, behind the "opening door" and "applauding" activities. This is clearly due to a gesture confusion: all three top activities involve raising an arm, and the difference can thus be tricky to distinguish. The phone scenario has very poor results. As we can see in Figure 6a, it is almost the last activity in the distribution while being the one actually observed. This result underlines the unreliability of vision-based recognition in some cases. We suppose that the robot was not trained enough for this scenario, which emphasizes the importance and the difficulty of learning a proper vision-based activity recognition algorithm. Compared to previous works, it is important to note that the experiments are not based on a given dataset: gestures are varied, unpredictable, and close to real-case scenarios. Through these results, we can clearly see the weaknesses of this approach. We will now study the results obtained after applying the refinement process.

Spatiotemporal Dynamics of Morphological Processing in Visual Word Recognition

Recent work, however, has challenged the view that morphological decomposition only takes place when there is a true semantic relationship between the complex word and its stem. Indeed, a large body of masked priming studies has typically reported that the size of morphological priming for semantically transparent pairs (farmer–FARM) is identical to that of semantically opaque or pseudoaffixed pairs (corner–CORN), and that both conditions are significantly different from nonmorphological orthographic controls (cashew–CASH; e.g., Beyersmann et al., 2015; Beyersmann, Castles, & Coltheart, 2012; Lavric, Elchlepp, & Rastle, 2012; Rastle & Davis, 2008; Meunier & Longtin, 2007). These findings have been taken to suggest that early morphological decomposition is semantically “blind.” However, the form-then-meaning account has not gone uncriticized. Feldman, O'Connor, and Moscoso del Prado Martin (2009) pointed out that these findings rely on a null effect (the absence of a difference between farmer–farm and corner–corn items). When they increased the statistical power by pooling data from different published masked priming experiments into a meta-analysis, they showed that morphological facilitation was significantly greater (+10 msec) for semantically similar (transparent) than for semantically dissimilar (opaque) pairs (Feldman et al., 2009). Also, a recent study showed that semantically similar prime–target pairs can produce greater facilitation than semantically dissimilar pairs even at short prime durations when using the same targets across conditions.

Using Markov Logic Network for On-line Activity Recognition from Non-Visual Home Automation Sensors

This type of environment imposes constraints on the sensors and the technology used for recognition. Indeed, the information provided by the sensors for activity recognition is indirect (no worn sensors for localisation), heterogeneous (from numerical to categorical), transient (no continuous recording), noisy, and non-visual (no camera). This application setting calls for new methods for activity recognition which can deal with the poverty and unreliability of the provided information, process streams of data, and whose models can be checked by humans and linked to domain knowledge. To this end, we present a method based on Markov Logic Networks (MLN) to recognise activities of daily living in a perceptive environment. MLN, a statistical relational method, makes it possible to build logical models that can deal with uncertainty. This method is detailed in Section 4. Before this, a state of the art in MLN-based activity recognition is given in Section 2 and the Sweet-Home project is introduced in Section 3. The method was tested in an experiment in a real smart home involving more than 20 participants; the experiment and the results are described in Section 5. The paper ends with a discussion and gives a short outlook on further work.

Late Fusion of Bayesian and Convolutional Models for Action Recognition

Fig. 2: An action sequence from the CAD-120 [8] dataset (actor 1, video 2305260828, action microwaving-food). From left to right: reach, open, reach, move, place. In blue: human pose detected by OpenPose. In yellow: objects detected by SSD. Once we trained C3D, we retrieve its weights, freeze them…

Training and evaluation of the models for isolated character recognition

Taking into account the results of the NN and of the SVMs respectively, we can state that the recognition performance of the NN is better when we have "enough" samples, whereas the SVM results get higher when the number of samples is low. So, in the first phase, we would like to use an NN in order to place (separate) the image object into one presumed class or another. When we are sufficiently sure that the NN prediction is good, we can stop the recognition process. Otherwise, once we have the results of the NN, we also look at the confusion matrix in order to see which other candidates could be taken into consideration. Having the first p classes (the classes with which the character presumed by the NN could be confused), we pass the result of the NN through these p possible SVMs, trained to recognize these classes. (Remark: looking at the confusion matrix in the NN case, we can notice two things: for each character class there are a few classes with a high confusion score, and many classes with a minimal confusion score which can be ignored. So we think it is sufficient to pass the images through just the first p SVMs, where the confusion is really measurable.) This parameter p will be a global parameter of the combination model.
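A minimal sketch of the described two-stage combination, with assumed model objects: the NN proposes a class, and when its confidence is too low the sample is re-scored by the SVMs associated with the p classes most confused with that proposal according to the confusion matrix. The threshold, p, and the assumption that each SVM is a one-vs-rest classifier for its class are illustrative, not the authors' exact design.

```python
import numpy as np

def combined_predict(nn, svms, confusion, sample, p=3, confidence_threshold=0.9):
    """nn.predict_proba -> (1, n_classes); svms[c]: one-vs-rest SVM for class c;
    confusion[i, j]: how often class i was predicted as class j on validation data."""
    x = sample.reshape(1, -1)
    probs = nn.predict_proba(x)[0]
    nn_class = int(np.argmax(probs))
    if probs[nn_class] >= confidence_threshold:
        return nn_class  # NN is confident enough: stop the recognition here.
    # Classes whose samples are frequently (mis)labelled as the NN proposal.
    candidates = np.argsort(confusion[:, nn_class])[::-1][:p]
    scores = {int(c): svms[int(c)].decision_function(x)[0] for c in candidates}
    return max(scores, key=scores.get)
```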

Bayesian models for visual information retrieval

First, it is based on a universal recognition language (the language of probabilities) that provides a computational basis for the integration of information from mult…

Context-based Visual Feedback Recognition

Hidden Markov Model. As a third baseline, an HMM was trained for each gesture class. Since HMMs are designed for segmented data, we trained each HMM with segmented subsequences in which all frames of a subsequence belong to the same gesture class. This training set contained the same number of frames as the one used for training the three other models, except that frames were grouped into subsequences according to their label. As stated earlier, we tested two configurations of Hidden Markov Models: an HMM evaluated over a sliding window (referred to as HMM in our experiments) and concatenated HMMs (referred to as HMM-C). For the first configuration, each trained HMM is tested separately on the new sequence using a sliding window of fixed size (32 frames). The class label associated with the HMM with the highest likelihood is selected for the frame at the center of the sliding window. For the second configuration, the HMMs trained on subsequences are concatenated into a single HMM whose number of hidden states equals the sum of the hidden states of the individual HMMs. For example, if the recognition problem has two labels (e.g., gesture and other-gesture) and each individual HMM is trained using 3 hidden states, then the concatenated HMM will have 6 hidden states. To estimate the transition matrix of the concatenated HMM, we compute the Viterbi path of each training subsequence, concatenate the subsequences into their original order, and then count the number of transitions between hidden states. The resulting transition matrix is then normalized so that its rows sum to one. At testing, we apply the forward-backward algorithm on the new sequence and then sum, at each frame, the hidden states associated with each class label. The resulting HMM-C can be seen as a generative version of our FHCRF model.
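A minimal sketch of the first configuration (one HMM per gesture class, scored over a 32-frame sliding window), using hmmlearn with assumed feature arrays; the concatenated HMM-C variant and the FHCRF comparison are not reproduced here.

```python
import numpy as np
from hmmlearn import hmm

def train_class_hmms(subsequences_per_class, n_states=3):
    """subsequences_per_class: {label: [(n_frames, n_features) arrays]}."""
    models = {}
    for label, subsequences in subsequences_per_class.items():
        X = np.vstack(subsequences)            # concatenated observations
        lengths = [len(s) for s in subsequences]
        models[label] = hmm.GaussianHMM(n_components=n_states).fit(X, lengths)
    return models

def sliding_window_labels(models, sequence, window=32):
    """Label the centre frame of every window by the maximum-likelihood HMM."""
    labels = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        best = max(models, key=lambda c: models[c].score(segment))
        labels.append((start + window // 2, best))
    return labels
```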

Semi-Supervised Learning for Location Recognition from Wearable Video

A large body of work attempts to extend the "bag of words" model, in particular by adding local discriminative information [25], performing fast location recognition from structure-from-motion point clouds [4], and finding loop closures efficiently in monocular SLAM [11]. Some works also propose to include geometric verification [7] of query results, which was applied in [10] to refine a global "bag of words" image search to the object level. Our approach to image representation relies on BoF visual word histograms, which are known to be used successfully in image recognition applications. While effective, visual word histograms are typically very high-dimensional vectors. It is known that such high-dimensional spaces are very sparse and lead to the well-known "curse of dimensionality" [9], or empty space phenomenon. As we intend to learn from weak supervision, many classical machine learning methods would be prone to overfitting because of the low number of samples in comparison to their dimensionality. This naturally raises the question of leveraging unlabeled images, which is the subject of semi-supervised learning algorithms.

Dynamic reshaping of functional brain networks during visual object recognition

…module during the task). Integration and occurrence were greater for meaningless than for meaningful images. Our findings also revealed that the occurrence within the right frontal and left occipito-temporal regions can help to predict the ability of the brain to rapidly recognize and name visual stimuli. We speculate that these observations are applicable not only to other fast…

Comparison of Visual Registration Approaches of 3D Models for Orthodontics

To validate the two previous steps, tests have been carried out on both virtual and real images. Starting from a "perfect" virtual case, we evaluate the robustness against noise and the increase in performance when using several views. The virtual pictures are obtained as screenshots of the VTK rendering of the 3D models (Fig. 2a), so several points of view can be simulated by controlling the acquisition settings (focal length, view angle, etc.). Such virtual pictures are distortion free, but we can raise and control the noise level on the point coordinates to simulate a real acquisition process. Real pictures are taken with an 8-megapixel digital camera (Canon EOS 350D), a 60 mm lens and an annular flash (Fig. 2a). We use OpenCV for the computation functions such as POSIT [13], and the levmar functions [15] for the implementation of the Levenberg-Marquardt algorithm.
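A minimal sketch of estimating a camera pose from 2D–3D correspondences. The cited work uses OpenCV's POSIT plus the levmar library; here cv2.solvePnP with its iterative (Levenberg-Marquardt based) solver stands in as a present-day equivalent. The intrinsics and point coordinates below are assumed values, not the study's calibration.

```python
import numpy as np
import cv2

# Four coplanar model points (mm) and their assumed projections (px).
object_points = np.array([[0, 0, 0], [40, 0, 0], [40, 40, 0], [0, 40, 0]], dtype=np.float64)
image_points = np.array([[310, 250], [420, 245], [430, 350], [305, 360]], dtype=np.float64)

camera_matrix = np.array([[800.0, 0.0, 320.0],
                          [0.0, 800.0, 240.0],
                          [0.0, 0.0, 1.0]])
dist_coeffs = np.zeros(5)  # virtual views are distortion free

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, camera_matrix,
                              dist_coeffs, flags=cv2.SOLVEPNP_ITERATIVE)
print(ok, rvec.ravel(), tvec.ravel())  # rotation (Rodrigues) and translation
```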

On the usage of visual saliency models for computer generated objects

In this work, different CG contents were displayed on a monitor screen with HD resolution, whose typical visual acuity at the standardized viewing distance is around 60 pixels/degree. The reason for visualizing the contents on a conventional display in this preliminary work is to respect the recommended visual acuity. In fact, when it comes to devices used in immersive experiments (i.e. Head Mounted Displays, HMDs), their fields of view change according to their embedded characteristics, which might lead to a different human perceptual experience. The typical visual acuity at the standardized viewing distance of a common VR HMD device is around 15 pixels/degree for the HTC Vive and 30 pixels/degree for the HTC Vive Pro.
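A small sketch of where figures such as 60 pixels/degree come from: pixels per degree follow from the horizontal resolution, the physical screen width, and the viewing distance. The numbers below are assumed, not the study's exact setup.

```python
import math

def pixels_per_degree(h_resolution, screen_width_cm, viewing_distance_cm):
    """Pixels subtended by one degree of visual angle at the screen centre."""
    cm_per_degree = 2 * viewing_distance_cm * math.tan(math.radians(0.5))
    return h_resolution / screen_width_cm * cm_per_degree

# e.g. a 1920-px-wide, 53 cm monitor viewed from 93 cm gives roughly 60 px/deg.
print(round(pixels_per_degree(1920, 53.0, 93.0), 1))
```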

Automaticity of phonological and semantic processing during visual word recognition

In the whole-brain analyses, the areas that showed significant activation in response to written words were mainly located in the prefrontal and occipito-temporal lobes. Yet, given the existence of functional and anatomical connections between the visual and auditory systems (Booth et al., 2002; Thiebaut De Schotten et al., 2014; Vagharchakian et al., 2012; Yeatman et al., 2011), it is likely that processing written words also induces activation in the spoken language system, although to a lesser extent. To further examine how the spoken language system is influenced by written words' visibility and task demand, we restricted our analyses to the cortical areas that process the acoustic, phonological and semantic contents of spoken sentences, using the subject-specific approach proposed by Nieto-Castañón and Fedorenko (2012). The same contrasts of interest as in the whole-brain analyses (cf. Table 2) were examined with a threshold of p < .05, corrected for multiple comparisons across the ROIs using the False Discovery Rate (FDR) method as proposed in the spm_ss toolbox (http://www.nitrc.org/projects/spm_ss). Overall, the pattern of activation obtained in the temporal areas identified by the auditory localizers differed from the main pattern found in the written language…

Fig. 3. Scatter plots illustrating the relation between the amplitude of the BOLD signal obtained in the visual, phonological and semantic tasks and the performance of the participants in the forced-choice task (centered scores), as a function of word visibility. The signals were extracted from seven regions of interest [left IFG Tri (green), precentral gyrus (red), SMA (cyan), insula (blue), anterior fusiform gyrus (Ant. Fus, magenta), middle fusiform gyrus (Mid. Fus, yellow) and posterior fusiform gyrus (Post. Fus, purple)] defined as the intersections of spheres of 10 mm radius with the clusters obtained in a conjunction contrast of the three tasks (p < .001, uncorrected, with a minimum of 50 contiguous voxels). The dots represent individual data; the lines represent fits by linear regressions.

Water sound recognition based on physical models

To cite this version: Guyot, Patrice; Pinquier, Julien; André-Obrecht, Régine. Water sound recognition based on physical models.
