• Aucun résultat trouvé

Data Integration and VISualization (DIVIS) : from large heterogeneous datasets to interpretable visualisations in plant science

N/A
N/A
Protected

Academic year: 2021

Partager "Data Integration and VISualization (DIVIS) : from large heterogeneous datasets to interpretable visualisations in plant science"

Copied!
2
0
0

Texte intégral

(1)

HAL Id: hal-01858693

https://hal-agrocampus-ouest.archives-ouvertes.fr/hal-01858693

Submitted on 24 Aug 2018

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Data Integration and VISualization (DIVIS) : from large heterogeneous datasets to interpretable visualisations in

plant science

Ophélie Thierry, Rachid Boumaza, Julia Buitink, Claudine Landès, Olivier Leprince, Mathilde Orsel, Pierre Santagostini, Julie Bourbeillon

To cite this version:

Ophélie Thierry, Rachid Boumaza, Julia Buitink, Claudine Landès, Olivier Leprince, et al.. Data Inte- gration and VISualization (DIVIS) : from large heterogeneous datasets to interpretable visualisations in plant science. JOBIM 2018, Jul 2018, Marseille, France. pp.61, JOBIM 2018. �hal-01858693�

(2)

Data Integration and VISualization (DIVIS) :

from large heterogeneous datasets to interpretable visualizations in plant science.)

O. Thierry 1 , R. Boumaza 1 , J. Buitink 1 , C. Landès 1 , O. Leprince 1 , M. Orsel 1 , P. Santagostini 1 et J. Bourbeillon 1

1

IRHS, Agrocampus-Ouest, INRA, Université d’Angers, SFR 4207 QuaSaV 42 rue Georges Morel, 49071 Beaucouzé Cedex, France

Introduction

• The increase in the scale of experiments, the multiplication of experimental techniques used as part of one study and the generalized storage of past experimental datasets have led to the creation of large collections of heterogeneous datasets such as those stored by the IRHS teams in plant science.

• Currently, biologists are more and more interested in reusing and exploring these collections through integration and visualization techniques, but there is no user friendly tool addressing the whole problem, in particular in plant science [1-3].

• The DIVIS project aims to tackle the issue through a methodological exploration of a possible solution and the development of a directly usable prototype software on two test datasets, by combining the most promising integration and visualization approaches that are publicly available.

Software Workflow

Project status

• Built the data matrices for both datasets.

• Performed basic statistical analyses to better understand the datasets.

• Performed some test visualizations with crude grouping approaches.

• Identified relevant existing ontologies and started designing missing ones.

• Current stage: associate metadata with ontologies and estimate their distribution [4-7].

Datasets

Seed datasetApple dataset

Seed dataset: ontologies

ColorClimateCountry

Apple dataset: ontologies

Visualization attempts

AppleSeed

Sugar contents Seed count plasticity

Organization axes

• mean sugar ratio

• sample

• region

• country

• precipitations class

• ecotype I Biologists show interest in the displays.

Limits

• Need to easily navigate in the variables and individuals spaces.

I More interactive visualization.

Conclusion

• Preliminary operations in order to create a tool able to efficiently summarize and visualize larges heterogeneous datasets.

• Limited variability for the current metadata variables in the test datasets.

I Exploring other dimensions, such as shapes of seeds, might help to generate more different groups.

• Current visualization attempts performed using manually organized data files and as such predefined organization axes.

I A strong need to select and order the organization axes, that is to say the metadata variables to consider, emerged, which is considered in the final tool.

I Beyond the supervised approach, unsupervised clustering approaches might be of interest to drive the hierarchical organization of the individuals.

References

[1] Tao, S., Cui, L., Wu, X., Zhang, G.Q. (2017) Facilitating Cohort Discovery by Enhancing Ontology Exploration, Query Management and Query Sharing for Large Clinical Data Repositories. AMIA Annu Symp Proc 2017: 1685-1694.

[2] Solovieva, E., Shikanai, T., Fujita N., Narimatsu, H. (2018) GGDonto ontology as a knowledge-base for genetic diseases and disorders of glycan metabolism and their causative genes. Journal of Biomedical Semantics 9(14).

[3] Hendler, J. (2014) Data Integration for Heterogenous Datasets. Big Data 2(4):205-215.

[4] Fonseca, F.T., Egenhofer, M. J., Agouris, P. and Câmara, G. (2002) Using Ontologies for Integrated Geographic Information Systems. Transactions in GIS 6: 231-257.

[5] Kohler, S. Improved ontology-based similarity calculations using a study-wise annotation model.(2018) Database: The Journal of Biological Databases and Curation 2018:bay026.

[6] Cohen J., Matthen M. Color Ontology and Color Science. Bradford Book, A. (2010).

[7] Hartmann, J., Palma, R., Gómez-Pérez, A. Ontology Repositories. Handbook on Ontologies. Berlin, Heidelberg. Springer Berlin Heidelberg. (2009)

Acknowledgements

This work was partially funded by the projects: REGULEG

ANR-15-CE20-000, AI-fruits ("Région Pays de la Loire") and

DIVIS (RFI "Objectif Végétal"). We would like to thank the

ANAN (transcriptomic analysis) and PHENOTIC (seed germination

imaging) platforms for their role in the datasets acquisition.

Références

Documents relatifs

In favour of the objective cause, one can notice that the readmission-associated trajectory contains different spectrum of primary diagnoses compared to the long stay-associated

Data Visualization is the representation of data in some systematic form to communicate the information extracted more clearly and effectively, exploiting the cognitive abilities of

Buts et question de recherche Quatre objectifs ont guidés cette étude : 1 Evaluer l’acceptation de la CC comme but chez les spécialistes français du traitement de la dépendance

Dimension bildet die Beziehung ab, in der das Individuum zu seiner Gesellschaft steht, also ob die Menschen enge Beziehungen mit anderen eingebunden sind

Sur le relevé de la caractéristique électrique avec présence du média traité par DBD, on distingue bien que le courant de la décharge a bel et bien diminué par

As these metadata projects grow horizontally, with more databases and types of data sets, as well as vertically, with more documents, the ecologist will need information

Observations as well as simulations may be accessed from remote data providers as explained in Section 6; these data can be displayed in a 3D scene along the trajectories of

For data processing systems benchmarking, visual analytics performance can include (1) metrics of the indication of uncertainty in intermediate results, (2) bounds on time until