Amye Kenall
Journal Development Manager, Open Data
Gfii Research data and the scientific publication Institut Pasteur, Paris
12 February 2014
Data Computational Research
• Founded in 2000, bought by Springer in 2008
• BioMed Central publishes 260 open access journals
• ~25,000 peer reviewed research articles published annually
• Genomics and computational biology are a significant fraction e.g. Genome Biology, BMC Genomics, BMC Bioinformatics
• Other key fields include
  • Public Health / Global Health / Infectious Disease
  • Cancer
• All research articles are CC-BY licensed for reuse
• Since mid 2013, all data is covered by a CC0 rights waiver unless otherwise indicated
Open Data at BioMed Central
• Strong encouragement to authors of all journals to provide underlying datasets; required on a select number (e.g., Genome Biology, Genome Medicine, GigaScience)
• CC0 + CC-BY 4.0 by default
To do…
• Tabular data as CSV download
• DOIs for all additional files
• Searchability of additional files
• OAI-PMH for additional files as a Virtual Data Repository to aid harvesting, e.g. by the Data Citation Index
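As a rough illustration of what harvesting over OAI-PMH involves, the sketch below builds a `ListRecords` request URL and parses a response page. The verb, the `oai_dc` metadata prefix, the `resumptionToken` paging mechanism and the XML namespaces come from the OAI-PMH 2.0 protocol; the endpoint `https://example.org/oai`, the record identifier and the sample response are hypothetical and do not describe an actual BioMed Central service.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return ([(identifier, title), ...], next_token_or_None) for one page."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext("oai:header/oai:identifier", namespaces=NS)
        title = rec.findtext(".//dc:title", namespaces=NS)
        records.append((identifier, title))
    token = root.findtext(".//oai:resumptionToken", namespaces=NS)
    return records, (token or None)


# Hypothetical response page, shaped like an OAI-PMH ListRecords reply.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:file-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Additional file 1</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("https://example.org/oai"))
    print(parse_records(SAMPLE))
```

A harvester would fetch the first URL, parse the page, and repeat with the returned token until no token comes back.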
Data reuse
• Availability of Data section and Data Citation
• Encourage use of ISA-TAB (especially GigaScience / GigaDB and BMC Research Notes)
Open API to retrieve information from GigaDB
[Diagram labels: URL, API (on GigaDB), XML file]
• Only journal where the article, and the data and source code behind it, can be mined through an open API
• Only journal offering a home for complex image data (e.g., fMRI scans) right next to the article
Linking and Citation
Already in place
• JMOL for 3D rendering of MOL/PDB
• Google Earth for geographic data (KML)
• Virtual microscope slides
• Mini-websites (generic)
To do...
• Movies as H263/MP4
• Interactive dataset visualization via JS
• Interactive visualization as part of reproducible data analysis
Data visualization at BioMed Central
• Manipulatable 3D files in PDF
• Video files in PDF
• Deep Zoom
• Electronic lab notebooks
Reproducibility of computational research
• Computational research should, in principle, be easier to replicate/reproduce than bench studies
• However, practical issues get in the way
• Even if source code is shared, reproducing the entire technical setup, gathering appropriate input data, and rerunning the analysis is a significant effort
• This means readers and even reviewers don’t bother
• We would like to reduce this ‘activation energy’
Strong interest from potential partners
Key technologies
Technologies + Partners + Journal Article
• Publishers have role in enforcement of community standards
• Public/academic databases can provide credible long term archiving guarantees for key data
• Academic grid computing infrastructure can give researchers access to large-scale computing resources
• Commercial cloud providers universalize/democratize access to large-scale computing. Even if you are not at an institution with its own facilities, you can carry out high-end computations. No bureaucracy/politics – simply pay per CPU-hour.
Complementary roles of publishers, academia, and cloud providers
Flexible management/deployment of packaged data/analysis suites using VM infrastructure
• To what extent can/should datasets be included in the VM/suite or pulled in externally? Where should they be hosted?
• To what extent are cross-domain standards for referring to and pulling in underlying datasets feasible? Dataset DOIs typically point to metadata.
• Multiple versions of datasets. To what extent is it practical, when dealing with evolving datasets/databases, to make them available as reproducible snapshots?
• Culture of data sharing. How to get authors to share their data?
Specific challenges with respect to data
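One way to make the "reproducible snapshot" question concrete is a checksum manifest: record a SHA-256 digest per file plus one identifier for the whole snapshot, so a later reader can confirm they are analysing exactly the version the article used. The sketch below is a minimal illustration under that assumption, not a description of any existing BioMed Central or GigaDB mechanism; the function names, file name and version label are hypothetical.

```python
import hashlib
import json


def snapshot_manifest(files, version):
    """Describe one dataset snapshot.

    files: dict mapping file name -> file content as bytes.
    Returns per-file SHA-256 digests and a snapshot_id derived from them.
    """
    digests = {name: hashlib.sha256(data).hexdigest()
               for name, data in files.items()}
    # Serialise with sorted keys so the same files always give the same id.
    canonical = json.dumps(digests, sort_keys=True).encode("utf-8")
    return {
        "version": version,
        "files": digests,
        "snapshot_id": hashlib.sha256(canonical).hexdigest(),
    }


def verify(files, manifest):
    """True if the given files reproduce the snapshot the manifest describes."""
    current = snapshot_manifest(files, manifest["version"])
    return current["snapshot_id"] == manifest["snapshot_id"]


if __name__ == "__main__":
    original = {"counts.csv": b"gene,count\nBRCA1,12\n"}
    manifest = snapshot_manifest(original, "2014-02-12")
    print(verify(original, manifest))                                # unchanged data
    print(verify({"counts.csv": b"gene,count\nBRCA1,13\n"}, manifest))  # edited data
```

A manifest like this could travel inside a VM image or alongside a dataset DOI; it identifies a version but does not by itself solve where the snapshot is hosted.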
Culture of Data Sharing
• Data may mean the difference between getting a grant or not.
• Creators (understandably) prefer to hold the data until they have extracted all the possible publication value they can.
• Credit for data and source code is not institutionalised as it is for the article
• This behaviour comes at a cost for the wider scientific community.
an open data badge.
• With big data and computational tools, research is becoming more “reproducible/reusable”
• The infrastructure is out there and growing
• What authors need to communicate their research is also changing, and as publishers we must respond
• Clearly, publishers have a role, with other organisations, in setting some community standards
• It took a few hundred years, but publishing is now getting exciting
Conclusions
Questions?
“One reason that the worldwide web worked was because people reused each other’s content in ways never imagined or achieved by those who created it. The same will be true of open data.”
– Tim Berners-Lee and Nigel Shadbolt, The Times, New Year’s Eve 2011
Amye Kenall
Journal Development Manager (Open Science), BioMed Central
@AmyeKenall (also @OpenDataBMC)