Data Computational Research

(1)

Amye Kenall

Journal Development Manager, Open Data

Gfii: Research data and the scientific publication, Institut Pasteur, Paris

12 February 2014

(2)

• Founded in 2000, bought by Springer in 2008

• BioMed Central publishes 260 open access journals

• ~25,000 peer reviewed research articles published annually

• Genomics and computational biology account for a significant fraction, e.g. Genome Biology, BMC Genomics, BMC Bioinformatics

• Other key fields include

  • Public Health / Global Health / Infectious Disease

  • Cancer

• All research articles are CC-BY licensed for reuse

• Since mid 2013, all data is covered by a CC0 rights waiver unless otherwise indicated

Open Data at BioMed Central

(3)

• Strong encouragement to authors of all journals to provide underlying datasets; required on a select number (e.g. Genome Biology, Genome Medicine, GigaScience)

• CC0 + CC-BY 4.0 by default

To do…

• Tabular data as CSV download

• DOIs for all additional files

• Searchability of additional files

• OAI-PMH for additional files as a Virtual Data Repository to aid harvesting, e.g. by the Data Citation Index
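As an illustration of what harvesting over OAI-PMH involves, here is a minimal Python sketch. It builds a `ListRecords` request and reads record identifiers and the resumption token from a response. The endpoint `https://example.org/oai` and the sample response are assumptions made up for this example; they do not describe an actual BioMed Central service.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://example.org/oai"

# Namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it replaces metadataPrefix on follow-up requests.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None)."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


# Made-up response, so the parser can be exercised without a network call.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:file-1</identifier></header></record>
    <record><header><identifier>oai:example.org:file-2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

first_url = list_records_url(BASE_URL)
ids, token = parse_list_records(SAMPLE)
next_url = list_records_url(BASE_URL, resumption_token=token)
```

A harvester would repeat the request with each returned token until no token comes back.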

Data reuse

• Availability of Data section and Data Citation

• Encourage use of ISA-TAB (especially GigaScience / GigaDB and BMC Research Notes)

(4)

(5)

(6)

Open API to retrieve information (on GigaDB)

[Figure: URL, API, XML file]

• Only journal where the article, together with the data and source code behind it, can be mined through an open API

• Only journal offering a home for complex image data (e.g. fMRI) right next to the article

(7)

Linking and Citation

(8)

Already in place

• JMOL for 3D rendering of MOL/PDB

• Google Earth for geographic data (KML)

• Virtual microscope slides

• Mini-websites (generic)

To do...

• Movies as H263/MP4

• Interactive dataset visualization via JS

• Interactive visualization as part of reproducible data analysis

Data visualization at BioMed Central

(9)

Manipulatable 3D Files in PDF

(10)

Video Files in PDF

(11)

Deep Zoom

Electronic Lab Notebooks

(12)

(13)

Reproducibility of computational research

• Computational research in principle should be easier to replicate/reproduce than bench studies

• However, practical issues get in the way

• Even if source code is shared, reproducing the entire technical setup, gathering appropriate input data, and rerunning the analysis is a significant effort

• This means readers and even reviewers don’t bother

• We would like to reduce this ‘activation energy’
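One small way to lower that barrier is to ship a record of the technical setup alongside the code. The sketch below is an assumption about how this could look, not a described BioMed Central workflow. It captures the interpreter, the operating system, and selected package versions as JSON. The function name `capture_environment` and the package names are hypothetical.

```python
import json
import platform
from importlib import metadata


def capture_environment(packages):
    """Collect interpreter, OS, and package versions into a plain dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": versions,
    }


# Example package names; the second is deliberately one that should not exist.
env = capture_environment(["pip", "definitely-not-installed-pkg-xyz"])

# A sorted, indented JSON manifest is easy to diff between two machines.
manifest = json.dumps(env, indent=2, sort_keys=True)
```

A reader or reviewer can compare such a manifest with their own environment before attempting to rerun an analysis.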

(14)

Strong interest from potential partners

(15)

Key technologies

(16)

Technologies + Partners + Journal Article

(17)


(18)

• Publishers have role in enforcement of community standards

• Public/academic databases can provide credible long term archiving guarantees for key data

• Academic grid computing infrastructure can give researchers access to large-scale computing resources

• Commercial cloud providers universalize/democratize access to large-scale computing. Even if you are not at an institution with its own facilities, you can carry out high-end computations. No bureaucracy/politics: simply pay per CPU-hour.

Complementary roles of publishers, academia, and cloud providers

(19)

Flexible management/deployment of packaged data/analysis suites using VM infrastructure

(20)

• To what extent can/should datasets be included in the VM/suite or pulled in externally? Where should they be hosted?

• To what extent are cross-domain standards for referring to and pulling in underlying datasets feasible? Dataset DOIs typically point to metadata.

• Multiple versions of datasets. To what extent is it practical, when dealing with evolving datasets/databases, to make them available as reproducible snapshots?

• Culture of data sharing. How to get authors to share their data?

Specific challenges with respect to data
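On the snapshot question, one possible approach (an assumption for illustration, not something the slides prescribe) is to pin each dataset version with a checksum manifest. An analysis can then verify it is running against exactly the data it was published with. The function names and the version label below are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot_manifest(directory, version_label):
    """Map every file under `directory` to its SHA-256 checksum."""
    root = Path(directory)
    files = {
        path.relative_to(root).as_posix(): sha256_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
    return {"version": version_label, "files": files}


def verify_snapshot(directory, manifest):
    """True if the directory still matches the recorded snapshot."""
    return snapshot_manifest(directory, manifest["version"]) == manifest


# Demonstration on a throwaway directory with made-up data.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "table.csv"
    data_file.write_text("gene,count\nabc,1\n")
    manifest = snapshot_manifest(tmp, "snapshot-1")
    unchanged = verify_snapshot(tmp, manifest)

    # Any later edit to the data makes verification fail.
    data_file.write_text("gene,count\nabc,2\n")
    changed = not verify_snapshot(tmp, manifest)
```

A manifest like this is small enough to travel with the article or the VM image even when the data itself is hosted elsewhere.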

(21)

Culture of Data Sharing

• Data may mean the difference between getting a grant or not.

• Creators (understandably) prefer to hold the data until they have extracted all the possible publication value they can.

• Credit for data and source code is not institutionalised as it is for the article.

• This behaviour comes at a cost for the wider scientific community.

(22)

an open data badge.

(23)

• With big data and computational tools, research is becoming more “reproducible/reusable”

• The infrastructure is out there and growing

• What authors need to communicate their research is also changing, and as publishers we must respond

• Clearly, publishers have a role, with other organisations, in setting some community standards

• It took a few hundred years, but publishing is now getting exciting

Conclusions

(24)

Questions?

“One reason that the worldwide web worked was because people reused each other’s content in ways never imagined or achieved by those who created it. The same will be true of open data.”

– Tim Berners-Lee and Nigel Shadbolt, The Times, New Year’s Eve 2011

Amye Kenall

Journal Development Manager (Open Science), BioMed Central

@AmyeKenall (also @OpenDataBMC)