Conference Presentation
Reference
Dissecting PubMed: which content is covered by the Library? and Open Access?
IRIARTE, Pablo, MULLER, Floriane Sophie
Abstract
Our project aims to measure how accessible PubMed's contents are. By downloading all PubMed metadata, enriching it with missing DOIs, and comparing it against our e-journal and print collections and against Open Access tools, we dissect full-text accessibility at our institution: how does the library fare with its online subscriptions and print journal collections? And which portion of PubMed is accessible to the general public via Open Access (OA)?
IRIARTE, Pablo, MULLER, Floriane Sophie. Dissecting PubMed: which content is covered by the Library? and Open Access? In: 16th European Association for Health Information and Libraries (EAHIL) Conference, Cardiff (UK), 9-13 July, 2018
Available at:
http://archive-ouverte.unige.ch/unige:106482
LIBRARY
DISSECTING PUBMED
Which content is covered by the Library? And Open Access?
Pablo Iriarte Floriane Muller
July 12th, 2018
CONTEXT
INSTITUTIONAL CONTEXT AT A GLANCE
Established in 1559
UNIGE data 2017
Staff
Students
= 4’494.2 FTE
Faculties
Nationalities of foreign students
Professors (FTE)
More than 500 programmes: 27 Bachelor’s degrees, 108 Master’s degrees, 80 doctoral programmes
Library data
The Library: open 343 days / year, 88 hours / week
16’342 m²
2’757 seats
104.4 FTE (11.05 in Medicine)
A STORY FIRST
OUR RESEARCH QUESTIONS
• What is the full-text coverage of PubMed at our library?
• Which portion of PubMed is accessible to the general public via Open Access (OA)?
o Are some PubMed articles disconnected from their DOIs?
o Would finding those DOIs help increase the efficiency of OA tools?
PROJECT AIMS
• Try some tools (data science & OA)
• Observe Open Access trends within PubMed
• Discover how our library collections cover PubMed’s contents
• Improve contents’ visibility and accessibility at our institution
METHODS
Pernell CC BY-SA 2.0
MAIN DATASET
https://www.nlm.nih.gov/databases/download/pubmed_medline.html
> 928 compressed files (gzip)
> Total size: 23 GB (~200 GB unzipped)
> 30’000 XML-formatted entries per file (status as of 08.02.2018)
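Reading one of these compressed files can be sketched as follows. This is a minimal illustration, not the notebooks' actual code: the function name `iter_pmid_doi` and the two-record sample are assumptions, and the element paths follow the PubMed XML layout (`PubmedArticle`, `MedlineCitation/PMID`, `ArticleId` with `IdType="doi"`).

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Tiny made-up stand-in for one baseline file (real files hold about 30'000 entries).
SAMPLE_XML = b"""<?xml version="1.0"?>
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation><PMID Version="1">1</PMID></MedlineCitation>
    <PubmedData><ArticleIdList>
      <ArticleId IdType="pubmed">1</ArticleId>
      <ArticleId IdType="doi">10.1000/Example.1</ArticleId>
    </ArticleIdList></PubmedData>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation><PMID Version="1">2</PMID></MedlineCitation>
    <PubmedData><ArticleIdList>
      <ArticleId IdType="pubmed">2</ArticleId>
    </ArticleIdList></PubmedData>
  </PubmedArticle>
</PubmedArticleSet>
"""


def iter_pmid_doi(fileobj):
    """Yield (pmid, doi or None) for each PubmedArticle in a gzip-compressed stream."""
    with gzip.open(fileobj, "rb") as xml_stream:
        # iterparse streams the file, so a ~200 MB unzipped file never sits fully in memory.
        for _event, elem in ET.iterparse(xml_stream, events=("end",)):
            if elem.tag != "PubmedArticle":
                continue
            pmid = elem.findtext("MedlineCitation/PMID")
            doi = None
            for article_id in elem.findall("PubmedData/ArticleIdList/ArticleId"):
                if article_id.get("IdType") == "doi":
                    # DOIs are case-insensitive, so lower-case them for later matching.
                    doi = (article_id.text or "").strip().lower() or None
                    break
            yield pmid, doi
            elem.clear()  # drop the processed record's children


compressed = io.BytesIO(gzip.compress(SAMPLE_XML))
records = list(iter_pmid_doi(compressed))
print(records)
```

Records that come back with `None` as DOI are the candidates for the "missing DOIs" search described later.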
DATA SOURCES
27’836’723 PMIDs
Swiss National Licences
11’348 journals
2’631 journals
4’966’742 PMCIDs
93.7 million DOIs
21’840 print holdings (STM field)
128’449 ejournals
2 million records
92 million records
17.8 million records
TOOLS
XMLStarlet
> Command-line utilities to parse, extract and transform XML files
> Python library for data manipulation and analysis
> Python library for algorithm optimisation (parallel computing)
> IPython Notebooks creation and collaboration tool
RESULTS
ACCESS OFFERED BY THE LIBRARY
Université de Genève, Jacques Erard
METHODS
21’840 print holdings (STM field) 128’449 ejournals
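One way to test whether an article falls inside the library's holdings is to compare its journal and publication year against coverage ranges. The sketch below is only an assumption about how such a check could look: the slides do not describe the matching rule, and the ISSNs, year ranges and the helper `is_covered` are all made up for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical holdings table: ISSN -> list of (first_year, last_year).
# last_year = None means the subscription or print run is still ongoing.
Holdings = Dict[str, List[Tuple[int, Optional[int]]]]


def is_covered(issn: str, year: int, holdings: Holdings) -> bool:
    """Return True if the journal/year pair falls inside any coverage range."""
    for first_year, last_year in holdings.get(issn, []):
        if year >= first_year and (last_year is None or year <= last_year):
            return True
    return False


# Made-up example data, not the library's real holdings.
holdings: Holdings = {
    "1234-5678": [(1990, 2005), (2010, None)],
    "8765-4321": [(2000, 2015)],
}
articles = [
    ("1234-5678", 1995),  # inside the first range
    ("1234-5678", 2007),  # falls in the gap between the two ranges
    ("1234-5678", 2018),  # inside the open-ended range
    ("8765-4321", 2016),  # after the holding ended
    ("0000-0000", 2010),  # journal not held at all
]

covered = [is_covered(issn, year, holdings) for issn, year in articles]
coverage_share = sum(covered) / len(covered)
print(covered, coverage_share)
```

Aggregating `covered` per publication year would give the kind of coverage-by-year view shown in the charts that follow.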
PUBMED
Which content is covered by the Library?
PUBMED
Which content is covered by the Library? (percentage)
OVERALL PUBMED COVERAGE OFFERED BY THE LIBRARY
27’836’723 PMIDs
Full-text @ UNIGE = 73.5%
PUBMED’S CONTENTS & OA
VARIOUS SOURCES OF (OPEN) ACCESS
THE MISSING DOIs STORY
PUBMED DOIs
METHODS: MATCHING PMIDs TO DOIs
Merging PubMed & Crossref
• 92 million references in Crossref
• 28 million references in PubMed
• Using APD key (Author, start page, date)
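The APD merge can be pictured as building the same composite key on both sides and joining on it. The following is a minimal sketch under stated assumptions: the normalisation rules (ASCII-folded surname, year as the date), the field names and the sample records are illustrative, not taken from the project's notebooks.

```python
import unicodedata
from collections import defaultdict


def apd_key(author: str, start_page: str, year: int) -> str:
    """Build an 'author|page|year' key; the normalisation rules are assumptions."""
    folded = unicodedata.normalize("NFKD", author).encode("ascii", "ignore").decode("ascii")
    surname = "".join(ch for ch in folded.lower() if ch.isalpha())
    return f"{surname}|{str(start_page).strip().lower()}|{year}"


# Made-up records standing in for the PubMed and Crossref dumps.
pubmed = [
    {"pmid": "101", "author": "Müller", "page": "12", "year": 2001},
    {"pmid": "102", "author": "Smith", "page": "345", "year": 1999},
    {"pmid": "103", "author": "Lee", "page": "7", "year": 2010},
]
crossref = [
    {"doi": "10.1000/a", "author": "Muller", "page": "12", "year": 2001},
    {"doi": "10.1000/b", "author": "Smith", "page": "345", "year": 1999},
    {"doi": "10.1000/c", "author": "Smith", "page": "345", "year": 1999},
]

# Index Crossref DOIs by APD key, then look up each PubMed record.
dois_by_key = defaultdict(list)
for rec in crossref:
    dois_by_key[apd_key(rec["author"], rec["page"], rec["year"])].append(rec["doi"])

candidates = {
    rec["pmid"]: dois_by_key.get(apd_key(rec["author"], rec["page"], rec["year"]), [])
    for rec in pubmed
}
print(candidates)
```

PMID 102 receives two candidate DOIs in this toy data, which shows how a key this coarse can produce ambiguous matches that need a second check.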
Comparing results with Europe PMC data
• PPV = 0.45
• Sensitivity = 0.91
• Specificity = 0.30
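For reference, the three measures follow from a confusion matrix in the usual way. The sketch below uses the standard definitions with arbitrary toy counts; the counts are not the study's data, and treating the Europe PMC PMID-DOI pairs as the reference standard is an assumption about how the comparison was set up.

```python
def ppv(tp: int, fp: int) -> float:
    """Positive predictive value: share of proposed matches that are correct."""
    return tp / (tp + fp)


def sensitivity(tp: int, fn: int) -> float:
    """Share of true matches that the method actually found."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Share of true non-matches that the method correctly rejected."""
    return tn / (tn + fp)


# Arbitrary toy counts for illustration only (not the project's figures).
tp, fp, fn, tn = 6, 2, 4, 8

print(f"PPV = {ppv(tp, fp):.2f}")
print(f"Sensitivity = {sensitivity(tp, fn):.2f}")
print(f"Specificity = {specificity(tn, fp):.2f}")
```

Read this way, a high sensitivity combined with a low PPV means the key finds most true matches but also proposes many wrong ones.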
Improving our results
• Levenshtein distance to measure differences between PubMed’s and Crossref’s article titles
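The Levenshtein distance counts the single-character insertions, deletions and substitutions needed to turn one string into another. A minimal pure-Python sketch is given below; the title normalisation and the idea of dividing by the longer title's length are assumptions, since the slides do not state how titles were prepared or which threshold was used.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping only two rows in memory."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def normalized_title_distance(title_1: str, title_2: str) -> float:
    """Edit distance divided by the longer title's length (0.0 = identical)."""
    a = " ".join(title_1.lower().split())
    b = " ".join(title_2.lower().split())
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(levenshtein("kitten", "sitting"))
print(normalized_title_distance("Open  Access in PubMed", "open access in pubmed"))
```

A candidate PMID-DOI pair could then be kept only when the normalised distance between the two titles stays under a chosen cut-off.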
Merging all DOIs
• APD - improved
• Europe PMC PMID-DOIs
• DOIs already in PubMed
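Merging the three DOI sources amounts to taking, for each PMID, the first DOI found in a priority order. The sketch below assumes one such order (DOIs already in PubMed first, then Europe PMC, then the improved APD matches); that order, the function `merge_doi_sources` and the sample mappings are illustrative assumptions, not the project's documented procedure.

```python
from typing import Dict, Optional, Tuple


def merge_doi_sources(*sources: Tuple[str, Dict[str, Optional[str]]]) -> Dict[str, Tuple[str, str]]:
    """Merge PMID -> DOI mappings; earlier sources win when several offer a DOI."""
    merged: Dict[str, Tuple[str, str]] = {}
    for source_name, mapping in sources:
        for pmid, doi in mapping.items():
            if doi and pmid not in merged:
                merged[pmid] = (doi.lower(), source_name)
    return merged


# Made-up mappings for illustration.
in_pubmed = {"1": "10.1000/A", "2": None}
europe_pmc = {"2": "10.1000/b", "3": "10.1000/c"}
apd_improved = {"3": "10.1000/other", "4": "10.1000/d"}

merged = merge_doi_sources(
    ("pubmed", in_pubmed),
    ("europe_pmc", europe_pmc),
    ("apd", apd_improved),
)
added = sum(1 for _doi, source in merged.values() if source != "pubmed")
print(merged, added)
```

Keeping the source name next to each DOI makes it easy to count how many DOIs were added on top of those PubMed already held.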
LOOKING FOR THE MISSING DOIs
PUBMED
Existing / found DOIs (percentage)
Total DOIs in PubMed = 11’931’616 - Added DOIs = 7’510’309
PUBMED
Which content is Open Access?
Total Open Access in PubMed = 25.2% (33% of contents with known DOI)
PUBMED
Which content is Open Access? (percentage)
THE STATE OF OA
Figure 2 Number of articles (A) and proportion of articles (B) with OA copies, estimated based on a random sample of 100,000 articles with Crossref DOIs. 10.7717/peerj.4375/fig-2
“We estimate that at least 28% of the scholarly literature is OA (19M in total) and that this proportion is growing, driven particularly by growth in Gold and Hybrid. The most recent year analyzed (2015) also has the highest percentage of OA (45%).” Piwowar et al. (2018), 10.7717/peerj.4375
THE STATE OF PUBMED OA
“We estimate that at least 28% of the scholarly literature is OA (19M in total) and that this proportion is growing, driven particularly by growth in Gold and Hybrid. The most recent year analyzed (2015) also has the highest percentage of OA (45%).” Piwowar et al. (2018), 10.7717/peerj.4375
We found that at least 33% of PubMed literature with a known or found DOI is OA (6M in total) and that this proportion is growing. The most recent years analyzed are affected by embargo periods. 2015 has the highest percentage of OA (49%).
OA + SWISS NATIONAL LICENCES
Accessibility for Swiss citizens = 29.1% of PMIDs
CONCLUSIONS
QUICK RECAP
• Our UNIGE users have access to 74% of PubMed’s articles
• Swiss citizens have access to 29% of PubMed’s articles
• DOIs are (the missing) keys: 7’510’309 could be added
• Embargoes’ impact on OA is visible
• OA evaluation depends on granularity: 33% of PMIDs with known DOIs are OA (= 6’131’801)
• Articles considered: 27’836’723 PMIDs
• PubMed’s growth: > 1 million articles / year
NEXT?
• Benchmarking
• Inclusion & promotion of OA tools & contents
• Systematic reviews
• Informed collection development decisions
• Visibility of our institution’s production (IR)
CC-By Николай Максимович
THANK YOU FOR YOUR ATTENTION
SEE YOU NEXT YEAR IN BASEL!
Pablo.Iriarte@unige.ch Floriane.Muller@unige.ch
Bibliothèque de l’UNIGE, 2018
This document is licensed under Creative Commons Attribution-ShareAlike 4.0 International License: http://creativecommons.org/licenses/by-sa/4.0/.
@pablog_ch
@Flor_Mu
notebooks + slides: www.purl.org/unige/eahil2018