Gene Duplicates
in Mammalian Genomes
Jin Jun, Edward Hemphill, Ion Mandoiu, Craig Nelson
University of Connecticut
Paul Ryvkin
University of Pennsylvania
Genome Evolution by Gene Duplication
Gene duplication has been recognized as a major force in mammalian genome evolution
~720 human-specific gene duplications (Zhang 2002)
DNA- vs. RNA-mediated duplication mechanisms
DNA-mediated
Occurs continuously
Contributed significantly to the divergence of gene content
RNA-mediated
Occurs quite frequently
Long believed to be non-functional (pseudogenes) because the copies lack the regulatory material of the parental gene
Functional retro-copies were recently detected in the human and mouse genomes (Sakai et al. 2007, Vinckenbosch et al. 2006)
Characteristics of Duplication Mechanisms
DNA-mediated (segmental duplication or SD) copies
Syntenic to each other
Share introns
RNA-mediated (retrotransposed or RT) copies
Non-syntenic to parent gene
Intronless or share no introns with parent gene
Pseudogenes
Majority of retrotransposition events generate processed pseudogenes
Some SD copies degrade into non-processed pseudogenes
Mammalian Genomes in Our Study
Our Approach
Included pseudogenes in our analysis to capture more duplication events.
Avoided using coding sequence similarity to reconstruct the evolutionary history of gene families.
Local synteny information and gene structure information were used to distinguish SD events from RT events.
This makes it possible to measure functional constraint against the ancestral evolutionary history.
Using Pseudogenes
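[Figure: two scenarios for a human (Hs) and chimpanzee (Pt) gene family.
Speciation only: intact genes in Hs and Pt; no duplication event; the copies are functional orthologs.
Duplication followed by pseudogenization: pseudogenes are kept in the family; one duplication event; the copies are ancestral orthologs.]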
Measuring Local Synteny
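Three neighboring genes on each side (upstream and downstream) of the gene of interest (GOI) are considered
Homologous matches:
BlastP score > 50 and sequence similarity > 80%
[Figure: neighbors of two GOIs joined by homologous and weak homologous matches; in the example shown, (# matched sides, # matches) = (2, 3).]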
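The (# matched sides, # matches) score can be illustrated with a minimal Python sketch. The homology cutoffs (BlastP score > 50, similarity > 80%) come from the slide; everything else is an assumption made for illustration: the `hits` dictionary layout, the greedy one-to-one pairing of neighbors, and the choice to also try the flipped orientation so that genes on opposite strands can still match.

```python
from typing import Dict, FrozenSet, List, Tuple

Hit = Tuple[float, float]  # (BlastP score, percent sequence similarity)
Flanks = Tuple[List[str], List[str]]  # (upstream neighbors, downstream neighbors)


def is_homolog(a: str, b: str, hits: Dict[FrozenSet[str], Hit],
               min_score: float = 50.0, min_similarity: float = 80.0) -> bool:
    """Apply the slide's cutoffs to a precomputed pairwise hit (hypothetical layout)."""
    hit = hits.get(frozenset((a, b)))
    return hit is not None and hit[0] > min_score and hit[1] > min_similarity


def side_matches(side_a: List[str], side_b: List[str],
                 hits: Dict[FrozenSet[str], Hit]) -> int:
    """Greedy one-to-one count of homologous neighbor pairs between two flanks."""
    unused = list(side_b)
    count = 0
    for a in side_a:
        for b in unused:
            if is_homolog(a, b, hits):
                unused.remove(b)  # each neighbor of B is used at most once
                count += 1
                break
    return count


def local_synteny(flanks_a: Flanks, flanks_b: Flanks,
                  hits: Dict[FrozenSet[str], Hit]) -> Tuple[int, int]:
    """Return (# matched sides, # matches) for two genes of interest."""
    up_a, down_a = flanks_a
    up_b, down_b = flanks_b
    same = (side_matches(up_a, up_b, hits), side_matches(down_a, down_b, hits))
    # Assumption: also score the inverted orientation and keep the better one.
    flipped = (side_matches(up_a, down_b, hits), side_matches(down_a, up_b, hits))
    best = max(same, flipped, key=sum)
    return sum(1 for m in best if m > 0), sum(best)


# Invented neighbor names and hit values, three neighbors per side.
hits = {
    frozenset(("a_u1", "b_u1")): (120.0, 92.0),
    frozenset(("a_u2", "b_u2")): (80.0, 85.0),
    frozenset(("a_d1", "b_d1")): (200.0, 95.0),
    frozenset(("a_d2", "b_d2")): (60.0, 70.0),  # fails the similarity cutoff
}
flanks_a = (["a_u1", "a_u2", "a_u3"], ["a_d1", "a_d2", "a_d3"])
flanks_b = (["b_u1", "b_u2", "b_u3"], ["b_d1", "b_d2", "b_d3"])
result = local_synteny(flanks_a, flanks_b, hits)
print(result)
```

With these invented inputs, two upstream pairs and one downstream pair pass the cutoffs, so both sides are matched with three matches in total.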
FP and FN rates of Local Synteny
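[Figure: local synteny of various orthologs. Proportions (0-100%) of ortholog pairs in each (# matched sides, # matches) category, from (0,0), (1,1), (1,2), (1,3), (2,2) up to (2,6), for Random, Inparanoid, and Ensembl pairs; the categories are grouped into synteny levels s_{i,j} = 0, 1, 2.]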
RT Miscall by Inparanoid
[Figure: genomic neighborhoods (GOI with upstream and downstream neighbors) of two rat genes, Rat A (0 CDS introns) and Rat B (2 CDS introns), and a dog gene (3 CDS introns), drawn on Chr9 (- strand) and Chr3 (+ strand) with homology links and orthologous introns marked; spans are labeled 6.76 Kb, 3.06 Kb, and 18.84 Kb. The legend contrasts the Inparanoid 1-to-1 ortholog with the ortholog supported by 6 local matches and ICR = 2/3. Neighbor labels include Arpc5l, Gpr45, Tgfbrap1, Mrps9, Pou3f3, Olfml2a, Golga1, Ppp6c, and several RGD, LOC, and numeric identifiers.]
Duplication Event Identification
[Flowchart: ENSEMBL protein families and PseudoPipe* pseudogenes are combined into gene + pseudogene families. Local synteny information drives a two-stage clustering algorithm that produces syntenic clusters. A hierarchical clustering algorithm then finds successive SD events and recovers the old / larger gene families, while an intron conservation filter takes the recovered putative RT events and confirms them. Outputs: SD events and RT events.]
* Zhang et al., Bioinformatics 2006
Two-Stage Clustering Algorithm
Stage 1: Single-linkage clustering with synteny level 2
Stage 2: Complete-linkage clustering with synteny level 1
Considering the phylogenetic structure
Result: syntenic clusters
Any member within clusters: from SD or speciation events
Between these clusters: RT events or old SD events with loss of local synteny
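The two stages can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: pairwise synteny levels (0, 1, 2) are assumed to be precomputed in a dictionary, stage 2 greedily merges the first eligible pair of clusters, and the phylogenetic-structure constraint mentioned on the slide is left out.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def two_stage_clusters(genes: List[str],
                       level: Dict[FrozenSet[str], int]) -> List[Set[str]]:
    """Cluster genes by pairwise synteny level (missing pairs count as 0)."""

    def s(g: str, h: str) -> int:
        return level.get(frozenset((g, h)), 0)

    # Stage 1: single linkage at synteny level 2, i.e. connected components
    # of the graph whose edges are level-2 pairs (union-find).
    parent = {g: g for g in genes}

    def find(g: str) -> str:
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    for g, h in combinations(genes, 2):
        if s(g, h) >= 2:
            parent[find(g)] = find(h)

    groups: Dict[str, Set[str]] = {}
    for g in genes:
        groups.setdefault(find(g), set()).add(g)
    clusters = list(groups.values())

    # Stage 2: complete linkage at synteny level 1. Two clusters merge only
    # if every cross-cluster pair reaches level 1 or higher.
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            if all(s(g, h) >= 1 for g in clusters[i] for h in clusters[j]):
                clusters[i] |= clusters[j]
                del clusters[j]
                merged = True
                break
    return clusters


# Invented synteny levels for five hypothetical family members.
genes = ["hs1", "pt1", "mm1", "rn1", "rn2"]
level = {
    frozenset(("hs1", "pt1")): 2,
    frozenset(("pt1", "mm1")): 2,
    frozenset(("hs1", "rn1")): 1,
    frozenset(("pt1", "rn1")): 1,
    frozenset(("mm1", "rn1")): 1,
    frozenset(("rn1", "rn2")): 1,
}
clusters = two_stage_clusters(genes, level)
print(sorted(sorted(c) for c in clusters))
```

In this toy input, hs1 and mm1 join through pt1 in stage 1, rn1 joins in stage 2 because it reaches level 1 with every member, and rn2 stays apart, which is where an RT event or an old SD event would be suspected.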
Non-syntenic + low ICR -> RT event
Intron Conservation Filter for RT events
ICR (Intron Conservation Rate) = (# positionally orthologous introns) / (# total intron positions)
[Figure: ICR histogram of non-syntenic events; proportions (0-100%) in the ICR bins [0,1/3), [1/3,2/3), and [2/3,1].]
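A small Python sketch of the ICR and the resulting filter follows. Several details are assumptions rather than statements from the slides: intron positions are taken to be coordinates in a shared alignment, "total intron positions" is read as the union of positions in the two copies, and the low-ICR cutoff of 1/3 is borrowed from the first histogram bin. The intron counts (0, 2, and 3) echo the rat and dog example, but the positions themselves are invented.

```python
from typing import Optional, Set


def intron_conservation_rate(introns_a: Set[int],
                             introns_b: Set[int]) -> Optional[float]:
    """ICR = shared intron positions / all intron positions in either copy.

    Returns None when neither copy has introns, since the ratio is undefined.
    """
    total = introns_a | introns_b
    if not total:
        return None
    return len(introns_a & introns_b) / len(total)


def is_putative_rt(syntenic: bool, icr: Optional[float],
                   low_icr: float = 1 / 3) -> bool:
    """Non-syntenic + low ICR -> RT event (cutoff value is an assumption)."""
    return (not syntenic) and icr is not None and icr < low_icr


# Hypothetical alignment coordinates of CDS introns.
dog = {100, 250, 400}      # 3 introns
rat_b = {100, 250}         # 2 introns, both shared with dog
rat_a: Set[int] = set()    # intronless copy

icr_b = intron_conservation_rate(rat_b, dog)
icr_a = intron_conservation_rate(rat_a, dog)
print(icr_b, icr_a)
print(is_putative_rt(False, icr_a), is_putative_rt(False, icr_b))
```

Returning None for two intronless copies keeps the filter from confirming an RT event when intron structure carries no information.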
Hierarchical Clustering for Successive SD events
Pearson’s correlation coefficients
Hierarchical clustering (UPGMA)
[Figure: genes hs1, pt1, mm1, rn1, rn2, cf, and mm2 drawn as a graph in which line thickness corresponds to Pearson's correlation coefficient, next to the hierarchical syntenic clusters obtained from UPGMA.]
Result: a hierarchy of SD/speciation events, revealed by differing degrees of synteny
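UPGMA on correlation-based distances can be sketched in plain Python as below. The slides do not say which vectors are correlated, so the per-gene "profiles" here are a labeled assumption (for instance, a gene's synteny levels against a fixed set of family members), as is the distance 1 - r. The tree is returned as nested tuples.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson's correlation coefficient; 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def upgma(profiles: Dict[str, List[float]]):
    """Average-linkage (UPGMA) clustering on distance = 1 - Pearson's r."""
    size = {name: 1 for name in profiles}  # cluster -> number of leaves
    dist = {frozenset((a, b)): 1.0 - pearson(profiles[a], profiles[b])
            for a, b in combinations(profiles, 2)}
    while len(size) > 1:
        # Merge the closest pair; sort by str so the child order is stable.
        a, b = sorted(min(dist, key=dist.get), key=str)
        merged = (a, b)
        na, nb = size.pop(a), size.pop(b)
        # Distance to each remaining cluster is the size-weighted average.
        new_dist = {
            frozenset((merged, c)):
                (na * dist[frozenset((a, c))] + nb * dist[frozenset((b, c))])
                / (na + nb)
            for c in size
        }
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(new_dist)
        size[merged] = na + nb
    return next(iter(size))


# Invented profiles for four hypothetical family members.
profiles = {
    "hs1": [2, 2, 1, 0],
    "pt1": [2, 2, 1, 0],
    "mm1": [1, 2, 2, 0],
    "rn2": [0, 0, 1, 2],
}
tree = upgma(profiles)
print(tree)
```

Reading the nested tuples from the inside out gives the order of merges, which is how a hierarchy of successive SD or speciation events would be read off the clustering.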
Inferring SD and RT events
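[Figure: syntenic clusters from two-stage clustering for the genes hs0, pt0, hs1, pt1, mm1, rn1, rn2, cf, and mm2. Links between clusters are tested with two questions: loss of local synteny? intron conservation? Example outcomes shown: a primate RT event and a rodent SD event.]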
the Bursts of Retrotransposition Events on Mammalian Genomes
[Figure: distributions of dS (x-axis, 0 to 1; y-axis: proportions) for RT and SD events, with RT events also shown normalized by maximum. Panel B: primate RT vs. SD (0-100%). Panel C: rodent RT vs. SD (0-40%).]
of Constraint on Their Protein Coding Regions
[Figure: distributions of the dN/dS ratio (x-axis, 0 to 1; y-axis: counts normalized by maximum) for RT vs. SD events. Panel B: in-group. Panel C: rodent. Panel D: primate.]
No Evidence of Purifying Selective Pressure
[Figure, panel A: dN/dS ratio distribution (0 to 1; counts normalized by maximum) of rodent RT events, split into intact and inactivated (pseudogenized) copies, with a scatter plot of dN against dS (both 0 to 0.5) for rodent RT events and the line dN/dS = 0.5.]
"Functional" events: dN/dS ratio < 0.5
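The "functional" call reduces to a threshold on dN/dS, sketched below in Python. The 0.5 cutoff is from the slide; the event names and rate values are invented, and treating events with dS = 0 as not callable is an assumption.

```python
from typing import Dict, Tuple


def is_functional(dn: float, ds: float, cutoff: float = 0.5) -> bool:
    """Call an event functional when dN/dS is below the cutoff.

    Assumption: events with dS <= 0 have an undefined ratio and are not called.
    """
    if ds <= 0:
        return False
    return dn / ds < cutoff


# Hypothetical (dN, dS) estimates for four RT events.
events: Dict[str, Tuple[float, float]] = {
    "rt_1": (0.02, 0.20),  # ratio 0.10
    "rt_2": (0.15, 0.20),  # ratio 0.75
    "rt_3": (0.10, 0.30),  # ratio about 0.33
    "rt_4": (0.05, 0.00),  # undefined ratio
}
functional = sorted(name for name, (dn, ds) in events.items()
                    if is_functional(dn, ds))
print(f"{len(functional)} / {len(events)} functional:", functional)
```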
Equal Numbers of New Functional Genes
[Figure: species tree of Homo sapiens, Pan troglodytes, Mus musculus, Rattus norvegicus, and Canis familiaris, with divergence times (MYA) labeled 5, 41, 91, and 92, annotated with functional / assigned event counts per branch.]
Functional / assigned events on the three internal branches:
RT: 104 / 1,782; 46 / 522; 37 / 45; total 187 / 2,349 = 7.96%
SD: 53 / 161; 48 / 88; 47 / 52; total 148 / 301 = 49.17%
Assigned events on the whole tree: RT 12,078; SD 1,649
Other branch labels in the figure (their assignment to branches is not recoverable from the extracted text): 1,913 116 2,731 312 2,562 418 330 127 2,193 241; 134
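The pooled percentages for the three internal branches can be checked with a few lines of Python. The per-branch fractions are the ones printed on the tree; reading the first fraction of each pair as RT and the second as SD is an assumption, though it is the pairing under which the branch counts sum to the stated totals.

```python
from typing import List, Tuple

# (functional, assigned) events on the three internal branches, from the slide.
rt_branches: List[Tuple[int, int]] = [(104, 1782), (46, 522), (37, 45)]
sd_branches: List[Tuple[int, int]] = [(53, 161), (48, 88), (47, 52)]


def pooled(branches: List[Tuple[int, int]]) -> Tuple[int, int, float]:
    """Sum functional and assigned counts and return the pooled percentage."""
    functional = sum(f for f, _ in branches)
    assigned = sum(a for _, a in branches)
    return functional, assigned, 100.0 * functional / assigned


rt_total = pooled(rt_branches)
sd_total = pooled(sd_branches)
print("RT: %d / %d = %.2f%%" % rt_total)
print("SD: %d / %d = %.2f%%" % sd_total)
```

The functional counts (187 RT vs. 148 SD) are of the same order even though the RT mechanism generates far more copies, which is the point of the slide title.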
Duplication Event Identification II
[Flowchart: BranchClust* finds orthologous groups within ENSEMBL protein families, and duplication events are identified by simple parsimony. Each duplication event is then labeled Tandem, Distant, RT, or Unknown using three tests: whether the copies are syntenic (yes / no); whether paralogs lie within 10 flanking genes; and an intron filter (both copies have more than one intron and at least one conserved intron; one copy intronless and the other intron-bearing; otherwise).]
* Poptsova & Gogarten, BMC Bioinformatics 2007
All Predicted Gene Duplicates
[Figure: the species tree and functional / assigned event counts from "Equal Numbers of New Functional Genes", repeated: 187 / 2,349 = 7.96% and 148 / 301 = 49.17% on the three inner branches; 12,078 and 1,649 assigned events on the whole tree.]
Between Duplication Types
[Figure: panel C repeated from the earlier slides: the dS distribution (proportions, 0-40%) and the dN/dS ratio distribution (counts normalized by maximum) for rodent RT vs. SD events.]
Fate of Duplicates
Some RT copies are under stabilizing selective pressure
Is selective pressure asymmetric between duplicates?
Can we gain some insight into the regulatory elements of the duplicates?
Do RT copies co-opt the preexisting enhancers?
Does any disruption in flanking regions affect the fate of SD duplicates?
Pressure are Significantly Closer to the Nearest Upstream Genes
[Figure: distance to the nearest upstream gene for three RT classes (1, 2, 3); * marks a significant difference. The classes are drawn from the rodent RT dN/dS distribution (intact vs. inactivated copies) and carry the labels "under stabilizing selective pressure", "pseudogenizing?", and "pseudogenized". Schematic annotations ask whether an RT copy lands near an enhancer (co-option?) or in a gene desert.]
Gain of regulatory elements by co-opting
the Fate of SD Duplicates
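[Figure: schematic of a segmental duplication involving genes A and B (regulatory genes?) and a copy B'. Two outcomes for the copy: 1) syntenic; 2) disruption in the flanking region, with the question "relaxation of evolutionary constraint?". Selective pressure on the A and B copies around the GOI is scored as 1) symmetry or 2) asymmetry.]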
[Figure: frequency of asymmetry (y-axis, 0-40%) for RT, distant, and tandem duplicates.]
Segmental Duplicates Correlate With an Increase in a Relaxation of Constraint
[Figure: frequency of asymmetry (y-axis, 0-40%) for distant and tandem duplicates, split by upstream and downstream flanking regions and by syntenic vs. disrupted state; * marks a significant difference.]
Loss of regulatory elements by disruption in flanking region
Conclusion
A method to reconstruct ancestral relations independent of functional relations
Roughly equal contributions of new genes by SD and RT duplication mechanisms
Gain or loss of regulatory elements affects the fate of duplicates
1. RT copies co-opt preexisting enhancers
2. Disruptions in flanking regions correlate with increased asymmetry in distant SD copies
Dr. Craig Nelson
Dr. Ion Mandoiu
Paul Ryvkin
Edward Hemphill
Matthew Kozachek
Gerstein Lab
Gogarten Lab
ENSEMBL