
On the Origin and Fate of Gene Duplicates in Mammalian Genomes


(1)


Jin Jun, Edward Hemphill, Ion Mandoiu, Craig Nelson

University of Connecticut


Paul Ryvkin

University of Pennsylvania

(2)

Genome Evolution by Gene Duplication

Gene duplication has been recognized as a major force in mammalian genome evolution

~720 human-specific gene duplications (Zhang 2002)

DNA- vs. RNA-mediated duplication mechanisms

DNA-mediated

Occurs continuously

Contributed significantly to the divergence of gene content

RNA-mediated

Occurs quite frequently

Long believed to produce non-functional copies (pseudogenes) because the copy lacks the regulatory material of the parental gene

Functional retrocopies recently detected in the human and mouse genomes (Sakai et al. 2007; Vinckenbosch et al. 2006)

(3)

Characteristics of Duplication Mechanisms

DNA-mediated (segmental duplication or SD) copies

Syntenic to each other

Share introns

RNA-mediated (retrotransposed or RT) copies

Non-syntenic to parent gene

Intronless or share no introns with parent gene

Pseudogenes

The majority of retrotransposition events generate processed pseudogenes

Some SD copies degrade into non-processed pseudogenes

(4)

Mammalian Genomes in Our Study

(5)

Our Approach

Included pseudogenes in our analysis to capture more duplication events.

Avoided using coding-sequence similarity to reconstruct the evolutionary history of each gene family.

Local synteny information and gene structure information were used to distinguish SD events from RT events.

→ Able to measure functional constraint based on the ancestral evolutionary history.

(6)

Using Pseudogenes

[Figure: two scenarios for a human (Hs) and chimpanzee (Pt) gene family. Intact genes only: speciation, no duplication event, functional orthologs. With pseudogenes included: duplication followed by pseudogenization, one duplication event, ancestral orthologs.]

(7)

Measuring Local Synteny

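Three neighboring genes on each side (upstream and downstream) of the gene of interest (GOI) are considered

Homologous matches: BlastP score > 50 and sequence similarity > 80%

[Figure: two GOIs with their neighbors; homologous genes and weak homologous matches are marked. In the example shown, (# matched sides, # matches) = (2, 3).]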
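The counting rule on this slide can be illustrated with a minimal Python sketch. It assumes that each gene of interest comes with up to three upstream and three downstream neighbors, and that a neighbor may match a homolog on either side of the other gene; the slide does not say whether sides must correspond. The names `is_homologous_match` and `local_synteny` are hypothetical.

```python
def is_homologous_match(blastp_score, similarity_pct):
    """Thresholds from the slide: BlastP score > 50 and similarity > 80%."""
    return blastp_score > 50 and similarity_pct > 80


def local_synteny(neigh_a, neigh_b, is_homolog):
    """Return (# matched sides, # matches) for two genes of interest.

    neigh_a, neigh_b: dicts with keys "up" and "down", each a list of up to
    three neighboring gene IDs. is_homolog(x, y) -> bool is supplied by the
    caller. Assumption: a neighbor of gene A counts as one match if it has
    a homolog anywhere among the six neighbors of gene B.
    """
    all_b = neigh_b["up"] + neigh_b["down"]
    matched_sides = 0
    matches = 0
    for side in ("up", "down"):
        hits = sum(any(is_homolog(g, h) for h in all_b) for g in neigh_a[side])
        if hits:
            matched_sides += 1
        matches += hits
    return matched_sides, matches
```

With two matching upstream neighbors and one matching downstream neighbor, the function returns (2, 3), the class shown in the slide's example.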

(8)

FP and FN rates of Local Synteny

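[Figure: local synteny of various orthologs. Percentage (0% to 100%) of gene pairs in each (# matched sides, # matches) class, (0,0), (1,1), (1,2), (1,3), (2,2), (2,3), (2,4), (2,5), (2,6), shown for random pairs, Inparanoid orthologs, and Ensembl orthologs. The classes are grouped into synteny levels s_{i,j} = 0, 1, 2.]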

(9)

RT Miscall by Inparanoid

[Figure: gene structures and neighborhoods of three genes, labeled with their number of CDS introns: Rat A (0), Rat B (2), Dog (3). The loci lie on Chr3 (+ strand) and Chr9 (- strand), with spans of 6.76 Kb, 3.06 Kb, and 18.84 Kb. Legend: homology; orthologous introns; Inparanoid 1-to-1 ortholog; orthologs with 6 local matches and ICR = 2/3. Upstream and downstream neighbors of the GOI include Arpc5l, Gpr45, Tgfbrap1, Mrps9, Pou3f3, Olfml2a, Golga1, Ppp6c, RGD1560362, RGD1310553, RGD1566237, LOC690485, LOC690538, and the numeric IDs 20203, 20206, 20208, 20211, 20212, 20213, 20214, 24652.]

(10)

Duplication Event Identification

[Flowchart]

ENSEMBL Protein Families and PseudoPipe* pseudogenes → Gene + p-gene Families

Gene + p-gene Families, with Local Synteny Information → 2-stage Clustering Algorithm → Syntenic Clusters

Syntenic Clusters → Hierarchical Clustering Algorithm (find successive SD events; recover the old / larger gene families) → SD Events

Putative RT events recovered → Intron Conservation Filter (confirm RT events: Yes) → RT Events

*. Zhang et al. Bioinformatics 2006

(11)

Two-Stage Clustering Algorithm

Stage 1: Single-linkage clustering with synteny level 2

Stage 2: Complete-linkage clustering with synteny level 1

Considering the phylogenetic structure

Result: syntenic clusters

Members within a cluster: related by SD or speciation events

Between clusters: RT events, or old SD events that have lost local synteny
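A minimal Python sketch of the two stages follows. It assumes a symmetric function `synteny_level(a, b)` returning 0, 1, or 2, and it merges stage-2 clusters greedily in list order. The slides do not specify the merge order, and the phylogenetic consideration mentioned above is left out entirely. All names are hypothetical.

```python
from itertools import combinations


def two_stage_clusters(genes, synteny_level):
    """Cluster genes into syntenic clusters in two stages.

    Stage 1: single linkage at synteny level 2 (connected components).
    Stage 2: complete linkage at synteny level 1 (two clusters merge only
    if every cross pair reaches level >= 1).
    """
    # Stage 1: union-find over all pairs at level 2
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    for a, b in combinations(genes, 2):
        if synteny_level(a, b) >= 2:
            parent[find(a)] = find(b)

    groups = {}
    for g in genes:
        groups.setdefault(find(g), []).append(g)
    clusters = list(groups.values())

    # Stage 2: greedy complete-linkage merging (order is an assumption)
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            if all(synteny_level(a, b) >= 1
                   for a in clusters[i] for b in clusters[j]):
                clusters[i] = clusters[i] + clusters[j]
                del clusters[j]
                merged = True
                break
    return [sorted(c) for c in clusters]
```

Genes that end up in different clusters are the candidates for RT events or old SD events, which the later filters have to tell apart.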


(13)

Intron Conservation Filter for RT events

$$\mathrm{ICR}\ (\text{Intron Conservation Rate}) = \frac{\#\ \text{positionally orthologous introns}}{\#\ \text{total intron positions}}$$

Non-syntenic + low ICR → RT event

[Figure: ICR histogram of non-syntenic events, with bins [0, 1/3), [1/3, 2/3), [2/3, 1] on the x-axis and percentage (0% to 100%) on the y-axis.]
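As an illustration, the ICR and the filter rule can be written as a short Python sketch. It assumes that intron positions of both copies are expressed in a shared coordinate system (for example, columns of a coding-sequence alignment) and that "positionally orthologous" means an exact position match. The cutoff of 1/3 for "low ICR" is a hypothetical choice borrowed from the first histogram bin, not a value stated on the slide.

```python
from fractions import Fraction


def intron_conservation_rate(introns_a, introns_b):
    """ICR = shared intron positions / all intron positions of the pair.

    introns_a, introns_b: sets of intron positions in a common coordinate
    system. Returns None when both copies are intronless (ICR undefined).
    """
    total = introns_a | introns_b
    if not total:
        return None
    return Fraction(len(introns_a & introns_b), len(total))


def looks_like_rt(syntenic, icr, low_icr=Fraction(1, 3)):
    """Non-syntenic + low ICR -> RT event (low_icr is an assumed cutoff)."""
    return (not syntenic) and icr is not None and icr < low_icr
```

A copy with two of its partner's three introns gets ICR = 2/3, while an intronless copy paired with an intron-bearing gene gets ICR = 0 and passes the filter when the pair is also non-syntenic.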


(15)

Hierarchical Clustering for Successive SD events

Pearson’s correlation coefficients

Hierarchical clustering (UPGMA)

[Figure: example family with genes hs1, pt1, mm1, rn1, rn2, cf, and mm2 drawn as a graph in which line thickness corresponds to Pearson’s correlation coefficient, followed by the hierarchical syntenic clusters obtained from UPGMA.]

Result: a hierarchy of SD/speciation events, revealed by differing degrees of synteny
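The combination of Pearson's correlation and UPGMA can be sketched in plain Python. The slides do not say which vectors are correlated, so this sketch assumes each gene has a numeric synteny profile (for example, its synteny scores against a fixed list of family members) and uses 1 minus the correlation as the distance. The function names and the nested-tuple tree format are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def upgma(profiles):
    """profiles: {gene: vector}. Returns the hierarchy as nested tuples."""
    names = list(profiles)
    trees = {i: name for i, name in enumerate(names)}
    sizes = {i: 1 for i in trees}
    # Distance = 1 - Pearson correlation (an assumption of this sketch)
    dist = {(i, j): 1.0 - pearson(profiles[names[i]], profiles[names[j]])
            for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(trees) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        new_d = {}
        for k in trees:
            if k in (i, j):
                continue
            dik = dist[(min(i, k), max(i, k))]
            djk = dist[(min(j, k), max(j, k))]
            # UPGMA update: size-weighted average of the merged clusters
            new_d[k] = (sizes[i] * dik + sizes[j] * djk) / (sizes[i] + sizes[j])
        dist = {p: d for p, d in dist.items() if i not in p and j not in p}
        for k, d in new_d.items():
            dist[(k, next_id)] = d
        trees[next_id] = (trees.pop(i), trees.pop(j))
        sizes[next_id] = sizes.pop(i) + sizes.pop(j)
        next_id += 1
    return next(iter(trees.values()))


def leaves(tree):
    """Flatten a nested-tuple tree into the list of its gene names."""
    if isinstance(tree, tuple):
        return [g for sub in tree for g in leaves(sub)]
    return [tree]
```

Genes with highly correlated profiles are joined first, so the deepest splits of the returned tree separate the groups that share the least synteny.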

(16)

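Inferring SD and RT events

[Figure: hierarchical syntenic clusters by UPGMA, built on the syntenic clusters from two-stage clustering, for the example genes hs0, pt0, hs1, pt1, mm1, rn1, rn2, cf, and mm2. Two questions are asked of the branches: loss of local synteny? intron conservation? The example yields a primate RT event and a rodent SD event.]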

(17)

the Bursts of Retrotransposition Events on Mammalian Genomes

[Figure: values normalized by maximum.]

(18)

RT Events

[Figure, panels B and C: proportions of RT and SD events against dS (0 to 1). B: primate (y-axis 0% to 100%). C: rodent (y-axis 0% to 40%).]

(19)

of Constraint on Their Protein Coding Regions

[Figure, panels B to D: distributions of the dN/dS ratio (0 to 1) for RT and SD events, as counts normalized by maximum. B: in-group. C: rodent. D: primate.]

(20)

No Evidence of Purifying Selective Pressure

[Figure, panel A: dN/dS ratio distributions (0 to 1, counts normalized by maximum) of intact and inactivated (pseudogenized) rodent RT copies, with a scatter of dN against dS (both 0 to 0.5) for rodent RT events and the line dN/dS = 0.5.]

“Functional” events: dN/dS ratio < 0.5
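The "functional" criterion is simple enough to state as code. The sketch below uses the 0.5 cutoff from the slide. Treating events with dS = 0 as unclassifiable is an assumption, since the slide does not say how such cases were handled, and the function names are hypothetical.

```python
def dn_ds_ratio(dn, ds):
    """Return dN/dS, or None when dS is not positive (ratio undefined)."""
    if ds <= 0:
        return None
    return dn / ds


def is_functional(dn, ds, cutoff=0.5):
    """An event counts as "functional" when its dN/dS ratio is below the cutoff."""
    ratio = dn_ds_ratio(dn, ds)
    return ratio is not None and ratio < cutoff
```

For example, an event with dN = 0.05 and dS = 0.30 has a ratio of about 0.17 and counts as functional, while dN = 0.20 and dS = 0.30 gives about 0.67 and does not.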

(21)

Equal Numbers of New Functional Genes

[Figure: species tree of Homo sapiens, Pan troglodytes, Mus musculus, Rattus norvegicus, and Canis familiaris, with a time axis (MYA) labeled 5, 41, 91, 92. Each branch is annotated with RT and SD functional / assigned events.]

Three internal branches (functional / assigned events):

RT: 104 / 1,782; 46 / 522; 37 / 45. Total: 187 / 2,349 = 7.96%

SD: 53 / 161; 48 / 88; 47 / 52. Total: 148 / 301 = 49.17%

Whole tree (assigned events): RT 12,078; SD 1,649

Remaining branch labels as extracted: 1,913 116 2,731 312 2,562 418 330 127 2,193 241; 134
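The pooled percentages for the three internal branches follow directly from the per-branch counts on this slide. The short Python check below reproduces them; the grouping of the counts into an RT list and an SD list is inferred from the totals, and the variable names are hypothetical.

```python
# (functional, assigned) events on the three internal branches, from the slide
rt_branches = [(104, 1782), (46, 522), (37, 45)]
sd_branches = [(53, 161), (48, 88), (47, 52)]


def pooled(branches):
    """Sum functional and assigned events and return the pooled percentage."""
    functional = sum(f for f, _ in branches)
    assigned = sum(a for _, a in branches)
    return functional, assigned, 100.0 * functional / assigned


for label, branches in (("RT", rt_branches), ("SD", sd_branches)):
    functional, assigned, pct = pooled(branches)
    print(f"{label}: {functional} / {assigned} = {pct:.2f}%")
```

The absolute numbers of functional copies (187 RT, 148 SD) are of the same order even though the fractions differ by roughly a factor of six.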

(22)

Duplication Event Identification II

[Flowchart]

ENSEMBL Protein Families → BranchClust* (find orthologous groups) → identify duplication events by simple parsimony → Duplication Events

Each duplication event is then classified:

Syntenic (paralogs in 10 flanking genes)? Yes → Tandem

No → Intron filter:

Both copies with # introns > 1 and # conserved introns > 0 → Distant

Intronless + intron-bearing → RT

Otherwise → Unknown

*. Poptsova & Gogarten. BMC Bioinformatics 2007

(23)

All Predicted Gene Duplicates

[Figure: species tree of Homo sapiens, Pan troglodytes, Mus musculus, Rattus norvegicus, and Canis familiaris, with a time axis (MYA) labeled 5, 41, 91. Each branch is annotated with functional / assigned events. Three internal branches: 104 / 1,782, 46 / 522, and 37 / 45, total 187 / 2,349 = 7.96%; 53 / 161, 48 / 88, and 47 / 52, total 148 / 301 = 49.17%. Whole tree: 12,078 and 1,649 assigned events. Remaining branch labels as extracted: 1,913 116 2,731 312 2,562 418 330 127 2,193 241.]

(24)

Between Duplication Types

[Figure, panel C: proportions (0% to 40%) of rodent RT and SD events against dS (0 to 1).]

(25)

Between Duplication Types

[Figure, panel C: dN/dS ratio distributions (0 to 1) of rodent RT and SD events, as counts normalized by maximum.]

(26)

Fate of Duplicates

Some RT copies are under stabilizing selective pressure

Is selective pressure asymmetric between duplicates?

Can we gain insight into the regulatory elements of the duplicates?

Do RT copies co-opt preexisting enhancers?

Does disruption of the flanking regions affect the fate of SD duplicates?

(27)

Pressure are Significantly Closer to the Nearest Upstream Genes

[Figure: distance to the nearest upstream gene for three RT classes (1, 2, 3); an asterisk marks a significant difference. The classes are defined on the dN/dS distribution of intact and inactivated rodent RT copies, with the labels: under stabilizing selective pressure, pseudogenizing?, pseudogenized. Schematic of RT copies with the annotations: enhancer? co-option? gene desert?]

Gain of regulatory elements by co-opting

(28)

the Fate of SD Duplicates

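[Figure: schematic of a segmental duplication of gene A (regulatory genes?) giving copy B, whose neighborhood around the GOI is either 1) syntenic or 2) disrupted in the flanking region (B’), with the question: relaxation of evolutionary constraint? Possible outcomes for the pair A and B: 1) symmetry, 2) asymmetry.]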

(29)

Segmental Duplicates Correlate With an Increase in a Relaxation of Constraint

[Figure: frequency of asymmetry (0% to 40%) for RT, distant, and tandem duplicates; and frequency of asymmetry (0% to 40%) for distant and tandem duplicates with syntenic versus disrupted upstream and downstream flanking regions. An asterisk marks a significant difference.]

Loss of regulatory elements by disruption in flanking region

(30)

Conclusion

A method to reconstruct ancestral relations independent of functional relations

SD and RT duplication mechanisms contribute roughly equal numbers of new genes

Gain or loss of regulatory elements affects the fate of duplicates

1. Co-option of preexisting enhancers by RT copies

2. Correlation between disruption of flanking regions and increased asymmetry in distant SD copies

(31)

Dr. Craig Nelson

Dr. Ion Mandoiu

Paul Ryvkin

Edward Hemphill

Matthew Kozachek

Gerstein Lab

Gogarten Lab

ENSEMBL

InParanoid

RECOMB-CG ’08
