saskPrimer - an automated pipeline for design of intron-spanning PCR primers in non-model organisms


Publisher's version:

2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW), pp. 173-177, 2011



https://doi.org/10.1109/BIBMW.2011.6112371


Clarke, Wayne E.; Sidebottom, Christine; Parkin, Isobel; Sharpe, Andrew


NRC Publications Record:

https://nrc-publications.canada.ca/eng/view/object/?id=b8e3c867-2b90-47c5-be22-d916fe934256
https://publications-cnrc.canada.ca/fra/voir/objet/?id=b8e3c867-2b90-47c5-be22-d916fe934256


Wayne E Clarke∗, Christine Sidebottom‡, Isobel Parkin† and Andrew Sharpe‡

∗Department of Computer Science, University of Saskatchewan, 176 Thorvaldson Bldg., 110 Science Place, Saskatoon, Saskatchewan, Canada S7N 5C9
Email: wayne.clarke@usask.ca

†Agriculture and Agri-Food Canada, 107 Science Place, Saskatoon, Saskatchewan, Canada S7N 0X2

‡NRC Plant Biotechnology Institute (NRC-PBI), National Research Council of Canada, 110 Gymnasium Place, Saskatoon, Saskatchewan, Canada S7N 0W9

Abstract—

Summary: Robust and automated Polymerase Chain Reaction (PCR) primer design is an important prerequisite to many strategies for large-scale discovery of nucleotide variation, specifically Single Nucleotide Polymorphisms (SNPs). In many cases the design of PCR primers that amplify multiple members of gene families in complex genomes is complicated by the desire to design primers that amplify non-coding regions of the target organism's genome. This is especially difficult in organisms that do not have a fully sequenced genome, which require further time-intensive procedures. Thus, this phase of SNP discovery is often a bottleneck for the overall process.

To increase the efficiency of designing conserved, intron-spanning, gene-family-specific primers, an automated pipeline was developed that streamlines the process by reducing the dependency on human participation. The automated design process is shown to significantly reduce primer design time and human participation in comparison to the semi-automated approach employed previously. The increase in performance comes with a modest reduction in overall PCR efficiency but does not significantly reduce the total number of amplified PCR products. The pipeline was tested extensively using the target organism Brassica napus with the reference organism Arabidopsis thaliana, with an overall amplification success of 80.5% of the reference inputs.

Availability: By request.

Keywords-Single Nucleotide Polymorphism, SNP, Primer, Pipeline;

I. INTRODUCTION

The detection of Single Nucleotide Polymorphisms (SNPs) is an important tool in human, animal, and plant genetics since this natural variation can be utilized for the development of genetic markers that can identify genes causing susceptibility to complex diseases or other traits of interest [1]. A common technique used for SNP discovery is the sequencing of genomic DNA extracted from several diverse genotypes. Polymerase Chain Reaction (PCR) primers are designed to amplify segments of DNA that can be readily sequenced; these segments are often determined using genes of interest or Expressed Sequence Tags (ESTs) [1]. PCR is performed to amplify multiple members of a gene family (a group of genes that are closely related, descended from a common ancestral gene, and encoding similar products) and the products are sequenced (Fig. 1). The sequenced products are then analyzed for variations.

PCR primers are often designed to span non-coding regions of genes (introns) as these regions typically contain greater variation [1]. For known intronic regions, primers can often be designed using software such as Primer3 [2]. However, in many species (including the major crop species), the locations of intronic regions within the genome are unknown because these species do not have fully sequenced genomes. Fortunately, by comparing available expressed gene sequence collections (ESTs) from the target organism to the reference sequence of a fully sequenced, closely related model organism, the positions of the conserved exonic coding regions can be determined and the intronic regions inferred. There are many reference organisms available that cover a broad range of target species: rice is a common model organism for grasses (including wheat) [3], Medicago truncatula is a common model for legumes [4], Arabidopsis thaliana is a common model for members of the mustard (Brassicaceae) family, the mouse and rat genomes are common models for mammalian organisms, and there are many other model organisms (www.nih.gov/science/models/). Depending on the application, it may be desirable to design primers for all or part of the target sequence.

A common semi-automated technique for identifying conserved intron-spanning primer pairs requires several steps (Fig. 2). First, one can use BLAST [5] to determine a set of EST sequences that are conserved with respect to a reference from a model organism. Next, the EST hits can be assembled using the model sequence as a reference and utilizing a high level of similarity to further screen for conserved sequences. This produces an alignment of all the ESTs against the model reference sequence. However, alignments often require iterative improvement to produce an alignment with appropriate intron/exon boundaries. Once an alignment has been determined with appropriate intron gaps, an intron that spans the optimum number of bases must be found. If no such intron can be found, the next most optimal intron must be identified. Once an intron has been identified for primer design, the alignment is exported and primers are picked. Primers are generally chosen to meet certain requirements for length, melting temperature, and complementarity, which often requires several iterations of the picking process.

[Figure 1: schematic of a gene region (UTR, Exon, Intron, Exon, Intron) with primers P1 and P2, and product labels N2, N12, N3, N13, N10, N19.]

Figure 1. Amplification of multiple gene families using a single primer pair.

Previous attempts to automate primer design have focused on automating the design of individual primer pairs, as discussed by [6]. Several programs are also available for designing PCR primers for SNP discovery [6], [7], [8], [9], [10]. However, none of these programs is appropriate here, for one of the following reasons: the inability to infer the intron/exon structure [6], [7], the limited number of organisms with which it can be used [8], [10], or its availability [9]. Therefore, currently the only option for researchers is to perform a semi-automated primer design using a variety of bioinformatics tools to first infer the intron/exon structure and then design the primers. This often requires manual refinement or parsing of the output of each phase of the intron/exon structure determination and primer design.

[Figure 2: flow chart. Nodes: Reference Ids, Target EST database, BLAST, Fasta of EST hits, Assembly, Alignment, Finished alignment, Primer3, Designed Primers. Manual steps: creation of Fasta, Blast report processing, refinement of assembly, refinement of alignment, input formatting, parsing of output.]

Figure 2. Flow chart describing the semi-automated method for primer design.

II. APPROACH

The approach described here was to replicate the methodology of the semi-automated process performed by researchers in a fully automated, non-interactive manner (Fig. 3). This is accomplished by using computational steps that are similar to each stage in the semi-automated method. For simplicity the pipeline requires only four input parameters: the path to a directory containing a list of reference sequence identifiers, the name of the EST BLAST database for the target organism, a cdbfasta formatted index of the EST fasta file, and a cdbfasta formatted index of the reference sequences. To maintain flexibility the pipeline only requires that the reference sequence contain intronic and exonic sequence, which includes but is not limited to a sequenced Bacterial Artificial Chromosome (BAC) or a gene model.

[Figure 3: flow chart. Nodes: Reference Ids, Target EST database, BLAST, Fasta of EST hits, CAP3, Kalign, Finished alignment, Primer3, Designed Primers. Automated steps: creation of Fasta, Blast report processing, assembly, alignment, input formatting, parsing of output.]

Figure 3. Flow chart of the automated primer design process showing the transition from manual intervention (Fig. 2) to fully automated.

Output from each program is parsed and formatted for input into the next program using custom scripting in order to reduce the need for human participation. The pipeline uses the open-source bioinformatics tools BLAST, cdbfasta and cdbyank [11], CAP3 [12], Kalign [13], and Primer3. Cdbfasta is a program designed to create an index of a FASTA file in order to provide fast access to individual sequences, and cdbyank is a program designed to utilize the index file created by cdbfasta to extract a sequence record from the FASTA file. These programs are used in the pipeline to extract the desired reference sequences. The reference sequence is blasted against the EST sequences from the organism of interest, and the blast reports are parsed to determine overlapping high-scoring pairs (HSPs). The HSPs represent the conserved sequence fragments between the EST sequences and the reference sequence. The overlapping HSP fragments are assembled into a conserved region using CAP3. These conserved regions (exons) are then aligned to the reference sequence using Kalign (Fig. 4). From this alignment a consensus sequence can be produced with exons separated by intronic regions. Each nucleotide of the intron sequence is translated to the ambiguity character (N). This allows Primer3 to design primers that span intronic regions. The consensus sequence is formatted into the Primer3 input format and Primer3 is run on each input file. The resulting output files are parsed to extract forward and reverse primer sequences, their melting temperatures, and the predicted size of the amplified sequence.
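The core of this step (grouping overlapping HSPs, filling inferred introns with N, and handing the result to Primer3) can be illustrated with a minimal sketch. This is not the pipeline's own code: the function names are hypothetical, HSPs are reduced to coordinate intervals on the reference, and exon bases are copied from the reference rather than taken from the CAP3/Kalign consensus. The record uses Primer3 2.x Boulder-IO tag names, which is an assumption about the Primer3 version.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based, half-open [start, end) on the reference


def merge_overlapping(hsps: List[Interval]) -> List[Interval]:
    """Merge overlapping (or touching) HSP intervals into conserved regions."""
    merged: List[Interval] = []
    for start, end in sorted(hsps):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def mask_introns(reference: str, exons: List[Interval]) -> str:
    """Keep bases inside exons; replace every other position with N.

    Simplification: exon bases come from the reference here, whereas the
    paper derives them from the aligned EST contigs.
    """
    template = ["N"] * len(reference)
    for start, end in exons:
        template[start:end] = reference[start:end]
    return "".join(template)


def primer3_record(seq_id: str, template: str, intron: Interval,
                   size_range: Tuple[int, int] = (200, 1100)) -> str:
    """Format one Boulder-IO record asking for primers that flank `intron`."""
    start, end = intron
    lines = [
        f"SEQUENCE_ID={seq_id}",
        f"SEQUENCE_TEMPLATE={template}",
        f"SEQUENCE_TARGET={start},{end - start}",  # start,length
        f"PRIMER_PRODUCT_SIZE_RANGE={size_range[0]}-{size_range[1]}",
        "=",  # record terminator
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    reference = "ACGTACGTACGTACGTACGT"
    exons = merge_overlapping([(2, 6), (0, 4), (12, 16), (15, 20)])
    template = mask_introns(reference, exons)
    print(exons)      # [(0, 6), (12, 20)]
    print(template)   # ACGTACNNNNNNACGTACGT
    print(primer3_record("example", template, (6, 12)), end="")
```

Because the intron positions carry only N characters, primer sites can fall only in the exon sequence on either side, which is what makes the resulting pairs intron-spanning.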

This approach can be used to find a single primer pair per contig, by iterating over the designed primers and choosing the pair that most closely meets a set of criteria such as melting temperature or product size. It can also be used to design primers that span all or part of the target sequence, by looking for a set of primers that cover the target sequence (Fig. 5).
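The single-pair case can be sketched as follows. The example parses one Primer3 output record and picks the pair closest to a target melting temperature and product size. The tag names follow Primer3 2.x output; the target values, the weighting in the penalty, and the function names are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, NamedTuple


class PrimerPair(NamedTuple):
    left: str
    right: str
    left_tm: float
    right_tm: float
    product_size: int


def parse_primer3_output(text: str) -> List[PrimerPair]:
    """Collect all primer pairs from a single Boulder-IO output record."""
    tags: Dict[str, str] = {}
    for line in text.splitlines():
        if "=" in line and line != "=":
            key, value = line.split("=", 1)
            tags[key] = value
    pairs: List[PrimerPair] = []
    i = 0
    while f"PRIMER_LEFT_{i}_SEQUENCE" in tags:
        pairs.append(PrimerPair(
            left=tags[f"PRIMER_LEFT_{i}_SEQUENCE"],
            right=tags[f"PRIMER_RIGHT_{i}_SEQUENCE"],
            left_tm=float(tags[f"PRIMER_LEFT_{i}_TM"]),
            right_tm=float(tags[f"PRIMER_RIGHT_{i}_TM"]),
            product_size=int(tags[f"PRIMER_PAIR_{i}_PRODUCT_SIZE"]),
        ))
        i += 1
    return pairs


def best_pair(pairs: List[PrimerPair], target_tm: float = 60.0,
              target_size: int = 600) -> PrimerPair:
    """Return the pair with the smallest deviation from the targets.

    The penalty (degrees of Tm deviation plus size deviation per 100 bp)
    is an illustrative choice, not the paper's criterion.
    """
    def penalty(p: PrimerPair) -> float:
        tm_dev = abs(p.left_tm - target_tm) + abs(p.right_tm - target_tm)
        size_dev = abs(p.product_size - target_size) / 100.0
        return tm_dev + size_dev

    return min(pairs, key=penalty)
```

Designing primers across the whole target would instead keep several pairs whose products together cover the sequence; that selection is not shown here.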

III. EVALUATION

The hardware used for evaluation of the pipeline was an HP ProLiant DL385 server running the CentOS 4.5 Linux distribution. This machine houses two AMD 2.2 GHz processors and 16 GB of RAM.

A. Single Primer Pair

In order to accurately compare the two techniques, both the semi-automated and fully automated design techniques were performed. The target/reference organism pair Brassica napus/Arabidopsis thaliana, which share a level of sequence similarity of 87% in exons [14], was used to evaluate the two approaches. The time required to design 100 primer pairs using the semi-automated approach was compared to the time required to design 100 primer pairs using the fully automated pipeline. In our tests the automated pipeline designed 100 primer pairs in less than 10 minutes (an average of 9 minutes and 16 seconds over 5 trials). This is a significant improvement over the semi-automated approach, which required approximately 1 hour to produce 2 primer pairs, or 50 hours per 100 primer pairs (C. Sidebottom, personal communication).

To evaluate differences in the performance of the semi-automated and fully automated approaches, primers were designed using both methods and the results compared. For the semi-automated approach, 133 primer pairs were designed and PCR carried out, resulting in 88% successful amplification. For the automated approach, 256 Arabidopsis gene models were selected for input to the pipeline. Of the 256 selected gene models, 99% successfully had primers designed for them. PCR was then performed on the resulting primer pairs, with an average amplification rate of 81.5%. Thus, overall, 80.5% of all gene model inputs resulted in successful amplification.
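As a quick check of how the reported figures relate to one another, the arithmetic can be worked through in a few lines. The inputs are the rounded values quoted in this section; the variable names are illustrative. Note that the timing comparison sets human design time against pipeline run time, and that the product of the two rounded rates (about 80.7%) is close to, but not exactly, the reported 80.5%, presumably because the published percentages are rounded (an assumption).

```python
# Design time for 100 primer pairs, using the figures quoted in the text.
semi_automated_seconds = 50 * 3600      # about 50 hours
automated_seconds = 9 * 60 + 16         # 9 minutes 16 seconds on average
speedup = semi_automated_seconds / automated_seconds

# Overall success per gene model input = design rate x amplification rate.
design_rate = 0.99
amplification_rate = 0.815
overall = design_rate * amplification_rate

print(f"speed-up: about {speedup:.0f}x")     # about 324x
print(f"overall success: {overall:.1%}")      # 80.7%
```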

B. Multiple Primer Pairs

Multiple primer pairs were designed to span 22 target sequences. There were 90 primer pairs chosen in a size range of 200-1100 base pairs in order to maximize the coverage of the target sequences. Of these 90 primer pairs there were 83 successful PCR amplifications resulting in 92% efficacy. Sizes of the amplified products were checked and the majority were within the size range expected. Variations in the sizes of the products versus the expected sizes are explained by variation in the intron sizes between the model organism and the target organism.

IV. CONCLUSIONS

By creating a pipeline that fully automates the process of designing family specific PCR primers, a significant decrease in the design time for primers can be achieved. This increase in time efficiency comes at a small cost in the overall success of PCR amplifications. For many researchers the improvement in design time will outweigh the small reduction in overall PCR reaction success. Further, this pipeline can be utilized for any organism that has a closely related reference species available. The pipeline also lends flexibility to the use of the reference sequences as it can utilize a variety of reference types including finished and unfinished BACs, pseudochromosomes, or gene models. Based on this flexibility and the ability to infer the positions of intronic regions using a reference organism, saskPrimerFS has broad applicability for SNP discovery, SNP marker development, genetic mapping and marker assisted selection in a wide range of important organisms.

[Figure 4: three panels over a gene model with UTR, exon and intron regions. (a) Align Target ESTs to Reference Model. (b) Assemble High Scoring Pairs into Contigs. (c) Align Contigs to Reference.]

Figure 4. The process of determining intron/exon structure by first aligning EST sequences to gene model (a), then assembling the fragments that overlap (b), and placing those contigs back against the model (c). Once the contigs have been placed back against the model, the distance between them can be determined and a consensus sequence derived by filling the intronic space with N characters.

ACKNOWLEDGEMENTS

We would like to thank Matthew Links of Agriculture and Agri-Food Canada for valuable feedback during the design process. Funding was provided by Matching Industry Initiative (MII) funds provided by Agriculture and Agri-Food Canada.

REFERENCES

[1] Rafalski A: Applications of single nucleotide polymorphisms in crop genetics. Current Opinion in Plant Biology 2002, 5:94–100.

[2] Rozen S, Skaletsky H: Primer3 on the WWW for general users and biologist programmers. Methods Mol. Biol. 2000, 132:365–386.

[3] Lazo G, Chao S, Hummel D, Edwards H, Crossman C, Lui N, Matthews D, Carollo V, Hane D, You F, Butler G, Miller R, et al: Development of an expressed sequence tag (EST) resource for wheat (Triticum aestivum L.): EST generation, unigene analysis, probe selection, and bioinformatics for a 16,000-locus bin-delineated map. Genetics 2004, 168:585–593.

[4] Town C: Annotating the genome of Medicago truncatula. Curr Opin Plant Biol 2006, 9:122–127.

[5] Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller N, Lipman D: Gapped BLAST, PSI-BLAST: a new generation of protein database search programs. Nucl. Ac.

[Figure 5: two gene schematics (UTR, Exon, Intron, Exon, Intron, Exon, UTR). Panel (a) shows primer pairs P1, P2 and P3; panel (b) shows primer pairs P1 and P2.]

Figure 5. Primers can be designed to generate a single amplified fragment per contig (a); usually the most optimal of a set of primers (P1) is chosen. Primers can also be designed to amplify across the gene space (b).

[6] Weckx S, Rijk PD, Broeckhoven CV, Del-Favero J: SNPbox: a modular software package for large-scale primer design. Bioinformatics 2005, 21(3):385–387.

[7] Fredslund J, Schauser L, Madsen LH, Sandal N, Stougaard J: PriFi: using a multiple alignment of related sequences to find primers for amplification of homologs. Nucl. Ac. Res. 2005, 33:W516–520.

[8] Fredslund J, Madsen L, Hougaard B, Nielsen A, Bertioli D, Sandal N, Stougaard J, Schauser L: A general pipeline for the development of anchor markers for comparative genomics in plants. BMC Bioinformatics 2006, 7:207–216.

[9] You F, Lazo G, Gu Y, Renfro J, D EM, Akhunov E, McGuire P, Dvorak J, Anderson O: Conserved Intron-Spanning Primer Design and Application In Wheat SNP Discovery. Plant & Animal Genome Conference 2006.

[10] Hu Z, Glenn K, Ramos A, Otieno C, Reecy J, Rothschild M: Expeditor: A Pipeline for Designing Primers Using Human Gene Structure and Livestock Animal EST Information. J. of Heredity 2005, 96:80–82.

[11] Pertea G, Huang X, Lian F, Antonescu V, Sultana R, Karamycheva S, Lee Y, White J, Cheung F, Parvizi B, Tsai J, Quackenbush J: TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets. Bioinformatics 2003, 19:651–652.

[12] Huang X, Madan A: CAP3: A DNA sequence assembly program. Genome Res. 1999, 9:868–877.

[13] Lassmann T, Sonnhammer EL: Kalign–an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics 2005, 6:298.

[14] Cavell A, Lydiate D, Parkin I, Dean C, Trick M: Collinearity between a 30-centimorgan segment of Arabidopsis thaliana chromosome 4 and duplicated regions within the Brassica napus genome. Genome 1998, 41:62–69.