• Aucun résultat trouvé

Comparative genomic workflow

N/A
N/A
Protected

Academic year: 2021

Partager "Comparative genomic workflow"

Copied!
7
0
0

Texte intégral

(1)

Comparative genomic workflow

The MIT Faculty has made this article openly available.

Please share

how this access benefits you. Your story matters.

Citation

Rajapakse, J. et al. “Comparative genomic workflow.” Engineering

in Medicine and Biology Magazine, IEEE 28.4 (2009): 19-24. ©2009

IEEE.

As Published

http://dx.doi.org/10.1109/MEMB.2009.932910

Publisher

Institute of Electrical and Electronics Engineers

Version

Final published version

Citable link

http://hdl.handle.net/1721.1/60264

Terms of Use

Article is made available in accordance with the publisher's

policy and may be subject to US copyright law. Please refer to the

publisher's site for terms of use.

(2)

© PHOTODISC

Comparative

Genomic Workflow

W

e present a comparative genomic workflow to find conserved regions in noncoding parts of genomic sequences. The workflow is designed for cross-species comparisons of all available verte-brate genome sequences, which involves high-throughput sequence alignment and high-performance computational analysis. Therefore, an implementation on a computational grid infrastructure for efficient mapping of comparison tasks is proposed. Results of comparison of mouse and human genomes are illustrated. A relational database will be created to store conserved noncoding regions among all sequenced vertebrate genomes, which enables scientists to determine the functional significances of conserved segments.

The premise of comparative genomics is that a DNA seg-ment conserved across species over the evolution is likely to have some important biological function [1]. Coding DNA sequences conserved from the last common ancestor should be preserved in contemporary genome sequences [2]. Compar-ative genomics involves comparison of whole genomes across species to identify common segments of DNA or protein sequences and building evolutionary profiles. Comparative genomics applications of coding sequences have been able to detect most known genes [2], [3]. Our short-term goal in this project is to detect regions of noncoding DNA conserved across vertebrate genomes and the long-term aim is to under-stand potential regulatory functions of the conserved segments in noncoding DNA. This is an essential but challenging prob-lem in future bioinformatics [4], because noncoding regions, which account for about 95% of the genomes, are widely believed not to be junk DNA as perceived before but rather to be important regions of genomes remaining to be investigated for their function in the postgenomic area [5].

We have taken a comparative genomic approach, as the comparison of genomes of interspecies has facilitated the prediction and verification of noncoding RNA genes [6], [7]. For instance, comparative genomics of the human 5q31 region and murine chromosome 11 has identified a coordinate regula-tor of interleukins IL-4, IL-13, and IL-5 [8]. The techniques for the discovery and characterization of conserved regions in non-coding genomic sequences could have numerous applications

in future as more and more genomic sequences become avail-able. In this project, vertebrate sequences were chosen because such conserved segments are likely to have regulatory functions of biological processes of human.

This article describes a workflow for identifying con-served patterns in noncoding regions of vertebrate genomes, with an intention of investigating possible functions of the conserved regions. The annotations of genomes are collected from the Ensembl database [9]. The sequences are then arranged to use for sequence alignment with basic local alignment search tool Z (BLASTZ) [10], which finds gap-free alignments of at least q% identity and l b/s in length. All the conserved noncoding regions identified are stored in a relational database. An user-friendly Web interface provides easy access to conserved regions and related information and visualization capabilities.

To account for the compute- and data-intensive nature of the processes [11], the workflow was implemented on a computing grid architecture that offers higher computing power at a reasonable cost from already-available resour-ces. The implementation has to consider the heterogeneous nature of computing and data resources on the grid and the effective ways of running applications in parallel fashion [12]. We propose to use the master-slave paradigm for com-munication and dynamic load balancing for scheduling par-allel tasks.

Besides commonalities with available noncoding databases [13], the present database of conserved noncoding regions provides 1) conserved noncoding regions in untranslated regions (UTRs: those regions which are transcribed to RNA but not translated), 2) additional information such as the genomes and locations of the conserved noncoding regions, and 3) cross-species comparisons among all available verte-brate genomes, such as human, mouse, chimp, chicken, zebra-fish, tetraodon, and fugu.

This manuscript is organized as follows. The ‘‘Materials and Methods’’ section introduces the workflow for identifying conserved noncoding regions and provides a description of mapping tasks onto a computational grid. The ‘‘Experiments and Results’’ section demonstrates the efficacy of the grid-enabled workflow. Finally, the ‘‘Conclusions’’ section ends with an outlook of future work.

BY JAGATH C. RAJAPAKSE, POOJA, CHUNXI CHEN, AND SY-LOI HO

PATTERN

DISCOVERY

IN

BIOINFORMATICS

Discovery of Conserved Noncoding DNA Patterns

(3)

Materials and Methods

Steps in Identifying Conserved Noncoding Regions

1) Annotations of vertebrate sequences: The Ensembl project [9] provides a comprehensive and integrated source of annotations of large genome sequences. It employs MySQL relational databases to store sequence informa-tion and provides a comprehensive set of applicainforma-tion pro-gram interfaces (APIs) to serve as a middle layer between the database and application programs. The vertebrate genomic sequences obtained from the Ensembl database are stored, and the necessary (BioPerl) APIs are installed locally.

2) Extraction of noncoding sequences: Noncoding DNA segments include introns, UTRs, pseudogenes, and inter-genic regions, which code for neither proteins nor stable RNA [7]. We merge all noncoding regions into a single sequence and supplement with the information of the cor-responding organism, chromosome number, noncoding region type (i.e., 5-UTR, introns, 3-UTR, or intergenic regions), and starting and ending locations.

3) Sequence alignment: We divide or concatenate noncoding sequences to form segments of length L Mb/s, allowing for

gaps and overlaps, say D b/s, between adjoining segments. The overlapping lengths should be chosen sufficient enough to extract conserved regions given a maximum length d b/s. BLASTZ, an extension to BLAST [14] pair-wise local alignment tool, was used as an alignment tool [10], [11]. BLASTZ is found to be suitable for aligning long noncoding sequences that have evolved neutrally [15]. Based on a three-step strategy [14], [16], BLASTZ first removes lineage-specific interspersed repeats from each sequence and searches for short near-perfect matches between the two sequences [11]. Each match is then extended gap-free, using dynamic programming, until the alignment score remains above a specified threshold. 4) Detecting conserved noncoding regions: A criteria to

identify conserved regions from BLASTZ alignment needs to be defined: for example, gap-free alignments of at least 70% identity and at least 100 b/s in length have been widely used for identifying the conserved regions [3]–[6], [8], [13], [17]. Since the alignments are usually weaker in the noncoding regions, we additionally searched for all conserved regions complying with gap-free alignments of at least 50% and at least 20 b/s in length. This provides us with four versions of conserved noncoding regions in our database: q¼ 50% or 70%, and d ¼ 20 or 100 b/s. 5) Database of conserved noncoding regions: The conserved

noncoding regions are imported into a custom-designed database along with the information about their genomes and locations. The database can be accessed through a user-friendly Web interface, which allows users not only to the data but also to compare any two chromosomes of two vertebrates and find the conserved regions by setting the own alignment and length parameters.

Grid-Enabled Pipeline

Sequence alignment by using BLASTZ accounts for most of the running time of the workflow. We use a computational grid to speed up the alignment process, which allows one to utilize the distributed computing and memory resources at a reasona-ble cost [18], [19]. In a computational grid, heterogeneous computing servers and data are distributed at distinct locations connected through Intranet/Internet. A major challenge in grid computing is how to efficiently find available resources and data and distribute the tasks among the processing nodes. The development of applications in a grid-computing environment needs to handle the following characteristics of resources [12]. ä Computing servers having different computing powers: For example, computing machines on the grid may have processors having different speed and architecture, mem-ory, storage, and software platform.

ä Sharing of resources: The grid should allow for several users’ tasks running at the same time; the effective usage

CNS Databases End Update CNS Databases Analyze Alignment Results Perform BLASTZ Tasks Get Sequences from Local Hard

Disks Get a Task Description and Delete It from the

Database Remaining Tasks? N Y N Y

Initialize Processes Using GLOBUS-JOB-RUN Processi Data Ready? Ensembl Databases Grid Task Database Extract Noncoding Sequences Tailor Sequences to Feed to BLASTZ

Fig. 1. The workflow of the grid-enabled pipeline using dy-namic load balancing.

C

oding DNA sequences conserved

from the last common ancestor should be

preserved in contemporary

genome sequences.

IEEE ENGINEERING IN MEDICINE AND BIOLOGY MAGAZINE JULY/AUGUST 2009

(4)

of CPUs in an application depends on the number of jobs running at a computing node at one time.

ä The different network infrastructures that link grid resources: They may have different protocol and perform-ance standards.

To parallelize an application efficiently on a computational grid, the implementation should minimize data transfer on low-speed connections and allocate the tasks to a node that can offer the required computational power to the application at that time. This assures that no processor or connection would become a bottleneck to an application on the grid.

We illustrate a grid-enabled implementation of comparative genomics pipeline in Figure 1 with Grid-BLASTZ [18]. The par-allelization of the sequence alignment in BLASTZ is embedded in the workflow such that extracted noncoding sequences can be processed simultaneously at different computing nodes. The lengths of the input segments determine the granularity of sub-tasks, and the length of overlaps between the adjacent segments determines the shortest length of target-conserved segments.

Previously, a static load balancing approach was presented in [19] for mapping BLAST applications [16] onto grid archi-tectures. However, it achieves reasonable performance only when the execution of a task is uninterrupted by other applica-tions running at that node. Therefore, we have chosen a dynamic load balancing approach for the grid implementation so that grid-based applications are able to scale well even when there are other applications running over the resources in the grid. With dynamic load balancing, jobs are allocated dynami-cally to idle nodes to keep all nodes busy and to minimize the overhead of the load balancer. Dynamic load balancing has been increasingly used in several bioinformatics applications, which works similar to a server–client model [20]. For exam-ple, fastDNAml [21] and a profile hidden Markov model soft-ware (HMMER) [22] have reported an increased efficiency by using a dynamic load-balancing strategy.

As shown in Figure 1, the processes are created by invoking the tasks [23]. The grid-task database serves as the master process that stores the descriptions of all BLASTZ processes for comparative tasks. A typical task would be to compare a specific segment in mouse genome with a specified segment in a human genome. The slaves, such as computing nodes, keep retrieving exclusively task descriptions from the master once they are free and then execute the corresponding BLASTZ tasks until all subtasks are completed.

Experiments and Results Testbed

A testbed comprising of three Linux clusters and a database server, at two research centers at the Nanyang Technological

University (NTU), Singapore, was used to test the perform-ance of the workflow. Two clusters are located at the BioInfor-matics Research Center (BIRC), one with eight nodes of Intel Itanium-1 733-MHz processors with 1 GB internal memory and the other with four nodes of advanced microdevices (AMDs) Opteron 2.60-GHz processor with 8 GB internal memory. The third cluster is situated at the parallel and dis-tributed computing center (PDCC), consisting of eight nodes of Intel Xeon 2.60-GHz processor with 1 GB internal memory. The clusters are internally connected with a Myrinet. An Ethernet switch was used as an intercluster connection. The grid middleware on the clusters was GLOBUS 4.0.5 [23]. The sun grid engine (SGE) [24] was installed as a distributed resource management (DRM) software, which allocates grid tasks from the control node to execution nodes inside a cluster. The grid-task database and conserved noncoding database were stored on a Sun Fire X4200 server at BIRC.

Conserved Noncoding Sequences

Here we present the results of comparison of noncoding sequen-ces extracted from human chromosome 21 and mouse chromo-some 16 noncoding sequences. The human chromochromo-some 21 was split into 12 noncoding segments of length L¼ 5 Mb/s with an overlap D¼ 100 b/s allowed between adjacent segments. Longer segment length may slightly reduce the number of over-lapping nucleotides but increases the computational demands on the grid and the turnaround time due to data transmission. This would decrease the grid throughput and likely make the grid less sensitive to disturbances. As we are interested in extracting conserved segments in noncoding regions of at least 20 b/s, these settings are adequate to guarantee that no candi-date-conserved segment is missed. The mouse chromosome 16 was split into 21 noncoding segments of the same configura-tion. The UTRs were also considered as noncoding sequences and extracted as candidate sequences for aligning. Previously, the Penn State, Berkeley, and the University of California, Santa Cruz (UCSC)-conserved noncoding sequence projects [13] used the Human Dec 2001 and MGSCv3 flat files as the source sequences. The complete UTRs are not all represented in these flat files. Moreover, repetitive sequences in these flat files are represented with lowercase characters and would be masked by alignment tools such as BLASTZ. These numbers of noncoding segments made a total of 252 subtasks for the grid. In BLASTZ, the threshold for maximal scoring segment pair (MSP) represents the sensitivity factor. For restricting gaps in alignments to be at least 100 b/s with the identity as low as 50%, where longer gaps are not harshly penalized, we set the MSP threshold at 3,400 [10].

The conserved noncoding segments in the database are associated with adequate supplementary information. Table 1

Table 1. Examples of conserved noncoding regions detected by comparative genomics of human and mouse.

Human Mouse

Region Chromosome Type Region Chromosome Type

14403123–14403392 21 5-UTR 75423398–75423667 16 5-UTR

16024234–16024284 21 5-UTR 76896397–76896447 16 Intron

13366269–13366445 21 Intron 53056429–53056605 16 3-UTR

15068493–15068565 21 Intron 19636875–19636947 16 Intergenic

(5)

shows the content of example records in the database, result-ing from the comparison of human and mouse genomes. The region column represents the actual locations of the conserved regions on the chromosome indicated by the next column. The type column indicates which functional region the corre-sponding conserved region belongs to.

Table 2 shows the statistics of conserved regions found by the comparative genomic workflow by using two sets of parameters: 1) at least 50% identity and 20 b/s length (q¼ 50%, l¼ 20 b/s) and 2) at least 70% identity and 100 b/s length (q¼ 70%, l ¼ 100 b/s). The number of conserved noncoding regions obtained was 22,791 and 3,209, respectively. With a stricter criterion, the number of conserved regions dropped drastically. As seen, most conserved noncoding regions were from the human introns of chromosome 21 and mouse introns of chromosome 16 and characterized by a low guanine cyto-sine (GC) content of about 36–37%. A very small number of conserved segments were found in regions of the human 5-UTR and intergenic regions, despite its large size, compared with other mouse noncoding regions.

Compared to the existing conserved noncoding databases in the Penn State database [13], the segments identified by our pipe-line for human and mouse genomes were quite different even

when the same selective criterion and alignment tool (BLASTZ) were used. There are two reasons to this: our pipeline included UTRs as candidate sequences for alignment; and we used the lat-est versions of sequences and annotation information from the Ensembl (version 36, 2006). The Penn State project used older Human Dec 2001 and MGSCv3 flat files. Further, the annota-tions used are different: for instance, the length of human chro-mosome 21 in the Ensembl (version 36) is 46,944,323, whereas the length in Human Dec 2001 is 44,620,379.

Performance

We investigated the speedups of the parallel implementation of the comparative genomic pipeline on the grid. Figure 2 shows speedups of our grid-enabled pipeline with and without distur-bances. The speedup of a cluster is measured by the sum of the running time of each processor in a corresponding fraction of tasks, divided by the running time when multiple processes are equally distributed in the cluster [18]. A disturbance is simu-lated by running an application in one of the 2.60-GHz CPU. It can be seen that the grid-enabled pipeline achieved significant speed improvements even under system disturbances.

As the computing resources on the grid are shared by all users, the execution of a job is usually disturbed by another

Table 2. Statistics of noncoding segments conserved in human and mouse genomes with at least l b/s long and q% identity.

Category

l= 20, q = 50 l= 100, q = 70

No. of Regions Mean Length GC% No. of Regions Mean Length GC%

Human Mouse 5-UTR 5-UTR 101 132 58 31 236 49 3-UTR Intron 24 100 53 6 198 55 Intergenic 3-UTR 5-UTR 6 412 38 3 744 39 3-UTR 701 90 43 71 154 45 Intron 168 85 38 21 178 37 Intergenic Intron 5-UTR 7 109 45 2 229 45 3-UTR 68 80 38 7 121 39 Intron 21,604 88 36 3,050 151 37 Intergenic 51 96 43 9 191 40 Intergenic 5-UTR 1 59 58 3-UTR 6 72 53 1 113 68 Intron 1 97 52 Intergenic 3 154 46 2 198 50 Total 22,741 88 37 3,209 153 37

A

major challenge in grid computing

is how to efficiently find available resources

and data and distribute the tasks among

the processing nodes.

IEEE ENGINEERING IN MEDICINE AND BIOLOGY MAGAZINE JULY/AUGUST 2009

(6)

application running at a node. A good implementation should be robust to such perturbations. We measure the extent of the effect by a disturbance on the performance of an application with the performance degradation ratio under disturbance (PDRD) [22]. Smaller PDRD values indicate a better robustness of the application to disturbances. Figure 3 shows the PDRD of our grid-enabled pipeline in comparison to the grid-enabled BLAST pipeline employing the static load balancing for job allocation. As seen, the use of dynamic load balancing has made the pipeline more robust to the disturbances on the grid.

Conclusions

Comparative genomics compares DNA sequences of different species at genome scale to discover preserved patterns across the species. The conserved patterns have evolved naturally and therefore should relate to an important biological function. Comparative genomics is useful in detecting novel genes or sequence segments that are likely to have regulatory functions. Here we present our preliminary work on building a comprehen-sive database of conserved noncoding segments across verte-brates. Such a database may contain yet undiscovered regulatory segments that researchers can look for regulatory function.

When large genomes like vertebrates are involved, compar-ative genomics applications become compute-intensive and therefore the need for high-computing resources and platforms arises. To increase the efficiency and speed of comparative genomics applications, we presented an implementation of Grid-BLASTZ and grid-enabled comparative genomic

workflow. Our aim is to identify conserved noncoding regions in vertebrate genomes, using already-available computational resources spanning across different sites, which may be dis-tantly located. As seen, the proposed grid implementation considerably speeded up the comparative workflow. Our implementation was robust to the disturbances by other appli-cations in the grid and highly efficient in an environment com-posed of diverse resources with heterogeneous characteristics. We created a relational database to store conserved seg-ments, including additional information such as location refer-ence and prior knowledge about conserved noncoding regions. Our workflow is incremental in the sense that the database can be updated when new vertebrate genomic sequences become available. Presently, the database contains conserved segments obtained from mouse and human genomes, but in future, other vertebrates such as chimp, chicken, zebrafish, tetraodon, and fugu will be added. In addition to the conserved segments gen-erated, we will allow researchers to compare chromosomes across genomes at different similarity and length parameters.

Our future work will involve a designing methodology for characterizing the conserved noncoding regions, which could be performed by genome-wide statistical and computational analy-sis. It involves finding how statistically significant these potential regulatory elements present within and across genomes, assessing the frequency of all elements in the databases across genomes, finding the similarity among elements, and creating new potential elements by merging the elements that are similar. These findings can be verified by using previously studied elements and regions

13 15 11 9 7 5 3 1 Number of Processors 15 13 11 9 7 5 3 1 Speedup

Speedup of the Grid-Enabled Pipeline Without Disturbance Speedup of the Grid-Enabled Pipeline with Disturbance

Fig. 2. Speedups of the grid-enabled comparative genomic pipeline when comparing human chromosome 21 and mouse chromosome 16 under a disturbance on the grid.

100 90 80 70 60 50 40 30 20 10 0 3 4 5 6 7 8 9 10 11 12 13 14 15 Number of Processors PDRD (%)

Grid-Enabled Pipeline with Dynamic Load Balancing Grid-Enabled Pipeline with Static Load Balancing Grid-Enabled Pipeline with Dynamic Load Balancing Grid-Enabled Pipeline with Static Load Balancing

Fig. 3. Demonstration of performance by PDRD of the grid-enabled BLAST under a disturbance on the grid.

BLASTZ

is found to be suitable

for aligning long noncoding sequences

(7)

to assign the biological functions to conserved noncoding sequen-ces. If closely related elements have not been previously found, the help of wet-lab experiments is needed.

Further, comparative genomics allows us to build phyloge-netic profiles and evolutionary rates of sequences. The segments with functional importance evolve at comparatively lower rate. Also, the conserved segments could be analyzed against the fea-tures of protein-coding sequences to investigate their functional-ity. For example, it has recently been found that codon-usage bias is different in different genes [25]. Our finding may be com-pared across the existing databases of regulatory elements for possible role of their regulatory functions. However, testing for possible regulatory functions of conserved noncoding sequences in wet-lab experiments will eventually be necessary before mak-ing strong conclusions about their functions.

Acknowledgments

The work is partly supported by grants to J.C. Rajapakse, by the Biomedical Research Council (grant no. 04/1/22/19/376) of Agency of Science and Technology Research, adminis-tered through the National Grid Office, Singapore, and by Sun Microsystems, Inc., USA.

Jagath C. Rajapakse received his M.Sc. and Ph.D. degrees in electrical and com-puter engineering from the University of Buffalo (USA). He was a visiting fellow at the National Institute of Mental Health (USA) and a visiting scientist at the Max Planck Institute of Brain and Cognitive Sci-ences (Germany). Currently, he is a profes-sor of computer engineering and the director of BioInformatics Research Center, Nanyang Technological University (NTU), Singapore. He is a visiting professor to the Department of Bio-logical Engineering, Massachusetts Institute of Technology (MIT). His research interests include neuroinformatics, bioin-formatics, modeling brain connectivity through functional brain imaging, and building pathways from gene and protein expressions obtained by microarrays and high-content micro-scopic imaging. He has authored or coauthored more than 225 peer-reviewed journal and conference papers and book chap-ters and was listed among the most cited scientists. He serves as an associate editor of IEEE Transactions on Medical Imag-ing, IEEE Transactions on Computational Biology and Bioin-formatics, and IEEE Transactions on Neural Networks and in editorial boards of several other journals.

Pooja received her B.Sc. degree in computer science from Birla Institute of Technology, India, and her M.Sc. (bioin-formatics) degree from NTU, Singapore. She worked as a research associate in BIRC at NTU. She is now working as a lead software developer at ITN, London. Chunxi Chen received his Ph.D. degree in computer science from NTU, Singapore. His research interests include parallel and distributed computing and sequence align-ment. He was with the BIRC, School of Computer Engineering, NTU.

Sy-Loi Ho received his B.Sc. degree in computer science from the Vietnam National University, Hanoi, in 2000 and Ph.D. degree in computer engineering from NTU, Singa-pore, in 2005. Since January 2005, he has been working in the BIRC, NTU, where he is currently a research fellow. He has coau-thored more than ten research papers in the area of bioinformatics, including computational techniques, to detect various signals and motifs in biological sequences and comparative genomics.

Address for Correspondence: Jagath C. Rajapakse, Bioin-formatics Research Center, Nanyang Technological Univer-sity, Singapore 639798. E-mail: asjagath@ntu.edu.sg.

References

[1] R. C. Hardison, ‘‘Comparative genomics,’’ PLoS Biol., vol. 1, p. e58, 2003. [2] L. A. Pennacchio and E. M. Rubin, ‘‘Genomic strategies to identify mamma-lian regulatory sequences,’’ Nat. Rev. Genet., vol. 2, no. 2, pp. 100–109, 2001. [3] L. A. Pennacchio and E. M. Rubin, ‘‘Comparative sequence tools and data-bases providing insights into the human genome,’’ J. Clin. Invest., vol. 111, no. 8, pp. 1099–1106, 2003.

[4] S. Levy, S. Hannenhalli, and C. Workman, ‘‘Enrichment of regulatory signals in conserved non-coding genomic sequence,’’ Bioinformatics, vol. 17, no. 10, pp. 871–877, 2001.

[5] E. T. Dermitzakis, A. Reymond, R. Lyle, N. Scamuffa, C. Ucla, and S. Deutsch, ‘‘Numerous potentially functional but non-genic conserved sequences on human chromosome 21,’’ Nature, vol. 420, no. 6915, pp. 578–582, 2002. [6] H. Guo and S. P. Moose, ‘‘Conserved noncoding sequences among cultivated cereal genomes identify candidate regulatory sequence elements and patterns of promoter evolution,’’ Plant Cell, vol. 15, no. 5, pp. 1143–1158, 2003.

[7] R. C. Hardison, ‘‘Conserved non-coding sequences are reliable guides to regu-latory elements,’’ Trends Genet., vol. 16, no. 9, pp. 369–372, 2000.

[8] G. G. Loots, R. M. Locksley, C. M. Blankespoor, Z. E. Wang, W. Miller, E. M. Rubin, and K. A. Frazer, ‘‘Identification of a coordinate regulator of interleu-kins 4, 13, and 5 by cross-species sequence comparisons,’’ Science, vol. 288, no. 5463, pp. 136–140, 2002.

[9] Ensembl project [Online]. Available: http://www.ensembl.org/index.html [10] BLASTZ alignment program [Online]. Available: http://www.bx.psu.edu/ miller_lab/

[11] S. Schwartz, W. J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. C. Hardison, D. Haussler, and W. Miller, ‘‘Human-mouse alignments with BLASTZ,’’ Genome Res., vol. 13, no. 1, pp. 103–107, 2003.

[12] I. Foster and C. Kesselman, Eds., The Grid: The Blueprint for New Comput-ing Infrastructure, 2nd ed. San Mateo, CA: Morgan Kaufmann, 2003.

[13] Conserved non-coding databases [Online]. Available: http://pipeline.lbl.gov/ cgi-bin/cnc

[14] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, ‘‘Basic local alignment search tool,’’ J. Mol. Biol., vol. 215, no. 3, pp. 403–410, 1990. [15] D. A. Pollard, C. M. Bergman, J. Stoye, S. E. Celniker, and M. B. Eisen, ‘‘Benchmarking tools for the alignment of functional noncoding DNA,’’ BMC Bioinformatics, vol. 5, no. 6, pp. 6–22, 2004.

[16] S. F. Altschul, T. L. Madden, A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, ‘‘Gapped BLAST and PSI-BLAST—A new generation of protein database search programs,’’ Nucleic Acids Res., vol. 25, no. 17, pp. 3389– 3402, 1997.

[17] B. Giardine, L. Elnitski, C. Riemer, L. Makalowska, S. Schwartz, and W. Miller, ‘‘GALA, a database for genomic sequence alignments and annota-tions,’’ Genome Res., vol. 13, no. 4, pp. 732–741, 2003.

[18] C. X. Chen and J. C. Rajapakse, ‘‘Grid-enable BLASTZ: Application to comparative genomics,’’ J. VLSI Signal Process, vol. 48, no. 3, pp. 301–309, 2007. [19] A. Krishnan, ‘‘GridBLAST: A globus-based high-throughput implementation of BLAST in a grid computing framework,’’ Concurrency Comput.: Pract. Exp., vol. 17, no. 13, pp. 1607–1623, 2005.

[20] Master-slave paradigm [Online]. Available: http://charm.cs.uiuc.edu/ research/masterSlave/

[21] C. A. Stewart, D. Hart, D. K. Berry, G. J. Olsen, E. A. Wernert, and F. W. Ischer, ‘‘Parallel implementation and performance of fastDNAml: A program for maximum likelihood phylogenetic inference,’’ in Proc. SC2001, Denver, CO, 2001, p. 32. [22] W. R. Zhu, Y. W. Niu, J. Z. Lu, C. Shen, and G. R. Gao, ‘‘A cluster-based solution for high performance Hmmpfam using EARTH execution model,’’ in Proc. Cluster 03, 2003, p. 30.

[23] GLOBUS project [Online]. Available: http://www.globus.org

[24] Sun grid engine project [Online]. Available: http://gridengine.sunsource.net/ [25] J. Ma, M. N. Nguyen, and J. C. Rajapakse, ‘‘Gene classification using codon usage pattern analysis and support vector machines,’’ IEEE Trans. Comput. Biol. Bioinformatics, vol. 6, no. 1, pp. 134–143, Jan./Mar. 2009.

IEEE ENGINEERING IN MEDICINE AND BIOLOGY MAGAZINE JULY/AUGUST 2009

Figure

Fig. 1. The workflow of the grid-enabled pipeline using dy- dy-namic load balancing.
Table 1. Examples of conserved noncoding regions detected by comparative genomics of human and mouse.
Table 2 shows the statistics of conserved regions found by the comparative genomic workflow by using two sets of parameters: 1) at least 50% identity and 20 b/s length (q ¼ 50%, l ¼ 20 b/s) and 2) at least 70% identity and 100 b/s length (q ¼ 70%, l ¼ 100
Fig. 3. Demonstration of performance by PDRD of the grid- grid-enabled BLAST under a disturbance on the grid.

Références

Documents relatifs

Mechanisms include disclosure of financial dealings to allow public accountability; the employment of the tax system to broaden the base of party funds; state subsidy of

They both work for the Ministry of Employment and have just been asked to design the questionnaire on employee job satisfaction that will be given to workers in order to improve

This could lead to huge orders for electronics firms, 71___ is why, in his recent jobs speech, President Barack Obama called for $1 billion of extra _72__ to speed up

[r]

FOREWORD.. Water and food, food and water, this seemingly uneven couple cannot be separated. There are regions in the world with abundant water resources but without food

Actions should be chosen that are proportional to the seriousness of the potential harm, with consideration of their positive and negative consequences, and with an assessment of

In case Krein’s strings with spectral functions of polynomial growth a necessary and sufficient condition for the Krein’s correspondence to be continuous is given.. One can regard dm

Key words: gene order comparison, distance between permutations, transposition distance, exact distribution, random uniform permutation, conserved genomic regions, reference