Posted to common-user@hadoop.apache.org by Michael Schatz <ms...@umiacs.umd.edu> on 2009/04/09 06:18:46 UTC

CloudBurst: Hadoop for DNA Sequence Analysis

Hadoop Users,

I just wanted to announce that my Hadoop application, CloudBurst, is
available as open source at:
http://cloudburst-bio.sourceforge.net

In a nutshell, it is an application for mapping millions of short DNA
sequences to a reference genome, for example to map out the differences
between one individual's genome and the reference. As you might imagine,
this is a very data-intensive problem, but Hadoop enables the application
to scale up linearly to large clusters.

A full description of the program is available in the journal
Bioinformatics:
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btp236

I also wanted to take this opportunity to thank everyone on this mailing
list. The discussions posted were essential for navigating the ins and outs
of Hadoop during the development of CloudBurst.

Thanks everyone!

Michael Schatz

http://www.cbcb.umd.edu/~mschatz

RE: CloudBurst: Hadoop for DNA Sequence Analysis

Posted by Dmitry Pushkarev <um...@stanford.edu>.

As a matter of fact, it is nowhere near data intensive. It does take
gigabytes of input data to process, but it is mostly RAM and CPU
intensive, although post-processing of alignment files is exactly where
Hadoop excels. At least as far as I understand, the majority of the time
is spent on DP (dynamic programming) alignment, whereas navigation in seed
space and the N*log(N) sort require only a fraction of that time. That was
my experience applying a Hadoop cluster to sequencing human genomes.
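
To illustrate the cost split, here is a minimal Python sketch of a generic
seed-and-extend read mapper. It is not CloudBurst's code; the function names,
the non-overlapping seed scheme, and the plain edit-distance scoring are all
assumptions chosen for brevity. The seed lookup is a dictionary hit per seed,
while every candidate position pays for a full DP table of roughly
len(read) * len(read) cells.

```python
from collections import defaultdict


def build_seed_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def edit_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def map_read(read, reference, index, k, max_diffs):
    """Return sorted (position, differences) pairs for one read."""
    # Seed phase: cheap dictionary lookups on non-overlapping k-mers.
    candidates = set()
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    # Extend phase: one DP table per candidate, which dominates the runtime.
    results = []
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        diffs = edit_distance(read, window)
        if diffs <= max_diffs:
            results.append((start, diffs))
    return results


if __name__ == "__main__":
    reference = "ACGTACGTTAGCCGATAGGCTA"
    index = build_seed_index(reference, 4)
    # The read matches reference[8:16] except for its last base.
    print(map_read("TAGCCGAA", reference, index, 4, 1))
```

In a Hadoop setting one would presumably emit seeds as keys so that matching
read and reference seeds meet in the same reducer, but that mapping onto
map/reduce is an assumption here and is not shown.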



---
Dmitry Pushkarev
+1-650-644-8988

-----Original Message-----
From: michael.schatz@gmail.com [mailto:michael.schatz@gmail.com] On Behalf
Of Michael Schatz
Sent: Wednesday, April 08, 2009 9:19 PM
To: core-user@hadoop.apache.org
Subject: CloudBurst: Hadoop for DNA Sequence Analysis
