Posted to common-user@hadoop.apache.org by Franco Nazareno <fr...@gmail.com> on 2011/03/28 04:51:14 UTC

Hadoop for Bioinformatics

Good day everyone!

 

First, I want to congratulate the group on this wonderful project. It has
opened up new ideas and solutions in computing and technology. I'm excited
to learn more about it and to discover what is possible with Hadoop and its
components.

 

I'd like to ask a question related to my studies. I'm currently a PhD
student in Bioinformatics, and my question is: can you give me a (rough)
idea of whether it's possible to use a Hadoop cluster to perform DNA
sequence alignment? My basic idea is something like a string search over
huge data files stored in HDFS, with the application using MapReduce for the
searching and computing. As the Hadoop paradigm implies, it isn't well
suited to interactive applications, and I think this kind of searching is a
"write-once, read-many" application.

 

I hope you don't mind my question. It would be great to hear your comments
or suggestions about this.

 

Thanks and more power!

Franco


Re: Hadoop for Bioinformatics

Posted by "Tsz Wo (Nicholas), Sze" <s2...@yahoo.com>.
Hi Franco,

I recall that there are some Hadoop-Blast research projects. For example,
see:


- http://www.cs.umd.edu/Grad/scholarlypapers/papers/MichaelSchatz.pdf
- http://salsahpc.indiana.edu/tutorial/hadoopblast.html

Nicholas




Re: Hadoop for Bioinformatics

Posted by Bibek Paudel <et...@gmail.com>.
On Mon, Mar 28, 2011 at 4:51 AM, Franco Nazareno
<fr...@gmail.com> wrote:
> Good day everyone!
>
>
>
> First, I want to congratulate the group for this wonderful project. It did
> open up new ideas and solutions in computing and technology-wise. I'm
> excited to learn more about it and discover possibilities using Hadoop and
> its components.
>
>
>
> Well I just want to ask this with regards to my study. Currently I'm
> studying my PhD course in Bioinformatics, and my question is that can you
> give me a (rough) idea if it's possible to use Hadoop cluster in achieving a
> DNA sequence alignment? My basic idea for this goes something like a string
> search out of a huge data files stored in HDFS, and the application uses
> MapReduce in searching and computing. As the Hadoop paradigm implies, it
> doesn't serve well in interactive applications, and I think this kind of
> searching is a "write-once, read-many" application.

Are you looking for something like a "distributed grep"? The Hadoop
package comes with some examples, and 'grep' is one of them.

Please see: http://wiki.apache.org/hadoop/Grep and
http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html .
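
As a rough illustration of the distributed grep idea, here is a minimal
Python sketch of the map and reduce steps, run locally in one process
rather than on a cluster. It is not the code of the Hadoop example; the
function names, the sample reads, and the pattern are all made up for
illustration. The map step emits each regex match with a count of 1, and
the reduce step sums the counts per matched string.

```python
import re
from collections import defaultdict


def map_matches(line, pattern):
    """Map step: emit (matched_string, 1) for every regex match in one line."""
    for match in re.finditer(pattern, line):
        yield match.group(0), 1


def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one matched string."""
    return key, sum(values)


def run_local(lines, pattern):
    """Simulate map, shuffle (group by key), and reduce in a single process."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_matches(line, pattern):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Hypothetical input: each string stands for one line of a file in HDFS.
reads = ["ACGTACGTGA", "TTACGGACGT", "GGGCCC"]
counts = run_local(reads, r"ACG[TG]")
print(counts)
```

On a real cluster the framework would do the grouping between the two
steps, and each mapper would see only the lines of its own input split.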

Let us know if you are looking for something else.

-b


Re: Hadoop for Bioinformatics

Posted by Kiss Tibor <ki...@gmail.com>.
Hi Franco,

We are using Hadoop for next-gen sequence alignment.
We previously had a solution built on a classic programming model, but we
are currently moving our software services to the M/R model based on Hadoop.
We have ported most of our classic algorithms to Hadoop, and I can say that
everything is becoming more manageable.

We are going with Hadoop in the cloud and/or in the datacenter. Another
challenge, especially in the cloud, is how you transfer the data, because
in bioinformatics the amount of data is usually very large.
I am currently working on an open-source version of Amazon multipart upload,
which will be available in the next release of
JClouds<http://code.google.com/p/jclouds/wiki/BlobStore>;
here are the starting
ideas<http://www.slideshare.net/jclouds/big-data-in-real-life-a-study-on-s3-multipart-uploads>
and also a sample
client app<https://github.com/jclouds/jclouds-examples/tree/master/blobstore-largeblob>.
If you want to follow new results on
twitter<http://twitter.com/#%21/tiborkisstibor>,
you are invited. I plan to release a paper with results of the data transfer
operations based on this open-source approach.
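
To make the multipart idea concrete: a large object is split into numbered
parts that can be sent independently, and the store assembles them
afterwards. The Python sketch below covers only the planning step. It is
not the jclouds implementation; `plan_parts` is a hypothetical helper, and
real services impose limits on part size and part count that this sketch
does not check.

```python
def plan_parts(total_size, part_size):
    """Return (part_number, offset, length) tuples that cover total_size bytes.

    Part numbers start at 1; the last part may be shorter than part_size.
    """
    if total_size <= 0 or part_size <= 0:
        raise ValueError("total_size and part_size must be positive")
    parts = []
    offset = 0
    number = 1
    while offset < total_size:
        length = min(part_size, total_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts


# Hypothetical sizes in bytes, kept tiny so the result is easy to read.
plan = plan_parts(25, 10)
print(plan)
```

Each tuple tells an uploader which byte range to read and which part number
to attach to it, so the parts can be sent in parallel and retried one at a
time.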

Also, we will soon release a version of our cloud-based service stack that
is fully based on Hadoop.

Tibor


Re: Hadoop for Bioinformatics

Posted by Luca Pireddu <pi...@crs4.it>.
On March 28, 2011 04:51:14 Franco Nazareno wrote:
> 
> Well I just want to ask this with regards to my study. Currently I'm
> studying my PhD course in Bioinformatics, and my question is that can you
> give me a (rough) idea if it's possible to use Hadoop cluster in achieving
> a DNA sequence alignment? My basic idea for this goes something like a
> string search out of a huge data files stored in HDFS, and the application
> uses MapReduce in searching and computing. As the Hadoop paradigm implies,
> it doesn't serve well in interactive applications, and I think this kind
> of searching is a "write-once, read-many" application.

I'll add some relevant citations:

An overview of the Hadoop/MapReduce/HBase framework and its current 
applications in bioinformatics
http://www.biomedcentral.com/1471-2105/11/S12/S1


Biodoop: Bioinformatics on Hadoop
http://www.computer.org/portal/web/csdl/doi/10.1109/ICPPW.2009.37


CloudBurst: highly sensitive read mapping with MapReduce
http://bioinformatics.oxfordjournals.org/content/25/11/1363.short


CloudBLAST: Combining MapReduce and Virtualization on Distributed Resources 
for Bioinformatics Applications
http://www.computer.org/portal/web/csdl/doi/10.1109/eScience.2008.62


-- 
Luca Pireddu
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
Pula 09010 (CA), Italy
Tel:  +39 0709250452

RE: Hadoop for Bioinformatics

Posted by Evert Lammerts <Ev...@sara.nl>.
> The short answer is yes!  At CRS4 we are working on this very problem.
>
> We have implemented a Hadoop-based workflow to perform short read alignment
> to support DNA sequencing activities in our lab.  Its alignment operation is
> based on (and therefore equivalent to) BWA.  We have written a paper about it
> which will appear in the coming months, and we are working on an open source
> release, but alas we haven't completed that task yet.
>
> We have also implemented a Hadoop-based distributed blast alignment program,
> in case you're working with long fragments.  It's currently being used by our
> collaborators to align viral DNA segments.
>
> In either case, if you're interested we can let you have an advance release
> of either program so you can try them out.

Hi Luca,

Could you send me an advance release of your software? I work for the Dutch national center for scientific computing, and I will give a workshop on Hadoop for bioinformatics at a large BI conference (http://www.nbic.nl/about-nbic/nbic-conferences/nbic-conference-2011/). Lots of people there work with BWA and BLAST type applications (among others in the BBMRI project, which I think CRS4 is involved in as well). So BWA on Hadoop could be a great case study.

Let me know!
Cheers,
Evert


Re: Hadoop for Bioinformatics

Posted by Luca Pireddu <pi...@crs4.it>.
On March 28, 2011 04:51:14 Franco Nazareno wrote:
> Good day everyone!

And a good day to you Franco!

> First, I want to congratulate the group for this wonderful project. It did
> open up new ideas and solutions in computing and technology-wise. I'm
> excited to learn more about it and discover possibilities using Hadoop and
> its components.
> 
> 
> Well I just want to ask this with regards to my study. Currently I'm
> studying my PhD course in Bioinformatics, and my question is that can you
> give me a (rough) idea if it's possible to use Hadoop cluster in achieving
> a DNA sequence alignment? My basic idea for this goes something like a
> string search out of a huge data files stored in HDFS, and the application
> uses MapReduce in searching and computing. As the Hadoop paradigm implies,
> it doesn't serve well in interactive applications, and I think this kind
> of searching is a "write-once, read-many" application.
> 
> 
> 
> I hope you don't mind my question. And it'll be great hearing your comments
> or suggestions about this.
> 
> 
> 
> Thanks and more power!
> 
> Franco

The short answer is yes!  At CRS4 we are working on this very problem.  

We have implemented a Hadoop-based workflow to perform short read alignment to 
support DNA sequencing activities in our lab.  Its alignment operation is 
based on (and therefore equivalent to) BWA.  We have written a paper about it 
which will appear in the coming months, and we are working on an open source 
release, but alas we haven't completed that task yet.  

We have also implemented a Hadoop-based distributed BLAST alignment program,
in case you're working with long fragments.  It's currently being used by our
collaborators to align viral DNA segments.


In either case, if you're interested we can let you have an advance release of 
either program so you can try them out.
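
To give a feel for why read alignment fits the map step well, here is a toy
Python sketch. It is not the CRS4 software and not BWA: it does naive
exact-match search only, and the function names, reference string, and
reads are invented for illustration. Each read is handled independently of
the others, which is what lets the work be spread across many mappers.

```python
def find_exact(reference, read):
    """Return every 0-based offset at which read occurs exactly in reference."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


def map_read(reference, read_id, read):
    """Map step: emit (read_id, offset) for each exact hit of one read."""
    for position in find_exact(reference, read):
        yield read_id, position


# Hypothetical data: a tiny reference and three short reads.
reference = "ACGTACGTTAGC"
reads = {"r1": "ACGT", "r2": "TAGC", "r3": "GGGG"}

hits = {}
for read_id, read in reads.items():
    hits[read_id] = [pos for _, pos in map_read(reference, read_id, read)]
print(hits)
```

A real aligner tolerates mismatches and gaps and uses an index over the
reference instead of scanning it, but the per-read independence is the same.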


-- 
Luca Pireddu
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
Pula 09010 (CA), Italy
Tel:  +39 0709250452