Posted to common-user@hadoop.apache.org by Matt Wood <ma...@sanger.ac.uk> on 2008/04/28 16:50:00 UTC

Distributed indexing

Hello all,

I was wondering if someone in the know could tell me about the current  
state of play with building and searching large indices with hadoop?

Some background: I work on the human genome project, and we're  
currently setting up a new facility based around the next generation  
of DNA sequencing. We're currently producing around 50Tb of data a  
week, some of which we would like to provide fast access to via an  
index.

Having read up on hadoop, it appears that it could play a central part  
in our infrastructure, and that others have tried (and succeeded) in  
building a distributed indexing and retrieval system with hadoop. I'd  
be interested if anyone could point me in the right direction to more  
information or examples of such a system. Yahoo! (with webmap) seems  
to be close to the sort of thing we would need.

Would map/reduce be a suitable approach for indexing _and_ retrieval,
or just indexing? Would Solr/Lucene be a good fit? Any help or
pointers to more information would be much appreciated!

If you would like any more details, I'd be more than happy to supply  
them!

Many thanks,

~ Matt


-------------

Matt Wood
Sequencing Informatics // Production Software
www.sanger.ac.uk



-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.

Re: Distributed indexing

Posted by Samuel Guo <gu...@gmail.com>.

Map/reduce is a suitable approach for indexing large document
collections, but I don't know whether it is suitable for retrieval. Have
a look at *Nutch* for distributed searching.

Under the hadoop/contrib directory there is an *Index* package. It may
be helpful :)
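
To make the indexing half concrete, here is a minimal sketch in plain
Python (no Hadoop involved) of how a map step and a reduce step can
build an inverted index. The function names and the toy documents are
made up for illustration, and the shuffle that the framework would
normally perform between the two steps is simulated with a dictionary.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def shuffle(pairs):
    """Group doc ids by term, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_phase(term, doc_ids):
    """Turn the grouped doc ids for one term into a sorted posting list."""
    return term, sorted(set(doc_ids))


def build_index(docs):
    """Run map, shuffle and reduce over a dict of doc_id -> text."""
    pairs = [pair for doc_id, text in docs.items()
             for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())


# Hypothetical toy input: three tiny "documents".
docs = {
    "doc1": "genome sequencing run",
    "doc2": "sequencing index",
    "doc3": "genome index build",
}
index = build_index(docs)
```

In a real job each reducer would write its slice of the posting lists
out as an index shard, which is the part a package such as the contrib
*Index* one would be expected to handle.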

Matt Wood wrote:
> Hello all,
>
> I was wondering if someone in the know could tell me about the current 
> state of play with building and searching large indices with hadoop?
>
> Some background: I work on the human genome project, and we're 
> currently setting up a new facility based around the next generation 
> of DNA sequencing. We're currently producing around 50Tb of data a 
> week, some of which we would like to provide fast access to via an index.
>
> Having read up on hadoop, it appears that it could play a central part 
> in our infrastructure, and that others have tried (and succeeded) in 
> building a distributed indexing and retrieval system with hadoop. I'd 
> be interested if anyone could point me in the right direction to more 
> information or examples of such a system. Yahoo! (with webmap) seems 
> to be close to the sort of thing we would need.
>
> Would map/reduce be a suitable approach for indexing _and_ retrieval, 
> or just indexing? Would Solr/Lucene be a good fit? Any help or 
> pointers to more information would be much appreciated!
>
> If you would like any more details, I'd be more than happy to supply 
> them!
>
> Many thanks,
>
> ~ Matt
>
>
> -------------
>
> Matt Wood
> Sequencing Informatics // Production Software
> www.sanger.ac.uk
>
>
>

Re: Distributed indexing

Posted by Samuel Guo <gu...@gmail.com>.

Ted Dunning wrote:
> Check out the bailey and katta projects on sourceforge.
>   
I get nothing when checking out the katta project on sourceforge :(
> Also take a look at Nutch.
>
> Hadoop is certainly good for indexing and it isn't that hard to put
> distributed search alongside hadoop with indexes being pulled from HDFS to
> local storage or RAM for speed.

Re: Distributed indexing

Posted by Ted Dunning <td...@veoh.com>.

Check out the bailey and katta projects on sourceforge.

Also take a look at Nutch.

Hadoop is certainly good for indexing and it isn't that hard to put
distributed search alongside hadoop with indexes being pulled from HDFS to
local storage or RAM for speed.
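
As a rough illustration of putting search alongside the indexing job,
the sketch below queries several index shards in parallel and merges
their top hits. It is plain Python under stated assumptions: the
in-memory dictionaries stand in for index shards that have already been
copied from HDFS to local storage or RAM, and the names SHARDS,
search_shard and search, as well as the scores, are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Each shard is a local inverted index: term -> {doc_id: score}.
# In practice each would live on a separate search node.
SHARDS = [
    {"gene": {"doc1": 0.9, "doc2": 0.4}},
    {"gene": {"doc3": 0.7}, "exon": {"doc4": 0.5}},
    {"exon": {"doc5": 0.8}},
]


def search_shard(shard, term, k):
    """Return up to k best (score, doc_id) hits from a single shard."""
    hits = shard.get(term, {})
    return heapq.nlargest(k, ((score, doc_id) for doc_id, score in hits.items()))


def search(term, k=2):
    """Scatter the query to every shard, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda shard: search_shard(shard, term, k), SHARDS)
        merged = [hit for part in partials for hit in part]
    # Keep only the k best hits overall, best first.
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, merged)]


top = search("gene")
```

Each shard only has to return its own top k hits, so the merge step
stays cheap no matter how large the individual shards are.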


On 4/28/08 7:50 AM, "Matt Wood" <ma...@sanger.ac.uk> wrote:

> Hello all,
> 
> I was wondering if someone in the know could tell me about the current
> state of play with building and searching large indices with hadoop?
> 
> Some background: I work on the human genome project, and we're
> currently setting up a new facility based around the next generation
> of DNA sequencing. We're currently producing around 50Tb of data a
> week, some of which we would like to provide fast access to via an
> index.
> 
> Having read up on hadoop, it appears that it could play a central part
> in our infrastructure, and that others have tried (and succeeded) in
> building a distributed indexing and retrieval system with hadoop. I'd
> be interested if anyone could point me in the right direction to more
> information or examples of such a system. Yahoo! (with webmap) seems
> to be close to the sort of thing we would need.
> 
> Would map/reduce be a suitable approach for indexing _and_ retrieval,
> or just indexing? Would Solr/Lucene be a good fit? Any help or
> pointers to more information would be  much appreciated!
> 
> If you would like any more details, I'd be more than happy to supply
> them!
> 
> Many thanks,
> 
> ~ Matt
> 
> 
> -------------
> 
> Matt Wood
> Sequencing Informatics // Production Software
> www.sanger.ac.uk
> 
>
