Posted to common-user@hadoop.apache.org by Matt Wood <ma...@sanger.ac.uk> on 2008/04/28 16:50:00 UTC
Distributed indexing
Hello all,
I was wondering if someone in the know could tell me about the current
state of play with building and searching large indices with hadoop?
Some background: I work on the human genome project, and we're
currently setting up a new facility based around the next generation
of DNA sequencing. We're currently producing around 50Tb of data a
week, some of which we would like to provide fast access to via an
index.
Having read up on hadoop, it appears that it could play a central part
in our infrastructure, and that others have tried (and succeeded) in
building a distributed indexing and retrieval system with hadoop. I'd
be interested if anyone could point me in the right direction to more
information or examples of such a system. Yahoo! (with webmap) seems
to be close to the sort of thing we would need.
Would map/reduce be a suitable approach for indexing _and_ retrieval,
or just indexing? Would Solr/Lucene be a good fit? Any help or
pointers to more information would be much appreciated!
If you would like any more details, I'd be more than happy to supply
them!
Many thanks,
~ Matt
-------------
Matt Wood
Sequencing Informatics // Production Software
www.sanger.ac.uk
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
Re: Distributed indexing
Posted by Samuel Guo <gu...@gmail.com>.
Map/reduce is a suitable approach for indexing large document
collections, but I don't know whether it is suitable for retrieval. For
distributed searching, take a look at *Nutch*.
There is also an *Index* package under the hadoop/contrib directory. It
may be helpful :)
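To illustrate why map/reduce fits the indexing half, here is a minimal
sketch of an inverted-index build, run in a single process. It is not
code from Nutch or from the hadoop/contrib Index package; the function
names (map_phase, reduce_phase, build_index) and the sample documents are
made up for illustration, and the grouping step stands in for the shuffle
that the framework would normally perform between map and reduce.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per whitespace-separated term."""
    for term in text.split():
        yield term, doc_id


def reduce_phase(term, doc_ids):
    """Collapse every doc id seen for a term into a sorted posting list."""
    return term, sorted(set(doc_ids))


def build_index(docs):
    """Run map, group by key (the 'shuffle'), then reduce, all in-process."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, emitted_id in map_phase(doc_id, text):
            grouped[term].append(emitted_id)
    return dict(reduce_phase(term, ids) for term, ids in grouped.items())


# Hypothetical sample input: document id -> text to be indexed.
docs = {"read1": "ACGT TTGA", "read2": "TTGA GGCC"}
index = build_index(docs)
```

Each mapper only needs its own documents and each reducer only needs the
pairs for its own terms, which is what lets the work spread across many
machines. Retrieval is a different access pattern (many small, low-latency
lookups), which is why it is usually served outside map/reduce.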
Re: Distributed indexing
Posted by Samuel Guo <gu...@gmail.com>.
Ted Dunning wrote:
> Check out the bailey and katta projects on sourceforge.
>
I get nothing when checking out the katta project on sourceforge :(
> Also take a look at Nutch.
>
> Hadoop is certainly good for indexing and it isn't that hard to put
> distributed search alongside hadoop with indexes being pulled from HDFS to
> local storage or RAM for speed.
Re: Distributed indexing
Posted by Ted Dunning <td...@veoh.com>.
Check out the bailey and katta projects on sourceforge.
Also take a look at Nutch.
Hadoop is certainly good for indexing and it isn't that hard to put
distributed search alongside hadoop with indexes being pulled from HDFS to
local storage or RAM for speed.
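A minimal sketch of the serving side described above, assuming each
search node already holds its index shard in RAM: a query is fanned out to
every shard and the partial results are merged. This is not the bailey,
katta, or Nutch API; the names (search_shard, search) and the sample
shards are hypothetical, and plain dictionaries stand in for index shards
that would have been copied out of HDFS.

```python
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, term):
    """Look up one term in a single shard's in-memory posting lists."""
    return shard.get(term, [])


def search(shards, term):
    """Fan the query out to every shard in parallel and merge the hits."""
    if not shards:
        return []
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: search_shard(shard, term), shards))
    return sorted({doc_id for hits in partials for doc_id in hits})


# Hypothetical shards: each dict stands in for one index shard held in RAM.
shards = [
    {"TTGA": ["read1"], "ACGT": ["read1"]},
    {"TTGA": ["read2"], "GGCC": ["read2"]},
]
hits = search(shards, "TTGA")
```

In a real deployment each shard would sit on a different machine and the
fan-out would be a network call, but the scatter-then-merge shape stays
the same.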