
Posted to common-user@hadoop.apache.org by Kylie McCormick <ky...@gmail.com> on 2008/07/10 22:43:26 UTC

Hadoop Architecture Question: Distributed Information Retrieval

Hello!
My name is Kylie McCormick, and I'm currently building a distributed
information retrieval package with Hadoop, based on my previous work with
other middleware such as OGSA-DAI. I've been developing a design that fits
the structures of the other systems I have put together for distributed IR.

Essentially, each service (search) returns a ResultSet, which is then merged
into a single FinalSet object as soon as it is returned to the main program.
Merging a ResultSet generally entails rescoring the documents and putting
them in the same OrderedList as documents from other services that have also
been rescored.
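
As a rough illustration of the merge step described above, here is a minimal
sketch in plain Python rather than Hadoop's Java API. The names Doc, rescore
and merge are hypothetical, and min-max normalisation is only an assumed
stand-in for the rescoring, which the post does not specify.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    score: float


def rescore(result_set):
    """Min-max normalise one service's scores into [0, 1] (assumed scheme)."""
    if not result_set:
        return []
    lo = min(d.score for d in result_set)
    hi = max(d.score for d in result_set)
    span = hi - lo
    # If every score is equal, give each document the top score.
    return [Doc(d.doc_id, (d.score - lo) / span if span else 1.0)
            for d in result_set]


class FinalSet:
    """Single ordered list holding rescored documents from all services."""

    def __init__(self):
        self.docs = []

    def merge(self, result_set):
        # Rescore the incoming ResultSet, then keep the list ordered by score.
        self.docs.extend(rescore(result_set))
        self.docs.sort(key=lambda d: d.score, reverse=True)


final = FinalSet()
final.merge([Doc("a1", 12.0), Doc("a2", 3.0)])                  # service A
final.merge([Doc("b1", 0.9), Doc("b2", 0.5), Doc("b3", 0.1)])   # service B
ranked = [d.doc_id for d in final.docs]
```

Because each ResultSet is rescored on its own before merging, services that
report scores on very different scales can still share one ordered list.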

I have redesigned this so that in the Map phase a service is invoked and its
ResultSet is collected by the OutputCollector. In the Reduce phase, I hope
to merge all the results together. Is it possible to have the reduce step
produce one (and only one) output object?

Thank you,
Kylie

-- 
The Circle of the Dragon -- unlock the mystery that is the dragon.
http://www.blackdrago.com/index.html

"Light, seeking light, doth the light of light beguile!"
-- William Shakespeare's Love's Labor's Lost

Re: Hadoop Architecture Question: Distributed Information Retrieval

Posted by Miles Osborne <mi...@inf.ed.ac.uk>.

If you tell Hadoop to use a single reducer, it should produce a single file
of output.
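
To make that concrete: in Hadoop's classic Java API the reducer count is set
with JobConf.setNumReduceTasks(1). Note that even with one reducer, reduce is
still called once per distinct key, so getting a single merged object also
requires every map output to share one key. The sketch below simulates that
idea in plain Python, not Hadoop itself; MERGE_KEY, map_service, reduce_merge,
run_job and the sample results are all hypothetical.

```python
from collections import defaultdict

MERGE_KEY = "all"  # hypothetical constant key shared by every map output


def map_service(service_name, query):
    """Stand-in for invoking one search service; emits (key, ResultSet)."""
    fake_results = {
        "svcA": [("a1", 0.9), ("a2", 0.2)],
        "svcB": [("b1", 0.7)],
    }
    yield MERGE_KEY, fake_results[service_name]


def reduce_merge(key, result_sets):
    """Merge every ResultSet for one key into a single ordered list."""
    merged = [doc for rs in result_sets for doc in rs]
    merged.sort(key=lambda d: d[1], reverse=True)
    return key, merged


def run_job(services, query):
    # Group map outputs by key, as the shuffle phase would.
    groups = defaultdict(list)
    for service in services:
        for key, value in map_service(service, query):
            groups[key].append(value)
    # One reduce call per distinct key; with a single key, one output.
    return [reduce_merge(key, values) for key, values in groups.items()]


outputs = run_job(["svcA", "svcB"], "dragons")
```

With a single shared key, outputs holds exactly one merged list, which is the
role the FinalSet plays in the original design.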

By the way, I presume you know about Nutch?

http://lucene.apache.org/nutch/

This is a distributed IR system built using Hadoop.

Miles

-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.

Re: Hadoop Architecture Question: Distributed Information Retrieval

Posted by Kylie McCormick <ky...@gmail.com>.

Thanks for the replies! If I use a single reducer, however, would it be
possible for there to be only one object (FinalSet) to which the Reduce
function merges? If not, I could redo the structure of the program, but I
was hoping to maintain it as much as possible.

Yes, I am aware of Nutch, and I've been using some of the documentation to
help with my new design. It's quite exciting! I'm hoping to have another
Java package with which to continue work on large TREC tracks.

My work with OGSA-DAI can be seen at
http://snowy.arsc.alaska.edu:8080/edu/arsc/multisearch/ if you're
interested, and by the end of the summer I hope to have a write-up that
discusses the differences (especially in performance) between the two. The
system from last year was used on this year's TREC collection (with 1,000
services and 10,000 queries) and performed fairly well. I'm hoping Hadoop
will make more sense and run faster.

Thank you,
Kylie

On Thu, Jul 10, 2008 at 1:47 PM, Steve Loughran <st...@apache.org> wrote:

> Kylie McCormick wrote:
>
>> Hello!
>> My name is Kylie McCormick, and I'm currently working on creating a
>> distributed information retrieval package with Hadoop based on my previous
>> work with other middlewares like OGSA-DAI. I've been developing a design
>> that works with the structures of the other systems I have put together
>> for
>> distributed IR.
>>
>
>
> It would be interesting to see your write-up of the different experiences
> that OGSA-DAI's storage model offers versus that of Hadoop.
>
> -steve
>

Re: Hadoop Architecture Question: Distributed Information Retrieval

Posted by Steve Loughran <st...@apache.org>.

Kylie McCormick wrote:
> Hello!
> My name is Kylie McCormick, and I'm currently working on creating a
> distributed information retrieval package with Hadoop based on my previous
> work with other middlewares like OGSA-DAI. I've been developing a design
> that works with the structures of the other systems I have put together for
> distributed IR.


It would be interesting to see your write-up of the different
experiences that OGSA-DAI's storage model offers versus that of Hadoop.

-steve