Posted to dev@lucene.apache.org by Paul Smith <ps...@aconex.com> on 2005/08/04 07:38:15 UTC

Map-Reduce

I've been reading the Nutch MapReduce stuff[1], and the original  
Google paper [2].

I know there's a mapreduce branch in the nutch project, but is there  
any plan/talk of perhaps integrating something like that directly  
into the Lucene API?  For projects that need a lower-level API like  
Lucene, rather than the crawl-like nature of Nutch, the potential to  
index lots of information in an efficient manner is very appealing  
indeed.
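
For readers new to the model, the indexing use case above can be sketched as a map step that emits (term, document id) pairs and a reduce step that groups them into postings. This is only a minimal in-memory illustration, assuming whitespace/punctuation tokenization; the class and method names are hypothetical, and it uses neither the Nutch mapred packages nor the Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical in-memory sketch: map -> group -> reduce building an inverted index. */
class InvertedIndexSketch {

    /** Map step: emit one (term, docId) pair for every token in a document. */
    static List<Map.Entry<String, Integer>> map(int docId, String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, docId));
            }
        }
        return pairs;
    }

    /** Reduce step: group pairs by term and collapse each group to a sorted set of doc ids. */
    static Map<String, TreeSet<Integer>> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, TreeSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            index.computeIfAbsent(pair.getKey(), k -> new TreeSet<>()).add(pair.getValue());
        }
        return index;
    }

    /** Runs the map step over every document, then reduces all emitted pairs. */
    static Map<String, TreeSet<Integer>> build(List<String> docs) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            pairs.addAll(map(docId, docs.get(docId)));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        List<String> docs = List.of("Lucene indexes text", "Nutch crawls and indexes");
        System.out.println(build(docs));
    }
}
```

In a distributed setting the map calls would run on different machines and the grouping would happen in a shuffle between the two steps; here everything runs in one process purely to show the shape of the computation.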

I'm not suggesting this is _easy_, just curious about what folks on the
Lucene side of things think.  Perhaps a chance to refactor a shared
library out of Nutch?

I would love to hear anyone's thoughts on the matter.

cheers,

Paul Smith

[1] http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf
[2] http://labs.google.com/papers/mapreduce-osdi04.pdf

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Map-Reduce

Posted by Paul Smith <ps...@aconex.com>.
On 05/08/2005, at 4:10 AM, Doug Cutting wrote:

> Doug Cutting wrote:
>
>> Perhaps we need to factor Nutch into two projects, one with NDFS  
>> and MapReduce and the other with the search-specific code.  This  
>> falls almost exactly on package lines.  The packages  
>> org.apache.nutch.{io,ipc,fs,ndfs,mapred} are not dependent on the  
>> rest of Nutch.
>>
>
> FYI, over on the nutch-dev list, I just proposed that we split  
> these packages into a new project that Nutch then depends on, since  
> there seems to be interest in using them independently of Nutch.   
> Such a split probably wouldn't happen for at least a month.
>
> http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00312.html


Awesome, thanks Doug!  I really believe that having this out as a
separate project will be more useful for everyone.  This will also
give more exposure to Nutch and Lucene as a whole, because people
will experiment with the NDFS/MapReduce stuff first (it is a smaller
thing to comprehend).

cheers,

Paul

Re: Map-Reduce

Posted by Doug Cutting <cu...@apache.org>.
Doug Cutting wrote:
> Perhaps we need to factor Nutch into two projects, one with NDFS and 
> MapReduce and the other with the search-specific code.  This falls 
> almost exactly on package lines.  The packages 
> org.apache.nutch.{io,ipc,fs,ndfs,mapred} are not dependent on the rest 
> of Nutch.

FYI, over on the nutch-dev list, I just proposed that we split these 
packages into a new project that Nutch then depends on, since there 
seems to be interest in using them independently of Nutch.  Such a split 
probably wouldn't happen for at least a month.

http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00312.html

Doug


Re: Map-Reduce

Posted by Doug Cutting <cu...@apache.org>.
Paul Smith wrote:
> I know there's a mapreduce branch in the nutch project, but is there  
> any plan/talk of perhaps integrating something like that directly  into 
> the Lucene API?  For projects that need a lower-level API like  Lucene, 
> rather than the crawl-like nature of Nutch, the potential to  index lots 
> of information in an efficient manner is very appealing  indeed.

You can easily use NDFS and MapReduce from Nutch without using Nutch's 
crawler.

Perhaps we need to factor Nutch into two projects, one with NDFS and 
MapReduce and the other with the search-specific code.  This falls 
almost exactly on package lines.  The packages 
org.apache.nutch.{io,ipc,fs,ndfs,mapred} are not dependent on the rest 
of Nutch.

But you don't need to wait for such a split in order to use NDFS and 
MapReduce.  Just check out the mapred branch from SVN and don't use the 
parts you don't need.  If you find it useful, then argue for the 
creation of a new project.

Doug


Re: Map-Reduce

Posted by Tom White <to...@gmail.com>.
This might be what you're looking for: http://computefarm.jini.org/.

Cheers,

Tom

On 8/4/05, Cheolgoo Kang <ap...@gmail.com> wrote:
> Yeah, it would be great if we had a Directory subclass like MapReduceDirectory.
> 
> I'm looking for ComputeFarm, which implements a distributed
> parallel computing environment on JINI technology.


Re: Map-Reduce

Posted by Cheolgoo Kang <ap...@gmail.com>.
Yeah, it would be great if we had a Directory subclass like MapReduceDirectory.

I'm looking for ComputeFarm, which implements a distributed
parallel computing environment on JINI technology.


-- 
Regards,
Cheolgoo Kang


Re: Map-Reduce

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Thanks.  I saw that, but I was curious about the actual presentation
(what exactly Doug said).

Otis

--- Chris Lamprecht <cl...@gmail.com> wrote:

> Maybe you already saw this; I hit it accidentally.  It contains a few
> other files, including one called mapred.pdf
> 
> http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/


Re: Map-Reduce

Posted by Chris Lamprecht <cl...@gmail.com>.
Maybe you already saw this; I hit it accidentally.  It contains a few
other files, including one called mapred.pdf

http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/

On 8/4/05, Otis Gospodnetic <ot...@yahoo.com> wrote:
> > [1] http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf
> 
> Does anyone have any more info from Doug's MapReduce presentation
> (transcript, notes, audio, video)?


Re: Map-Reduce

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Aug 4, 2005, at 1:27 PM, Otis Gospodnetic wrote:
>> [1] http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf
>>
>
> Does anyone have any more info from Doug's MapReduce presentation
> (transcript, notes, audio, video)?

I was at Doug's OSCON presentation but did not see anyone taking  
photos or video.  Perhaps someone transcribed it, but I did not.  I  
was too busy being floored by the magnitude of what Doug has done.   
Killing the Internet Archive with the MapReduce implementation of  
Nutch crawling is mighty impressive!

     Erik



Re: Map-Reduce

Posted by Otis Gospodnetic <ot...@yahoo.com>.
> [1] http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf

Does anyone have any more info from Doug's MapReduce presentation
(transcript, notes, audio, video)?

Thanks,
Otis

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ -- Find it. Tag it. Share it.
