Posted to user@nutch.apache.org by Jon Blower <jd...@mail.nerc-essc.ac.uk> on 2006/02/24 18:45:03 UTC

Getting started with standalone MapReduce

Dear all,

I have been looking for a Java implementation of Google's MapReduce design
and was very glad to find Nutch.  However, I don't want to use it for web
crawling: I want to experiment with Nutch's MapReduce as a method for
(distributed) searching through some existing, very large datasets that I
have stored in an NFS filesystem.

I've just had a quick try at getting started, armed with the Wiki
(http://wiki.apache.org/nutch/FAQ#head-48f8d8319c3c85953118721f42336613abf7f6b6),
Tom White's blog
(http://weblogs.java.net/blog/tomwhite/archive/2005/09/mapreduce.html) and
the source code.  (I checked out the 0.7 branch from
http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.7/).  I tried
running the Grep application (using Cygwin under WinXP):

cd nutch/bin
./nutch org.apache.nutch.mapReduce.demo.Grep c:\in c:\out 'A|C'

I get messages saying it is parsing nutch-default.xml and nutch-site.xml (in
fact I get these messages twice each), then I get a
java.net.ConnectException that traces to
mapReduce.JobClient.getFs(JobClient.java:195).  I can't figure out what it's
trying to connect to - maybe it's trying to find an NDFS instance?  I just
want to run everything on my local machine for now.
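
My best guess is that JobClient is reading a JobTracker address and a default
filesystem location from the configuration, in which case overriding both to
"local" in nutch-site.xml might keep everything in-process.  The property
names below are only my assumption (based on what nutch-default.xml appears
to define), so treat this as an unverified sketch:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>local</value>  <!-- assumed: use the local filesystem, not NDFS -->
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>  <!-- assumed: run jobs in-process, no JobTracker -->
  </property>
</configuration>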

I think that some of the material on the web is out of date (it refers to a
mapred package rather than the mapReduce package, for example), which is fine
because I understand that this is still under development.  However, if
someone could give me some pointers for using MapReduce in "standalone" mode
(i.e. without using NDFS or doing web crawls) I'd be extremely grateful.

Thanks in advance,
Jon Blower,
University of Reading, UK


Re: Getting started with standalone MapReduce

Posted by Jérôme Charron <je...@gmail.com>.
> I have been looking for a Java implementation of Google's MapReduce design
> and was very glad to find Nutch.  However, I don't want to use it for web
> crawling: I want to experiment with Nutch's MapReduce as a method for
> (distributed) searching through some existing, very large datasets that I
> have stored in an NFS filesystem.

Jon, the MapReduce implementation in Nutch has been split out into a new
Lucene subproject: Hadoop.
I think this is exactly what you need: http://lucene.apache.org/hadoop/
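
If you just want to see a job run end-to-end on one machine, something like
the sketch below should be close.  I'm writing it against the classic
org.apache.hadoop.mapred API (JobConf, JobClient.runJob, MapReduceBase) as it
later stabilized, and I haven't checked every name against the current tree -
the very early sources used UTF8 where this uses Text, for instance - so
treat it as a sketch rather than guaranteed-working code:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class StandaloneWordCount {

  // Mapper: emit (word, 1) for every token in the input line.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      StringTokenizer tok = new StringTokenizer(value.toString());
      while (tok.hasMoreTokens()) {
        word.set(tok.nextToken());
        out.collect(word, ONE);
      }
    }
  }

  // Reducer: sum the counts collected for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      out.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(StandaloneWordCount.class);
    conf.setJobName("wordcount");

    // "local" keeps both the filesystem and the job runner in-process,
    // so no NDFS/HDFS or JobTracker daemons are needed.
    conf.set("fs.default.name", "local");
    conf.set("mapred.job.tracker", "local");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

With both properties set to "local", the job reads and writes ordinary local
directories in a single JVM, which should also work for data sitting on an
NFS mount.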

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/