Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2005/10/04 19:12:07 UTC

[Nutch Wiki] Update of "SimpleMapReduceTutorial" by EarlCahill

The following page has been changed by EarlCahill:
http://wiki.apache.org/nutch/SimpleMapReduceTutorial

New page:
This is the simplest MapReduce example I could come up with: local filesystem, just getting one segment indexed. I am running Ubuntu on an Athlon 3200+ with a cable modem connection.

== Designate Url ==

First, we need to get to the right place:

{{{
cd nutch/branches/mapred
}}}

We need to make a directory that contains files, where each line of each file is a url. I chose http://lucene.apache.org/nutch/ as my only url.

{{{
mkdir urls
echo "http://lucene.apache.org/nutch/" > urls/urls
}}}
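Since each line of each file is supposed to be a url, a quick check before crawling can catch blank lines or typos. This is only a sketch: the helper name check_seeds is made up for illustration (it is not part of Nutch), and it assumes every url starts with http:// or https://.

```shell
# check_seeds DIR: print every line in DIR's files that does not look like
# an http(s) url. Returns 0 if all lines pass, 1 if any fail, 2 if DIR is
# not a directory. (Hypothetical helper, not part of Nutch.)
check_seeds() {
    dir=$1
    [ -d "$dir" ] || return 2
    # -r: recurse into the directory, -v: print non-matching lines,
    # -E: extended regular expressions
    if grep -rvE '^https?://[^[:space:]]+$' "$dir"; then
        return 1    # grep printed at least one bad line
    fi
    return 0
}
```

Running `check_seeds urls` prints nothing and succeeds when the seed files are clean.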

We also need to change the crawl filter to include this site:

{{{
perl -pi -e 's|MY.DOMAIN.NAME|lucene.apache.org/nutch|' conf/crawl-urlfilter.txt
}}}
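To confirm that the substitution took, it is enough to check that the MY.DOMAIN.NAME placeholder no longer appears in the filter file. A minimal sketch; filter_is_edited is a hypothetical helper name, not something Nutch provides.

```shell
# filter_is_edited FILE: succeed only if the MY.DOMAIN.NAME placeholder
# is gone from FILE. Returns 2 if FILE does not exist.
# (Hypothetical helper, not part of Nutch.)
filter_is_edited() {
    file=$1
    [ -f "$file" ] || return 2
    # -q: quiet, -F: treat the pattern as a fixed string, not a regex
    if grep -qF 'MY.DOMAIN.NAME' "$file"; then
        return 1    # placeholder still present
    fi
    return 0
}
```

For example, `filter_is_edited conf/crawl-urlfilter.txt || echo "filter still has the placeholder"`.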

We walk through the following steps: crawl, generate, fetch, updatedb, invertlinks, index.

== Crawl ==

We want to run crawl on the urls directory from above.

{{{
./bin/nutch crawl urls
}}}

This took me about ten minutes. The output included:

051004 003916 178 pages, 17 errors, 0.4 pages/s, 48 kb/s

The errors generally seemed to be timeouts.
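If you save the output to a file, the page and error counts can be pulled out of status lines like the one above. This is a sketch that assumes the line format shown here (a number followed by "pages," and a number followed by "errors,"); crawl_stats is a made-up name.

```shell
# crawl_stats: read status lines on stdin and print "PAGES ERRORS" for each
# line that has both counts. (Hypothetical helper; assumes the format
# "... 178 pages, 17 errors, ..." shown in this tutorial.)
crawl_stats() {
    awk '/ pages, / && / errors, / {
        for (i = 1; i < NF; i++) {
            # the count is the field just before the "pages," / "errors," word
            if ($(i + 1) == "pages,")  pages = $i
            if ($(i + 1) == "errors,") errors = $i
        }
        print pages, errors
    }'
}
```

Piping the status line quoted above through `crawl_stats` prints `178 17`.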

The rest of the commands are a bit more dynamic, since they rely on timestamped directory names and the like. Shell variables help out.

== Generate ==

Here we find the crawldb and the segments directory left by the crawl above, and generate a new segment from them.

{{{
CRAWLDB=`find crawl-2* -name crawldb`
SEGMENTS_DIR=`find crawl-2* -maxdepth 1 -name segments`
./bin/nutch generate $CRAWLDB $SEGMENTS_DIR
}}}
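The find commands above quietly give an empty variable if nothing matches, and several lines if more than one crawl directory is lying around. A guarded lookup is one way around that. This is a sketch only; find_one is a hypothetical helper, not a Nutch command.

```shell
# find_one NAME ROOT...: print the single directory called NAME found under
# the given roots; fail if there are zero or several matches.
# (Hypothetical helper, not part of Nutch.)
find_one() {
    name=$1
    shift
    matches=$(find "$@" -type d -name "$name" 2>/dev/null || true)
    # count the non-empty lines in the result
    count=$(printf '%s\n' "$matches" | grep -c . || true)
    if [ "$count" -ne 1 ]; then
        echo "expected one $name, found $count" >&2
        return 1
    fi
    printf '%s\n' "$matches"
}
```

With it, the first assignment could become `CRAWLDB=$(find_one crawldb crawl-2*) || exit 1`.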

Took less than five seconds.

== Fetch ==

{{{
SEGMENT=`find crawl-2*/segments/2* -maxdepth 0 | tail -1`
./bin/nutch fetch $SEGMENT
}}}
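The find and tail pipeline above picks the last segment because the shell expands the glob in sorted order. The same idea can be written with an explicit check that a segment was found. A sketch, assuming segment names are fixed-width timestamps starting with 2 so that sorted order is chronological; latest_segment is a made-up name.

```shell
# latest_segment DIR: print the newest segment directory under DIR.
# Relies on timestamp names sorting chronologically.
# (Hypothetical helper, not part of Nutch.)
latest_segment() {
    segments_dir=$1
    newest=
    for d in "$segments_dir"/2*/; do
        [ -d "$d" ] || continue    # the glob matched nothing
        newest=${d%/}              # globs expand sorted, so the last one wins
    done
    [ -n "$newest" ] || return 1
    printf '%s\n' "$newest"
}
```

For example, `SEGMENT=$(latest_segment "$SEGMENTS_DIR") || exit 1`.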

This took about seven minutes, and the output looked like:

051004 004931 65 pages, 404 errors, 0.2 pages/s, 19 kb/s,

Again, many timeouts.

== UpdateDB ==

{{{
./bin/nutch updatedb $CRAWLDB $SEGMENT
}}}

Took less than five seconds.

== InvertLinks ==

{{{
LINKDB=`find crawl-2* -maxdepth 1 -name linkdb`
SEGMENTS=`find crawl-2* -maxdepth 1 -name segments`
./bin/nutch invertlinks $LINKDB $SEGMENTS
}}}

Took less than five seconds.

== Index ==

We need a place for our index, say myindex:

{{{
mkdir myindex
}}}

Now, let's index.

{{{
./bin/nutch index myindex $LINKDB $SEGMENT
}}}

Took less than ten seconds.

== Test ==

The best test I have for the moment is:

{{{
ls -alR myindex
}}}

If you see several files, it at least did something. Happy nutching!
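For a script, the "several files" check can be done by counting instead of eyeballing the listing. A minimal sketch; index_file_count is a hypothetical helper name, and a nonzero count only says that something was written, not that the index is any good.

```shell
# index_file_count DIR: print the number of regular files under DIR.
# (Hypothetical helper, not part of Nutch.)
index_file_count() {
    # tr strips the padding some wc implementations put before the number
    find "$1" -type f | wc -l | tr -d ' '
}
```

For example, `[ "$(index_file_count myindex)" -gt 0 ] && echo "myindex has files"`.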

Tutorial written by Earl Cahill, 2005.