Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2005/10/04 19:12:07 UTC
[Nutch Wiki] Update of "SimpleMapReduceTutorial" by EarlCahill
The following page has been changed by EarlCahill:
http://wiki.apache.org/nutch/SimpleMapReduceTutorial
New page:
This is the simplest MapReduce example I could come up with: it uses the local filesystem and indexes just one segment. I am running Ubuntu on an Athlon 3200+ with a cable modem connection.
== Designate Url ==
First, change into the mapred branch directory.
{{{
cd nutch/branches/mapred
}}}
We need to make a directory that contains files, where each line of each file is a URL. I chose http://lucene.apache.org/nutch/
{{{
mkdir urls
echo "http://lucene.apache.org/nutch/" > urls/urls
}}}
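If you want to seed more than one URL, the same layout should hold: one URL per line. The following is a minimal sketch only. The `SEED_DIR` variable and the second URL are hypothetical, and a temporary directory is used so the sketch does not overwrite the real `urls` directory.

```shell
# Hypothetical seed directory inside a temp dir, for illustration only.
SEED_DIR=$(mktemp -d)/urls
mkdir -p "$SEED_DIR"
# printf repeats its format for each argument, so each URL gets its own line.
# The second URL is a made-up example under the same site.
printf '%s\n' \
  "http://lucene.apache.org/nutch/" \
  "http://lucene.apache.org/nutch/tutorial.html" \
  > "$SEED_DIR/urls"
# Sanity check: count the lines that look like http URLs.
SEED_COUNT=$(grep -c '^http://' "$SEED_DIR/urls")
echo "$SEED_COUNT seed URLs written"
```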
We also need to change the crawl filter to include this site.
{{{
perl -pi -e 's|MY.DOMAIN.NAME|lucene.apache.org/nutch|' conf/crawl-urlfilter.txt
}}}
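To preview what that substitution does, it can be tried on a single sample line before touching the file. This sketch assumes the filter file contains an accept rule of the form held in `SAMPLE_RULE` (check your own copy), and it uses sed instead of perl only to keep the example short.

```shell
# Assumed form of the accept rule in conf/crawl-urlfilter.txt; verify against your copy.
SAMPLE_RULE='+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/'
# Same substitution as the perl one-liner, applied to the sample line only.
NEW_RULE=$(printf '%s\n' "$SAMPLE_RULE" | sed 's|MY.DOMAIN.NAME|lucene.apache.org/nutch|')
echo "$NEW_RULE"
```

Note that the dots in the resulting rule are unescaped, so as a regular expression they match any character. That is looser than strictly necessary but harmless for this example.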
We walk through the following steps: crawl, generate, fetch, updatedb, invertlinks, index.
== Crawl ==
We want to run crawl on the urls directory from above.
{{{
./bin/nutch crawl urls
}}}
This took me about ten minutes. The output included:
051004 003916 178 pages, 17 errors, 0.4 pages/s, 48 kb/s
The errors generally seemed to be timeouts.
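If you want to pull the counts out of a status line like the one above, awk can split it on whitespace. This is a rough sketch that assumes the line always has the same field layout as the sample; the variable names are my own.

```shell
# Status line copied from the crawl output above.
STATUS_LINE='051004 003916 178 pages, 17 errors, 0.4 pages/s, 48 kb/s'
# Whitespace-separated fields: date, time, page count, "pages,", error count, "errors,", ...
PAGES=$(printf '%s\n' "$STATUS_LINE" | awk '{print $3}')
ERRORS=$(printf '%s\n' "$STATUS_LINE" | awk '{print $5}')
echo "pages=$PAGES errors=$ERRORS"
```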
The remaining commands depend on timestamped directory names, so shell variables help keep track of the paths.
== Generate ==
Here we generate a new segment in the segments directory, using the crawl db from the crawl above.
{{{
CRAWLDB=`find crawl-2* -name crawldb`
SEGMENTS_DIR=`find crawl-2* -maxdepth 1 -name segments`
./bin/nutch generate $CRAWLDB $SEGMENTS_DIR
}}}
Took less than five seconds.
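The `find` lookups above return every match, so if more than one `crawl-2*` directory exists the variable ends up holding several paths. A small guard can catch that before running a step. This is a sketch under that assumption; `is_single_path` is a hypothetical helper, not part of Nutch, and the directory names are made up.

```shell
# Hypothetical helper: succeed only if the value is exactly one non-empty line.
is_single_path() {
  # grep -c . counts non-empty lines, so an empty value counts as 0.
  [ "$(printf '%s\n' "$1" | grep -c .)" -eq 1 ]
}

# Demonstration with made-up values rather than a real crawl directory.
ONE='crawl-20051004003000/crawldb'
TWO='crawl-20051004003000/crawldb
crawl-20051005120000/crawldb'
is_single_path "$ONE" && echo "one: ok"
is_single_path "$TWO" || echo "two: ambiguous"
is_single_path ""     || echo "empty: nothing found"
```

In a real run the check would look something like `is_single_path "$CRAWLDB" || exit 1` before calling generate.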
== Fetch ==
{{{
SEGMENT=`find crawl-2*/segments/2* -maxdepth 0 | tail -1`
./bin/nutch fetch $SEGMENT
}}}
This took about seven minutes, and the output looked like:
051004 004931 65 pages, 404 errors, 0.2 pages/s, 19 kb/s,
Again, many timeouts.
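The `SEGMENT` lookup above relies on segment directories sorting by name, so that `tail -1` picks the newest one. The sketch below shows that behavior on a throwaway tree. It assumes segment names are timestamps, and the specific names used here are invented for the demonstration.

```shell
# Build a throwaway tree with hypothetical timestamped names.
DEMO=$(mktemp -d)
mkdir -p "$DEMO/crawl-20051004003000/segments/20051004003916"
mkdir -p "$DEMO/crawl-20051004003000/segments/20051004004210"
# The shell expands the glob in sorted order, find prints each start path
# (-maxdepth 0 stops it from descending), and tail -1 keeps the last one.
LATEST=$(cd "$DEMO" && find crawl-2*/segments/2* -maxdepth 0 | tail -1)
echo "$LATEST"
rm -rf "$DEMO"
```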
== UpdateDB ==
{{{
./bin/nutch updatedb $CRAWLDB $SEGMENT
}}}
Took less than five seconds.
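The timings in this tutorial were noted by hand. If you would rather have the shell record them, a small wrapper around each step is one option. This is a sketch only; `time_step` is a hypothetical helper, and it reports whole seconds from `date +%s`.

```shell
# Hypothetical helper: run a command, then report elapsed seconds and exit status.
time_step() {
  label=$1
  shift
  start=$(date +%s)
  "$@"
  status=$?
  end=$(date +%s)
  echo "$label: $((end - start))s, exit status $status"
  return "$status"
}

# Demonstration with a trivial command standing in for a Nutch step.
TIMING=$(time_step demo true)
echo "$TIMING"
```

For a real step the call would be along the lines of `time_step updatedb ./bin/nutch updatedb $CRAWLDB $SEGMENT`.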
== InvertLinks ==
{{{
LINKDB=`find crawl-2* -maxdepth 1 -name linkdb`
SEGMENTS=`find crawl-2* -maxdepth 1 -name segments`
./bin/nutch invertlinks $LINKDB $SEGMENTS
}}}
Took less than five seconds.
== Index ==
We need a place for our index, say myindex.
{{{
mkdir myindex
}}}
Now, let's index.
{{{
./bin/nutch index myindex $LINKDB $SEGMENT
}}}
Took less than ten seconds.
== Test ==
The best test I have for the moment is
{{{
ls -alR myindex
}}}
If you see several files, it at least did something. Happy nutching!
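The same "did it write anything" check can be scripted. The sketch below tests whether a directory contains at least one regular file. `has_files` is a hypothetical helper, and the demonstration runs on throwaway directories rather than a real index.

```shell
# Hypothetical helper: succeed if the directory contains at least one regular file.
has_files() {
  [ -n "$(find "$1" -type f 2>/dev/null | head -1)" ]
}

# Demonstration on throwaway directories rather than a real index.
EMPTY_DIR=$(mktemp -d)
FULL_DIR=$(mktemp -d)
touch "$FULL_DIR/placeholder"   # stand-in file, not a real index file name
has_files "$FULL_DIR"  && echo "full: has files"
has_files "$EMPTY_DIR" || echo "empty: no files"
rm -rf "$EMPTY_DIR" "$FULL_DIR"
```

Against the index from this tutorial the call would be `has_files myindex`. Like the `ls` check, it only shows that something was written, not that the index is correct.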
Tutorial written by Earl Cahill, 2005.