Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2010/06/29 15:48:03 UTC

[Nutch Wiki] Trivial Update of "NutchTutorial" by AlexMc

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "NutchTutorial" page has been changed by AlexMc.
The comment on this change is: formatting corrections only.
http://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=22&rev2=23

--------------------------------------------------

  
  The injector adds URLs to the crawldb. Let's inject URLs from the DMOZ Open Directory. First we must download and uncompress the file listing all of the DMOZ pages. (This is a 200+ MB file, so this will take a few minutes.)
  
+ {{{ 
- {{{ wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
+ wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
- gunzip content.rdf.u8.gz }}}
+ gunzip content.rdf.u8.gz 
+ }}}
  
  Next we select a random subset of these pages. (We use a random subset so that everyone who runs this tutorial doesn't hammer the same sites.) DMOZ contains around three million URLs. We select one out of every 5000, so that we end up with around 1000 URLs:
  
+ {{{ 
- {{{ mkdir dmoz
+ mkdir dmoz
- bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls }}}
+ bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls 
+ }}}
  
  The parser also takes a few minutes, as it must parse the full file. Finally, we initialize the crawl db with the selected URLs.
  
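[Editor's sketch: the inject command itself falls outside this diff hunk. As a rough sketch of the standard Nutch workflow — the crawl/crawldb path is assumed from the surrounding commands, not taken from this diff — the step would look like:]

```shell
# Initialize the crawl db with the selected DMOZ URLs.
# crawl/crawldb is an assumed path, consistent with the later
# updatedb commands in this page.
bin/nutch inject crawl/crawldb dmoz
```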
@@ -123, +127 @@

  
  This generates a fetchlist for all of the pages due to be fetched. The fetchlist is placed in a newly created segment directory. The segment directory is named by the time it's created. We save the name of this segment in the shell variable {{{s1}}}:
  
+ {{{ 
- {{{ s1=`ls -d crawl/segments/2* | tail -1`
+ s1=`ls -d crawl/segments/2* | tail -1`
- echo $s1 }}}
+ echo $s1 
+ }}}
  
  Now we run the fetcher on this segment with:
  
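[Editor's sketch: the fetch command is elided by the hunk boundary here. A plausible sketch, reusing the $s1 segment variable captured above — treat this as an illustration of the standard Nutch CLI, not the page's exact text:]

```shell
# Fetch the segment whose name was captured in $s1, then fold
# the fetched data back into the crawl db.
bin/nutch fetch $s1
bin/nutch updatedb crawl/crawldb $s1
```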
@@ -139, +145 @@

  
  Now we generate and fetch a new segment containing the top-scoring 1000 pages:
  
+ {{{ 
- {{{ bin/nutch generate crawl/crawldb crawl/segments -topN 1000
+ bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s2=`ls -d crawl/segments/2* | tail -1`
  echo $s2
  
  bin/nutch fetch $s2
- bin/nutch updatedb crawl/crawldb $s2 }}}
+ bin/nutch updatedb crawl/crawldb $s2 
+ }}}
  
  Let's fetch one more round:
  
+ {{{ 
- {{{ bin/nutch generate crawl/crawldb crawl/segments -topN 1000
+ bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s3=`ls -d crawl/segments/2* | tail -1`
  echo $s3
  
  bin/nutch fetch $s3
- bin/nutch updatedb crawl/crawldb $s3 }}}
+ bin/nutch updatedb crawl/crawldb $s3 
+ }}}
  
  By this point we've fetched a few thousand pages. Let's index them!
  
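[Editor's sketch: the indexing commands are elided by the next hunk. For Nutch of this vintage the sequence is roughly as follows — paths are assumed from the surrounding commands, and the exact invocation varies by Nutch version:]

```shell
# Invert links across all segments to build the link database,
# then index the fetched pages (assumed paths).
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
```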
@@ -183, +193 @@

  
  Assuming you've unpacked Tomcat as ~/local/tomcat, then the Nutch war file may be installed with the commands:
  
+ {{{ 
- {{{ rm -rf ~/local/tomcat/webapps/ROOT*
+ rm -rf ~/local/tomcat/webapps/ROOT*
- cp nutch*.war ~/local/tomcat/webapps/ROOT.war }}}
+ cp nutch*.war ~/local/tomcat/webapps/ROOT.war 
+ }}}
  
  The webapp finds its indexes in ./crawl, relative to where you start Tomcat, so use a command like:
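[Editor's sketch: the startup command is cut off at the end of this excerpt. A hedged sketch, assuming the ~/local/tomcat path from above; the first cd target is hypothetical — it stands for whichever directory contains ./crawl:]

```shell
# Start Tomcat from the directory that contains ./crawl,
# so the webapp can locate its indexes at the relative path.
cd ~/my-crawl-dir          # hypothetical: wherever ./crawl was created
~/local/tomcat/bin/catalina.sh start
```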