Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2014/09/04 00:54:28 UTC

[Nutch Wiki] Update of "NutchTutorial" by riverma

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "NutchTutorial" page has been changed by riverma:
https://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=65&rev2=66

Comment:
Added requirements so that new users understand what software is needed to run or build Nutch.

  <<TableOfContents(3)>>
  
  == Steps ==
- 
  {{{#!wiki caution
  This tutorial describes the installation and use of Nutch 1.x (current release is 1.7). For instructions on how to compile and set up Nutch 2.x with HBase, see Nutch2Tutorial.
  }}}
+ == Requirements ==
+  * Unix environment, or Windows-[[https://www.cygwin.com/|Cygwin]] environment
+  * Java Runtime/Development Environment (1.5+): http://www.oracle.com/technetwork/java/javase/downloads/index-jsp-138363.html
+  * (Source build only) Apache Ant: http://ant.apache.org/
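  A quick way to verify these requirements from a shell (a sketch only; output formats vary by JVM and OS):
  {{{
  # Check for a Java runtime (1.5+ required)
  if command -v java >/dev/null 2>&1; then
    java -version 2>&1 | head -n 1
  else
    echo "Java not found - install a JDK/JRE first"
  fi
  # Ant is only needed when building Nutch from source
  if command -v ant >/dev/null 2>&1; then
    ant -version
  else
    echo "Ant not found - only required for source builds"
  fi
  }}}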
  
  == 1. Setup Nutch from binary distribution ==
   * Download a binary package (`apache-nutch-1.X-bin.zip`) from [[http://www.apache.org/dyn/closer.cgi/nutch/|here]].
@@ -27, +30 @@

  
  === Set up from the source distribution ===
  Advanced users may also use the source distribution:
+ 
   * Download a source package (`apache-nutch-1.X-src.zip`)
   * Unzip
   * `cd apache-nutch-1.X/`
@@ -34, +38 @@

   * Now there is a directory `runtime/local` which contains a ready-to-use Nutch installation.
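  In condensed form, the source setup amounts to the following (assuming Ant and a JDK are installed, and substituting your actual version for `1.X`):
  {{{
  unzip apache-nutch-1.X-src.zip
  cd apache-nutch-1.X/
  ant runtime
  }}}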
  
  When the source distribution is used `${NUTCH_RUNTIME_HOME}` refers to `apache-nutch-1.X/runtime/local/`. Note that
+ 
   * config files should be modified in `apache-nutch-1.X/runtime/local/conf/`
   * `ant clean` will remove this directory (keep copies of modified config files)
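  Since `ant clean` removes that directory, modified config files can be backed up first, for example (the backup location is arbitrary):
  {{{
  # Copy config files to a safe place before running `ant clean`
  BACKUP=conf-backup
  mkdir -p "$BACKUP"
  for f in apache-nutch-1.X/runtime/local/conf/*; do
    if [ -e "$f" ]; then
      cp "$f" "$BACKUP/"
    fi
  done
  }}}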
  
@@ -63, +68 @@

  {{{
  export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
  }}}
- 
  On Debian or Ubuntu, you can run the following command or add it to ~/.bashrc:
+ 
  {{{
  export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
  }}}
@@ -98, +103 @@

  This will include any URL in the domain `nutch.apache.org`.
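  For reference, the filter line in `conf/regex-urlfilter.txt` that accepts this domain typically looks like the following (the exact expression may differ in your version):
  {{{
  +^http://([a-z0-9]*\.)*nutch.apache.org/
  }}}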
  
  === 3.1 Using the Crawl Command ===
- 
  {{{#!wiki caution
  The crawl command is deprecated. Please see section [[#A3.3._Using_the_crawl_script|3.3]] on how to use the crawl script that is intended to replace the crawl command.
  }}}
- 
  Now we are ready to initiate a crawl. Use the following parameters:
  
   * '''-dir''' ''dir'' names the directory to put the crawl in.
@@ -192, +195 @@

  {{{
  bin/nutch fetch $s1
  }}}
- 
  Then we parse the entries:
  
  {{{
  bin/nutch parse $s1
  }}}
- 
  When this is complete, we update the database with the results of the fetch:
  
  {{{
@@ -247, +248 @@

       Usage: bin/nutch solrindex <solr url> <crawldb> [-linkdb <linkdb>][-params k1=v1&k2=v2...] (<segment> ...| -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]
       Example: bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize
  }}}
- 
  ==== Step-by-Step: Deleting Duplicates ====
  Once the entire contents have been indexed, duplicate URLs must be removed; this ensures that the URLs are unique.
  
@@ -260, +260 @@

       Usage: bin/nutch solrdedup <solr url>
     Example: bin/nutch solrdedup http://localhost:8983/solr
  }}}
- 
  ==== Step-by-Step: Cleaning Solr ====
  The class scans the crawldb directory looking for entries with status DB_GONE (404) and sends delete requests to Solr for those documents. Once Solr receives the requests, the documents are duly deleted. This maintains a healthier Solr index.
  
@@ -268, +267 @@

       Usage: bin/nutch solrclean <crawldb> <solrurl>
     Example: bin/nutch solrclean crawl/crawldb/ http://localhost:8983/solr
  }}}
- 
  === 3.3. Using the crawl script ===
- 
  If you have followed section 3.2 above on how crawling can be done step by step, you might be wondering how a bash script can be written to automate the whole process described above.
  
- Nutch developers have written one for you :), and it is available at [[bin/crawl]]. 
+ Nutch developers have written one for you :), and it is available at [[bin/crawl]].
  
  {{{
       Usage: bin/crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
@@ -281, +278 @@

       Or you can use:
       Example: bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
  }}}
- 
- 
  The crawl script has a lot of parameters set, and you can modify the parameters to your needs. It would be ideal to understand the parameters before setting up big crawls.
- 
  
  == 4. Setup Solr for search ==
   * download binary file from [[http://www.apache.org/dyn/closer.cgi/lucene/solr/|here]]
@@ -311, +305 @@

  {{{
  bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
  }}}
- 
  The call signature for running solrindex has changed. The linkdb is now optional, so you need to specify it with a "-linkdb" flag on the command line.
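  For example, the same indexing call without a linkdb (assuming the local Solr instance above) would be:
  {{{
  bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/segments/*
  }}}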
  
  This will send all crawl data to Solr for indexing. For more information, please see [[bin/nutch solrindex]].