Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2014/09/04 00:54:28 UTC
[Nutch Wiki] Update of "NutchTutorial" by riverma
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "NutchTutorial" page has been changed by riverma:
https://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=65&rev2=66
Comment:
Added requirements so that new users understand what software is needed to run or build Nutch.
<<TableOfContents(3)>>
== Steps ==
-
{{{#!wiki caution
This tutorial describes the installation and use of Nutch 1.x (the current release is 1.7). For instructions on how to compile and set up Nutch 2.x with HBase, see Nutch2Tutorial.
}}}
+ == Requirements ==
+ * Unix environment, or Windows-[[https://www.cygwin.com/|Cygwin]] environment
+ * Java Runtime/Development Environment (1.5+): http://www.oracle.com/technetwork/java/javase/downloads/index-jsp-138363.html
+ * (Source build only) Apache Ant: http://ant.apache.org/
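Before proceeding, it may help to confirm that the required tools are actually on your `PATH`. A minimal sketch (the tool names match the requirements above; `ant` only matters if you build from source):

```shell
# Quick check that the prerequisites listed above are installed and on PATH.
report="$(for tool in java ant; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done)"
echo "$report"
```

If either tool reports `MISSING`, install it before continuing; `java -version` and `ant -version` will additionally show whether the versions meet the requirements.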
== 1. Setup Nutch from binary distribution ==
* Download a binary package (`apache-nutch-1.X-bin.zip`) from [[http://www.apache.org/dyn/closer.cgi/nutch/|here]].
@@ -27, +30 @@
=== Set up from the source distribution ===
Advanced users may also use the source distribution:
+
* Download a source package (`apache-nutch-1.X-src.zip`)
* Unzip
* `cd apache-nutch-1.X/`
@@ -34, +38 @@
* Now there is a directory `runtime/local` which contains a ready-to-use Nutch installation.
When the source distribution is used `${NUTCH_RUNTIME_HOME}` refers to `apache-nutch-1.X/runtime/local/`. Note that
+
* config files should be modified in `apache-nutch-1.X/runtime/local/conf/`
* `ant clean` will remove this directory (keep copies of modified config files)
@@ -63, +68 @@
{{{
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
}}}
-
On Debian or Ubuntu, you can run the following command or add it to ~/.bashrc:
+
{{{
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
}}}
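The Debian/Ubuntu line above resolves `/usr/bin/java` through any symlinks with `readlink -f` and then strips the trailing `bin/java` with `sed`. To see what the `sed` step does in isolation, you can feed it a sample path (the JVM location below is purely illustrative):

```shell
# Demonstrate the sed substitution used above on a hypothetical resolved path.
# "s:bin/java::" uses ':' as the delimiter and deletes the "bin/java" suffix,
# leaving the JVM home directory.
sample="/usr/lib/jvm/java-7-openjdk-amd64/bin/java"
java_home="$(echo "$sample" | sed "s:bin/java::")"
echo "$java_home"    # /usr/lib/jvm/java-7-openjdk-amd64/
```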
@@ -98, +103 @@
This will include any URL in the domain `nutch.apache.org`.
=== 3.1 Using the Crawl Command ===
-
{{{#!wiki caution
The crawl command is deprecated. Please see section [[#A3.3._Using_the_crawl_script|3.3]] on how to use the crawl script that is intended to replace the crawl command.
}}}
-
Now we are ready to initiate a crawl. Use the following parameters:
* '''-dir''' ''dir'' names the directory to put the crawl in.
@@ -192, +195 @@
{{{
bin/nutch fetch $s1
}}}
-
Then we parse the entries:
{{{
bin/nutch parse $s1
}}}
-
When this is complete, we update the database with the results of the fetch:
{{{
@@ -247, +248 @@
Usage: bin/nutch solrindex <solr url> <crawldb> [-linkdb <linkdb>][-params k1=v1&k2=v2...] (<segment> ...| -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]
Example: bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize
}}}
-
==== Step-by-Step: Deleting Duplicates ====
Once the entire contents have been indexed, duplicate URLs must be removed; this ensures that the URLs in the index are unique.
@@ -260, +260 @@
Usage: bin/nutch solrdedup <solr url>
Example: bin/nutch solrdedup http://localhost:8983/solr
}}}
-
==== Step-by-Step: Cleaning Solr ====
The class scans a crawldb directory looking for entries with status DB_GONE (404) and sends delete requests to Solr for those documents. Once Solr receives the request, the documents are duly deleted. This maintains a healthier Solr index.
@@ -268, +267 @@
Usage: bin/nutch solrclean <crawldb> <solrurl>
Example: bin/nutch solrclean crawl/crawldb/ http://localhost:8983/solr
}}}
-
=== 3.3. Using the crawl script ===
-
If you have followed section 3.2 above, which shows how crawling can be done step by step, you might be wondering how a bash script could automate the whole process.
- Nutch developers have written one for you :), and it is available at [[bin/crawl]].
+ Nutch developers have written one for you :), and it is available at [[bin/crawl]].
{{{
Usage: bin/crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
@@ -281, +278 @@
Or you can use:
Example: bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
}}}
-
-
The crawl script has a lot of parameters set, and you can modify them to suit your needs. It is best to understand the parameters before setting up big crawls.
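Following the usage shown above, a small test crawl might be invoked as below. The seed directory `urls` and the Solr URL are the example values used elsewhere in this tutorial; the crawl ID and the two rounds are arbitrary illustrative choices. The sketch only composes and prints the command line so the pieces are easy to see; run the echoed command from `${NUTCH_RUNTIME_HOME}`.

```shell
# Compose a hypothetical crawl-script invocation matching the usage string:
#   bin/crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
SEED_DIR="urls"
CRAWL_ID="crawl"
SOLR_URL="http://localhost:8983/solr/"
ROUNDS=2
CMD="bin/crawl $SEED_DIR $CRAWL_ID $SOLR_URL $ROUNDS"
echo "$CMD"    # bin/crawl urls crawl http://localhost:8983/solr/ 2
```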
-
== 4. Setup Solr for search ==
* download binary file from [[http://www.apache.org/dyn/closer.cgi/lucene/solr/|here]]
@@ -311, +305 @@
{{{
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
}}}
-
The call signature for running solrindex has changed. The linkdb is now optional, so to include it you must specify it with the "-linkdb" flag on the command line.
This will send all crawl data to Solr for indexing. For more information please see [[bin/nutch solrindex]].
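After `solrindex` completes, one way to sanity-check that documents reached Solr is to query the index size. The sketch below only composes the query URL; the `select` handler and `wt=json` parameter are assumptions about a default Solr setup, so adjust them for yours. Uncomment the `curl` line to run it against a live Solr.

```shell
# Compose a Solr query that asks only for the document count (rows=0);
# the base URL matches the one used in this tutorial.
SOLR_URL="http://127.0.0.1:8983/solr"
QUERY="$SOLR_URL/select?q=*:*&rows=0&wt=json"
echo "$QUERY"
# curl "$QUERY"    # against a live Solr, the response's numFound is the index size
```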