Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2017/11/09 12:56:18 UTC

[Nutch Wiki] Update of "NutchHadoopSingleNodeTutorial" by OmkarReddy

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "NutchHadoopSingleNodeTutorial" page has been changed by OmkarReddy:
https://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial?action=diff&rev1=7&rev2=8

  
  '''1. Step: Download and install Hadoop in pseudo-distributed mode, as explained here:'''
  
-  [[http://hadoop.apache.org/docs/r1.2.1/single_node_setup.html| Hadoop Single Node Setup]].
+  [[https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html| Hadoop Single Node Setup]].
  
  Here, it is important to set ''HADOOP_HOME'' to point to the root of the Hadoop installation.
  Like ''JAVA_HOME'', it has to be set globally so that the Hadoop start-up scripts can be called from anywhere.
  
  (Check this by running: ' ''echo $HADOOP_HOME'' ' in the console, which should return the path to the root of your hadoop installation.)
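A minimal sketch of setting and checking the variable, assuming a POSIX shell. The path /opt/hadoop and the helper name check_hadoop_home are hypothetical and not part of the wiki page; to make the setting global, the export line would typically go into a shell start-up file such as ~/.bashrc.

```shell
# Hypothetical install location; adjust to where Hadoop was actually unpacked.
: "${HADOOP_HOME:=/opt/hadoop}"
export HADOOP_HOME

# Report whether a directory looks like a Hadoop root (contains bin/hadoop).
check_hadoop_home() {
    if [ -x "$1/bin/hadoop" ]; then
        echo "ok: $1"
    else
        echo "missing bin/hadoop under: $1"
        return 1
    fi
}

# Do not abort the shell if the check fails; just print the result.
check_hadoop_home "$HADOOP_HOME" || true
```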
  
- '''''N.B.''''' Make sure your hadoop installation is working correctly before trying to integrate Nutch!
+ '''''N.B.''''' Make sure your hadoop installation is working correctly by running the examples as mentioned in the link above before trying to integrate Nutch!
  
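  E.g. try to connect to the jobtracker at: http://localhost:50030/ (Hadoop 1.x). With Hadoop 2.x, where the YARN ResourceManager replaces the jobtracker, the web interface is at http://localhost:8088/ by default.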
  
@@ -22, +22 @@

  
  '''2. Step: Download and install Nutch 1.x:'''
  
- Download a stable source version e.g. apache-nutch-1.8-src.zip from http://nutch.apache.org/downloads.html.
+ Download a stable source version e.g. apache-nutch-1.13-src.zip from http://nutch.apache.org/downloads.html.
  
- For installation of apache-nutch-1.8-src.zip:
+ For installation of apache-nutch-1.13-src.zip:
  
-  * Unzip and, in the terminal, cd into the freshly extracted folder ''apache-nutch-1.8''
+  * Unzip and, in the terminal, cd into the freshly extracted folder ''apache-nutch-1.13''
  
   * Run ‘ant runtime’ in this folder
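The two steps above can be sketched as a small shell helper. The function names are hypothetical, and the sketch assumes that unzip and ant are on the PATH and that the archive unpacks to a folder named after it without the -src.zip suffix, as the text describes.

```shell
# Folder name that the archive unpacks to, per the text above
# (apache-nutch-1.13-src.zip -> apache-nutch-1.13).
nutch_folder_name() {
    basename "$1" -src.zip
}

# Unpack and build in a subshell so the caller's working directory is unchanged.
build_nutch_runtime() (
    unzip -q "$1" &&
    cd "$(nutch_folder_name "$1")" &&
    ant runtime
)

# Usage (not run here):
#   build_nutch_runtime apache-nutch-1.13-src.zip
```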
  
  This command builds the runtime environment. ''runtime/local'' stores all
  configuration files, libraries etc., but it does not use the Hadoop installation set up here (pseudo-distributed mode). Instead it uses the local (standalone) non-distributed mode, which is often used for debugging and is described in more detail here: 
- [[http://hadoop.apache.org/docs/r1.2.1/single_node_setup.html#Local| Hadoop Standalone Setup]].
+ [[https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation| Hadoop Standalone Setup]].
  
  
  However, the nutch-job jar used for hadoop in pseudo-distributed mode lives in 
  ''runtime/deploy/''. 
  As a consequence, any modification to the configuration files in ''$NUTCH/conf'' (the configuration directory at the root) requires
- a re-build with ‘ant’ to make sure the changes become part of the nutch-job jar as well.   
+ a re-build with ‘ant’ to make sure the changes become part of the nutch-job jar as well.
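One way to notice a forgotten re-build is to compare modification times with find. This is only a sketch: the helper name stale_conf_files is hypothetical, and it assumes the job jar in ''runtime/deploy/'' has a .job extension, which may differ in your build.

```shell
# List conf files modified more recently than the job jar.
# Any output means 'ant' should be re-run. This uses plain find(1),
# it is not a Nutch tool.
stale_conf_files() {
    nutch_root=$1
    for jar in "$nutch_root"/runtime/deploy/*.job; do
        # If the glob matched nothing, there is no job jar yet.
        [ -e "$jar" ] || { echo "no job jar found; run 'ant runtime' first" >&2; return 1; }
        find "$nutch_root/conf" -type f -newer "$jar"
    done
}

# Usage (not run here), from the root of the Nutch source folder:
#   stale_conf_files .
```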
+ 
+ '''''N.B.''''' Make sure that the property ''mapreduce.framework.name'' in ''etc/hadoop/mapred-site.xml'' is set as described in the Hadoop documentation above (that page uses the value ''yarn'' when running on YARN).
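To see what the property is currently set to, a small sed-based helper can print its value. This is a sketch, not a real XML parser: the function name framework_name is hypothetical, and it assumes the usual layout in which the name and value tags sit inside one property element.

```shell
# Print the value of mapreduce.framework.name from a mapred-site.xml file.
# Looks from the matching <name> line up to the closing </property> tag
# and prints the text inside <value>...</value>.
framework_name() {
    sed -n '/<name>mapreduce\.framework\.name<\/name>/,/<\/property>/s:.*<value>\(.*\)</value>.*:\1:p' "$1"
}

# Usage (not run here; path relative to the Hadoop root, as in the note above):
#   framework_name "$HADOOP_HOME/etc/hadoop/mapred-site.xml"
```

Compare the printed value with the one given in the Hadoop single node setup page linked in step 1.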
  
  See: NutchTutorial on how to set up a specific configuration and run a crawl.