Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2015/09/29 17:28:08 UTC

[Nutch Wiki] Trivial Update of "WhiteListRobots" by ayeshahasan

The "WhiteListRobots" page has been changed by ayeshahasan:
https://wiki.apache.org/nutch/WhiteListRobots?action=diff&rev1=5&rev2=6

Comment:
Changed the versions of jar in the classpath

  
  {{{
  <property>
-   <name>robot.rules.whitelist</name>
+   <name>http.robot.rules.whitelist</name>
    <value></value>
    <description>Comma separated list of hostnames or IP addresses to ignore robot rules parsing for.
    </description>
@@ -21, +21 @@

  
  {{{
  <property>
-   <name>robot.rules.whitelist</name>
+   <name>http.robot.rules.whitelist</name>
    <value>baron.pagemewhen.com</value>
    <description>Comma separated list of hostnames or IP addresses to ignore robot rules parsing for.
    </description>
@@ -50, +50 @@

  From your nutch SVN or git checkout top-level directory, run this command:
  
  {{{
- java -cp build/apache-nutch-1.10-SNAPSHOT.job:build/apache-nutch-1.10-SNAPSHOT.jar:runtime/local/lib/hadoop-core-1.2.0.jar:runtime/local/lib/crawler-commons-0.5.jar:runtime/local/lib/slf4j-log4j12-1.6.1.jar:runtime/local/lib/slf4j-api-1.7.9.jar:runtime/local/lib/log4j-1.2.15.jar:runtime/local/lib/guava-11.0.2.jar:runtime/local/lib/commons-logging-1.2.jar:runtime/local/lib/commons-cli-1.2.jar org.apache.nutch.protocol.RobotRulesParser robots.txt urls Nutch-crawler
+ java -cp build/apache-nutch-1.11-SNAPSHOT.job:build/apache-nutch-1.11-SNAPSHOT.jar:runtime/local/lib/hadoop-core-1.2.0.jar:runtime/local/lib/crawler-commons-0.6.jar:runtime/local/lib/slf4j-log4j12-1.7.5.jar:runtime/local/lib/slf4j-api-1.7.9.jar:runtime/local/lib/log4j-1.2.17.jar:runtime/local/lib/guava-16.0.1.jar:runtime/local/lib/commons-logging-1.1.3.jar:runtime/local/lib/commons-cli-1.2.jar org.apache.nutch.protocol.RobotRulesParser robots.txt url Nutch-crawler
  }}}
  
  You should see the following output: