Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2011/07/18 03:17:52 UTC

[Nutch Wiki] Update of "RunningNutchAndSolr" by EricPugh

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "RunningNutchAndSolr" page has been changed by EricPugh:
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=64&rev2=65

Comment:
minor edits for clarity.

  
  This tutorial was originally constructed and posted by 'waycool' on the user lists. It has been edited slightly for integration into the Apache Nutch project.
  
- Apache Nutch is an open source web crawler written in Java. By using it, we can find out the hyperlinks in automated manner, reduce lots of maintenance work, for example checking broken links, and create a copy of all the visited pages for future search. That’s where Apache Solr comes in. Solr is an open source full text search framework, with Solr we can search the visited pages from Nutch. Luckily, integration between Nutch and Solr is pretty straightforward as explained below.
+ Apache Nutch is an open source web crawler written in Java. With it, we can find web page hyperlinks in an automated manner, reduce maintenance work such as checking for broken links, and create a copy of all the visited pages for searching. That’s where Apache Solr comes in. Solr is an open source full text search framework; with Solr we can search the pages visited by Nutch. Luckily, integration between Nutch and Solr is pretty straightforward, as explained below.
  
- Apache Nutch release 1.3 has Solr integration embedded, this greatly eases Nutch-Solr integration. It also removes the legacy dependence upon both Apache Tomcat for running the old Nutch Web Application and upon Apache Lucene for indexing. Just download a 1.3 release from [[http://www.apache.org/dyn/closer.cgi/nutch/|here]]. NOTE: You can download release 1.3 in either binary or source format, both of which are covered in this tutorial.
+ Apache Nutch release 1.3 has Solr integration embedded, greatly simplifying Nutch-Solr integration. It also removes the legacy dependence upon both Apache Tomcat for running the old Nutch Web Application and upon Apache Lucene for indexing. Just download a 1.3 release from [[http://www.apache.org/dyn/closer.cgi/nutch/|here]]. NOTE: You can download release 1.3 in either binary or source format, both of which are covered in this tutorial.
  
  == Table of Contents ==
  <<TableOfContents(3)>>
@@ -60, +60 @@

  </property>
  }}}
   * mkdir -p urls
-  * create a file nutch under /urls with the following content (or any site you want Nutch to crawl).
+  * create a file named nutch under urls/ with the following content (one URL per line for each site you want Nutch to crawl).
  {{{
  http://nutch.apache.org/
  }}}
@@ -68, +68 @@

  {{{
  bin/nutch crawl urls -dir crawl -depth 3 -topN 5
  }}}
-  * Now you should be able to see the following directories exist:
+  * Now you should be able to see the following directories created:
  {{{
  crawl/crawldb 
  crawl/linkdb
@@ -102, +102 @@

  
  == 6. Integrate Solr with Nutch ==
  
- We have both Nutch and Solr installed and setup correctly. And Nutch already created crawl data from the seed url(s). Below are the steps to delagte searching to Solr for links to be searchable:
+ We have both Nutch and Solr installed and set up correctly, and Nutch has already created crawl data from the seed URL(s). Below are the steps to delegate searching to Solr so that the crawled links become searchable:
   * cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/ 
   * restart Solr with the command “java -jar start.jar” under ${APACHE_SOLR_HOME}/example 
   * run the Solr Index command:
@@ -111, +111 @@

  }}}
  This will send all crawl data to Solr for indexing. For more information please see bin/nutch solrindex
   
- If all has gone to plan, we are now ready to search with http://localhost:8983/solr/admin/. 
+ If all has gone to plan, we are now ready to search with http://localhost:8983/solr/admin/. If you want to see the raw HTML indexed by Solr, change the content field in schema.xml to stored:
+ {{{
+ <field name="content" type="text" stored="true" indexed="true"/>
+ }}}
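
As a rough illustration of the seed-file and crawl-output steps touched by this change, the following shell sketch creates the seed file and then reports whether the two crawl directories listed in the diff are present. The variable names SEED_DIR and CRAWL_DIR are assumptions introduced here for readability; they simply mirror the "urls" and "crawl" names used in the tutorial. The script does not run Nutch itself.

```shell
#!/bin/sh
# Sketch only: SEED_DIR and CRAWL_DIR are illustrative names, not part of Nutch.
SEED_DIR=urls
CRAWL_DIR=crawl

# Create the seed directory and a seed file named "nutch", one URL per line.
mkdir -p "$SEED_DIR"
printf '%s\n' 'http://nutch.apache.org/' > "$SEED_DIR/nutch"

# Once "bin/nutch crawl urls -dir crawl -depth 3 -topN 5" has been run,
# report which of the directories listed in the tutorial exist.
for d in crawldb linkdb; do
  if [ -d "$CRAWL_DIR/$d" ]; then
    echo "found $CRAWL_DIR/$d"
  else
    echo "missing $CRAWL_DIR/$d"
  fi
done
```

Running this before the crawl should report both directories as missing; running it again after the crawl command is one quick way to check that the crawl wrote its output where the tutorial expects.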