Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2010/06/13 14:00:02 UTC

[Nutch Wiki] Update of "RunningNutchAndSolr" by AlexMc

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "RunningNutchAndSolr" page has been changed by AlexMc.
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=33&rev2=34

--------------------------------------------------

  
  I'm posting it under Nutch rather than Solr on the presumption that people are more likely to be learning/using Solr first, then come here looking to combine it with Nutch. I'm going to skip a command-by-command walkthrough for now. I'm running/building on Ubuntu 7.10 using Java 1.6.0_05. I'm assuming that the Solr trunk code is checked out into solr-trunk and the Nutch trunk code is checked out into nutch-trunk.
  
+ 
  == Prerequisites ==
   * apt-get install sun-java6-jdk subversion ant patch unzip
+ 
+ == Ubuntu Note ==
+ 
+ If you are using a more recent version of Ubuntu, Solr comes as a package installable through apt-get.
+ You might wish to install it that way instead of as described below. If so, you will find the Solr config in /etc/solr/conf
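+ 
+ A minimal sketch of the package route (the exact package name is an assumption and varies by release; check apt-cache search solr for what your Ubuntu provides):
+ 
+ {{{
+ # Install Solr from the Ubuntu repositories (solr-jetty is one common
+ # package name; solr-tomcat is another)
+ sudo apt-get install solr-jetty
+ 
+ # The packaged install keeps its configuration here:
+ ls /etc/solr/conf
+ }}}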
  
  == Steps ==
  The first step to get started is to download the required software components, namely Apache Solr and Nutch.
@@ -62, +68 @@

  
  '''6.''' Start Solr
  
+ Assuming you have installed Solr as per the instructions above, start it with:
+ {{{
  cd apache-solr-1.3.0/example
  java -jar start.jar
+ }}}
+ 
+ 
  
  '''7. Configure Nutch'''
  
@@ -119, +130 @@

  bin/nutch generate crawl/crawldb crawl/segments
  }}}
  
- The above command will generate a new segment directory under crawl/segments that at this point contains files that store the url(s) to be fetched. In the following commands we need the latest segment dir as parameter so we’ll store it in an environment variable:
+ The above command will generate a new segment directory under crawl/segments that at this point contains files that store the url(s) to be fetched. In the following commands we need the latest segment dir as a parameter, so we'll store it in an environment variable:
  
+ {{{
  export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
+ echo $SEGMENT
+ }}}
+ 
+ Note: This only works if you are using your local file system. If your crawl is on Hadoop DFS, you will need some other way of setting the SEGMENT environment variable, possibly starting from something like:
+ 
+ {{{
+ bin/hadoop fs  -ls crawl/segments
+ }}}
  
  Now I launch the fetcher that actually goes to get the content:
  
+ {{{
  bin/nutch fetch $SEGMENT -noParsing
+ }}}
  
  Next I parse the content:
  
+ {{{
  bin/nutch parse $SEGMENT
+ }}}
  
  Then I update the Nutch crawldb. The updatedb command will store all new urls discovered during the fetch and parse of the previous segment in the Nutch database so they can be fetched later. Nutch also stores information about the pages that were fetched, so the same urls won't be fetched again and again.
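  
  Assuming the same crawl layout and SEGMENT variable as the earlier steps, the update looks like:
  
  {{{
  bin/nutch updatedb crawl/crawldb $SEGMENT
  }}}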
  
@@ -153, +177 @@

  
  http://127.0.0.1:8983/solr/nutch/?q=solr&version=2.2&start=0&rows=10&indent=on&wt=json
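  
  You can exercise the same query from the command line (this assumes Solr is running on the default port and configured with the nutch path as above):
  
  {{{
  # quote the URL so the shell does not interpret the & characters
  curl 'http://127.0.0.1:8983/solr/nutch/?q=solr&wt=json&indent=on'
  }}}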
  
+ === Comments ===
+ --------------------------------------
+ 
  Hi, I too faced problems integrating Solr and Nutch. After some work I found the article below and integrated them successfully. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/