Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2008/03/27 21:31:28 UTC

[Nutch Wiki] Update of "RunningNutchAndSolr" by NickTkach

The following page has been changed by NickTkach:
http://wiki.apache.org/nutch/RunningNutchAndSolr

New page:
This is a quick first pass at a guide for getting Nutch running with Solr.  There are probably better ways of doing some or all of it, but I'm not aware of them, so please correct or update this page if you have a better idea.  Many thanks to [http://variogram.com|Brian Whitman at Variogr.am] and [http://blog.foofactory.fi|Sami Siren at FooFactory] for all the help!  You guys saved me a lot of time! :)

I'm posting it under Nutch rather than Solr on the presumption that people are more likely to learn and use Solr first, then come here looking to combine it with Nutch.  For now I'm listing the steps without spelling out every command.  I'm building and running on Ubuntu 7.10 with Java 1.6.0_05, and I'm assuming that the Solr trunk code is checked out into solr-trunk and the Nutch trunk code into nutch-trunk.

 1. Check out solr-trunk and nutch-trunk
 1. Go into the solr-trunk and run 'ant dist dist-solrj'
 1. Get the zip file from [http://variogram.com/latest/SolrIndexer.zip|Variogr.am] and unzip it to solr-trunk
 1. Copy apache-solr-solrj-1.3-dev.jar and apache-solr-common-1.3-dev.jar to nutch-trunk/lib
 1. Get the zip file from [http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html|FooFactory] for SOLR-20
 1. Unzip solr-client.zip somewhere, go into java/solr/src and run 'ant'
 1. Copy solr-client.jar from dist to nutch-trunk/lib
 1. Copy xpp3-1.1.3.4.0.jar from lib to nutch-trunk/lib
 1. Get SolrClientAdapter.java from [http://www.foofactory.fi/files/nutch-solr/nutch_solr.patch|FooFactory patch] and copy it to nutch-trunk/src/java/org/apache/nutch/indexer
 1. Edit nutch-trunk/src/java/org/apache/nutch/indexer/SolrIndexer.java:
   * Replace 'int res = new SolrIndexer().doMain(NutchConfiguration.create(), args);' with 'int res = ToolRunner.run(NutchConfiguration.create(), new SolrIndexer(), args);'
   * Edit the imports to pick up ToolRunner (org.apache.hadoop.util.ToolRunner)
 1. Edit nutch-trunk/src/java/org/apache/nutch/indexer/Indexer.java changing scope on LuceneDocumentWrapper from private to protected
 1. Configure nutch-trunk/conf/nutch-site.xml with settings for your site including a value for property indexer.solr.url (something like http://localhost:8983/solr/)
 1. Put the url(s) you want to crawl in files in a urls directory
 1. Get the [http://www.foofactory.fi/files/nutch-solr/crawl.sh|crawl.sh script] from FooFactory and copy it to nutch-trunk/bin (editing if needed)
 1. Start a Solr server (such as the solr-trunk/example instance)
 1. Run a Nutch crawl using the bin/crawl.sh script.
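
A quick way to double-check the indexer.solr.url setting before crawling is to read it back out of nutch-site.xml.  The sketch below is only an illustration and is not part of Nutch or the FooFactory patch: the class name NutchSiteCheck and its helper methods are made up for this page, and it assumes a Java 7 or newer JDK plus the Hadoop-style property/name/value layout that Nutch configuration files use.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch only (hypothetical helper): looks up one property in a
// Hadoop-style configuration file such as nutch-site.xml.
//
// Expected entry (the value is just the example URL from this page):
//   <property>
//     <name>indexer.solr.url</name>
//     <value>http://localhost:8983/solr/</value>
//   </property>
class NutchSiteCheck {

  /** Returns the trimmed value of the named property, or null if it is absent. */
  static String findProperty(String xml, String name) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      NodeList props = doc.getElementsByTagName("property");
      for (int i = 0; i < props.getLength(); i++) {
        Element prop = (Element) props.item(i);
        if (name.equals(childText(prop, "name"))) {
          return childText(prop, "value");
        }
      }
      return null;
    } catch (Exception e) {
      // Malformed XML ends up here; report it as a bad argument.
      throw new IllegalArgumentException("Could not parse configuration XML", e);
    }
  }

  /** Text of the first descendant element with the given tag, or null if none. */
  private static String childText(Element parent, String tag) {
    NodeList nodes = parent.getElementsByTagName(tag);
    return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
  }

  public static void main(String[] args) throws Exception {
    // Default path is an assumption based on the directory layout in this guide.
    String path = args.length > 0 ? args[0] : "nutch-trunk/conf/nutch-site.xml";
    String xml = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
    String url = findProperty(xml, "indexer.solr.url");
    System.out.println(url == null
        ? "indexer.solr.url is not set in " + path
        : "indexer.solr.url = " + url);
  }
}
```

Run it from the directory that contains nutch-trunk, or pass the path to nutch-site.xml as the first argument.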

If you watch the output from your Solr instance (logs) you should see a bunch of messages scroll by when Nutch finishes crawling and posts new documents.  If you don't, something is probably misconfigured.  I'll try to add more notes here as people have questions or issues.
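
If nothing shows up in the Solr logs, one thing to rule out is that nothing is answering at the URL you configured.  Here is a minimal sketch of such a check using only the JDK.  SolrPing and its methods are hypothetical names, the default URL is just the example value from the steps above, and an HTTP answer only shows that something is listening there, not that indexing is set up correctly.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

// Sketch only (hypothetical helper): checks whether anything answers
// HTTP requests at the URL configured as indexer.solr.url.
class SolrPing {

  /** Returns the HTTP status code for a GET on the URL, or -1 if nothing answers. */
  static int status(String url) {
    try {
      URLConnection raw = new URL(url).openConnection();
      if (!(raw instanceof HttpURLConnection)) {
        return -1; // not an http(s) URL
      }
      HttpURLConnection conn = (HttpURLConnection) raw;
      conn.setRequestMethod("GET");
      conn.setConnectTimeout(3000); // milliseconds
      conn.setReadTimeout(3000);
      try {
        return conn.getResponseCode();
      } finally {
        conn.disconnect();
      }
    } catch (IOException e) {
      // Covers malformed URLs, refused connections and timeouts.
      return -1;
    }
  }

  /** Turns a status code from status() into a short human-readable hint. */
  static String describe(int status) {
    if (status == -1) {
      return "No response: is Solr running at this URL?";
    }
    if (status >= 200 && status < 400) {
      return "Answered with HTTP " + status;
    }
    return "Got HTTP " + status + "; check the host, port and path";
  }

  public static void main(String[] args) {
    // Default is the example value from this guide; pass your own URL as an argument.
    String url = args.length > 0 ? args[0] : "http://localhost:8983/solr/";
    System.out.println(url + ": " + describe(status(url)));
  }
}
```

Pass the same value you put in indexer.solr.url as the first argument, so the check and the crawl point at the same place.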