Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2010/06/13 12:41:41 UTC

[Nutch Wiki] Trivial Update of "RunningNutchAndSolr" by AlexMc

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "RunningNutchAndSolr" page has been changed by AlexMc.
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=32&rev2=33

--------------------------------------------------

  
  '''d.''' Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste the following fragment into it
  
+ {{{
  <requestHandler name="/nutch" class="solr.SearchHandler" >
- 
  <lst name="defaults">
- 
  <str name="defType">dismax</str>
- 
  <str name="echoParams">explicit</str>
- 
  <float name="tie">0.01</float>
- 
  <str name="qf">
- 
  content^0.5 anchor^1.0 title^1.2 </str>
- 
  <str name="pf"> content&#94;0.5 anchor&#94;1.5 title&#94;1.2 site&#94;1.5 </str>
- 
  <str name="fl"> url </str>
- 
  <str name="mm"> 2<-1 5<-2 6<90% </str>
- 
  <int name="ps">100</int>
- 
  <bool name="hl">true</bool>
- 
  <str name="q.alt">*:*</str>
- 
  <str name="hl.fl">title url content</str>
- 
  <str name="f.title.hl.fragsize">0</str>
- 
  <str name="f.title.hl.alternateField">title</str>
- 
  <str name="f.url.hl.fragsize">0</str>
- 
  <str name="f.url.hl.alternateField">url</str>
- 
  <str name="f.content.hl.fragmenter">regex</str>
- 
  </lst>
- 
  </requestHandler>
+ }}}
  
  '''6.''' Start Solr
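  
  A minimal way to bring up the example Solr instance (assuming the apache-solr-1.3.0 example layout from step d, which ships with a Jetty start.jar) is something like:
  
  {{{
  cd apache-solr-1.3.0/example
  java -jar start.jar
  }}}
  
  Once Jetty reports it is listening (port 8983 by default), the admin page at http://localhost:8983/solr/admin/ should confirm that Solr is up.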
  
@@ -86, +68 @@

  
  '''a.''' Open nutch-site.xml in the directory apache-nutch-1.0/conf and replace its contents with the following (we specify our crawler name and active plugins, and limit the maximum url count per host to 100 for a single run):
  
+ {{{
+ <?xml version="1.0"?>
+ <configuration>
+ <property>
+ <name>http.agent.name</name>
+ <value>nutch-solr-integration</value>
+ </property>
+ <property>
+ <name>generate.max.per.host</name>
+ <value>100</value>
+ </property>
+ <property>
+ <name>plugin.includes</name>
+ <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
+ </property>
+ </configuration>
+ }}}
  
- <property>
  
- <name>http.agent.name</name>
+ '''b.''' Open regex-urlfilter.txt in the directory apache-nutch-1.0/conf and replace its content with something similar to the following:
  
- <value>nutch-solr-integration</value>
+ {{{
+ -^(https|telnet|file|ftp|mailto):
+ # skip some suffixes 
+ -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
+ # skip URLs containing certain characters as probable queries, etc. 
+ -[?*!@=]
+ # allow urls in foofactory.fi domain (or lucidimagination.com...)
+ +^http://([a-z0-9\-A-Z]*\.)*lucidimagination.com/
+ # deny anything else 
+ -.
+ }}}
  
- </property>
- 
- <property> <name>generate.max.per.host</name>
- 
- <value>100</value>
- 
- </property>
- 
- <property>
- 
- <name>plugin.includes</name>
- 
- <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
- 
- </property>
- 
- </configuration>
- 
- '''b.''' Open regex-urlfilter.txt in directory apache-nutch-1.0/conf,replace it’s content with following:
- 
- -^(https|telnet|file|ftp|mailto):
- 
- # skip some suffixes 
- 
- -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
- 
- # skip URLs containing certain characters as probable queries, etc. 
- 
- -[?*!@=]
- 
- # allow urls in foofactory.fi domain (or lucidimagination.com...)
- 
- +^http://([a-z0-9\-A-Z]*\.)*lucidimagination.com/
- 
- # deny anything else 
- 
- -.
  
  '''8.''' Create a seed list (the initial urls to fetch)
  
+ {{{
  mkdir urls 
  echo "http://www.lucidimagination.com/" > urls/seed.txt
+ }}}
  
  '''9.''' Inject seed url(s) to nutch crawldb (execute in nutch directory)
  
+ {{{
  bin/nutch inject crawl/crawldb urls
+ }}}
  
  '''10.''' Generate fetch list, fetch and parse content
  
+ {{{
  bin/nutch generate crawl/crawldb crawl/segments
+ }}}
  
  The above command will generate a new segment directory under crawl/segments; at this point it contains only the files that list the url(s) to be fetched. The following commands need the latest segment directory as a parameter, so we'll store it in an environment variable:
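  
  One common POSIX-shell idiom for this (the variable name SEGMENT is only illustrative; segment directories are named by timestamp, so the newest one sorts last in a time-ordered listing):
  
  {{{
  export SEGMENT=crawl/segments/`ls -tr crawl/segments | tail -1`
  }}}
  
  Subsequent fetch, parse and update commands can then refer to $SEGMENT instead of spelling out the full directory name.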