Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2010/05/25 00:30:57 UTC

[Nutch Wiki] Trivial Update of "RunningNutchAndSolr" by SeanOConnor

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "RunningNutchAndSolr" page has been changed by SeanOConnor.
The comment on this change is: minor formatting changes to address run-on commands.
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=31&rev2=32

--------------------------------------------------

  
  <str name="qf">
  
- content^0.5 anchor^1.0 title^1.2 </str>
+ content&#94;0.5 anchor&#94;1.0 title&#94;1.2 </str>
  
- <str name="pf"> content^0.5 anchor^1.5 title^1.2 site^1.5 </str>
+ <str name="pf"> content&#94;0.5 anchor&#94;1.5 title&#94;1.2 site&#94;1.5 </str>
  
  <str name="fl"> url </str>
  
@@ -116, +116 @@

  
  -^(https|telnet|file|ftp|mailto):
  
- # skip some suffixes -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
+ # skip some suffixes 
  
- # skip URLs containing certain characters as probable queries, etc. -[?*!@=]
+ -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
  
- # allow urls in foofactory.fi domain +^http://([a-z0-9\-A-Z]*\.)*lucidimagination.com/
+ # skip URLs containing certain characters as probable queries, etc. 
  
+ -[?*!@=]
+ 
+ # allow urls in foofactory.fi domain (or lucidimagination.com...)
+ 
+ +^http://([a-z0-9\-A-Z]*\.)*lucidimagination.com/
+ 
- # deny anything else -.
+ # deny anything else 
+ 
+ -.
  
  '''8.''' Create a seed list (the initial urls to fetch)
  
+ mkdir urls 
- mkdir urls echo "http://www.lucidimagination.com/" > urls/seed.txt
+ echo "http://www.lucidimagination.com/" > urls/seed.txt
  
  '''9.''' Inject seed url(s) to nutch crawldb (execute in nutch directory)