Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2012/08/23 21:24:32 UTC

[Nutch Wiki] Trivial Update of "FAQ" by LewisJohnMcgibbney

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "FAQ" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/FAQ?action=diff&rev1=134&rev2=135

  ==== How can I fetch pages that require Authentication? ====
  See the [[HttpAuthenticationSchemes]] wiki page.
  
+ ==== Speed of Fetching seems to decrease between crawl iterations... what's wrong? ====
+ 
+ A possible reason is that 'partition.url.mode' defaults to 'byHost'. This is a reasonable setting, because the URL subsets handled by the fetcher map tasks should be disjoint, so that the same URL is not fetched twice from different machines.
+ 
+ Secondly, 'generate.max.count' defaults to -1, i.e. unlimited. This means that the more URLs you collect, especially from the same host, the more URLs of that host end up in the same fetcher map task!
+ 
+ Because there is also a politeness setting (please keep it!) that makes the fetcher wait, e.g. 30 seconds, between successive requests to the same server, map tasks that contain many URLs for the same server slow down. Since the reduce step can only start once all fetcher map tasks have finished, a single slow map task becomes a bottleneck for the whole processing step.
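+ 
+ For reference, the per-server politeness wait itself is controlled by 'fetcher.server.delay' (the value shown is the usual default; a longer wait such as the 30 seconds above would typically come from a robots.txt Crawl-Delay):
+ {{{
+ <property>
+   <name>fetcher.server.delay</name>
+   <value>5.0</value>
+   <description>The number of seconds the fetcher will delay between
+   successive requests to the same server. Note that this might be
+   overridden by a Crawl-Delay from a robots.txt.
+   </description>
+ </property>
+ }}}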
+ 
+ The following settings may solve your problem:
+ 
+ Map tasks should be split by host:
+ {{{
+ <property>
+   <name>partition.url.mode</name>
+   <value>byHost</value>
+   <description>Determines how to partition URLs. Default value is
+ 'byHost',  also takes 'byDomain' or 'byIP'.
+   </description>
+ </property>
+ }}}
+ 
+ Don't put more than 10000 entries in a single fetch list:
+ {{{
+ <property>
+   <name>generate.max.count</name>
+   <value>10000</value>
+   <description>The maximum number of urls in a single
+   fetchlist. -1 if unlimited. The urls are counted according
+   to the value of the parameter generate.count.mode.
+   </description>
+ </property>
+ }}}
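+ 
+ The counting mode referenced in that description is configured by a companion property; a sketch with the usual defaults (verify against the nutch-default.xml of your version):
+ {{{
+ <property>
+   <name>generate.count.mode</name>
+   <value>host</value>
+   <description>Determines how the URLs are counted for
+   generate.max.count. Default value is 'host', can also be 'domain'.
+   </description>
+ </property>
+ }}}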
+ 
+ Cap how long the fetcher will honor a Crawl-Delay from robots.txt before skipping the page:
+ {{{
+ <property>
+  <name>fetcher.max.crawl.delay</name>
+  <value>10</value>
+  <description>
+  If the Crawl-Delay in robots.txt is set to greater than this value (in
+  seconds) then the fetcher will skip this page, generating an error report.
+  If set to -1 the fetcher will never skip such pages and will wait the
+  amount of time retrieved from robots.txt Crawl-Delay, however long that
+  might be.
+  </description>
+ </property>
+ }}}
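+ 
+ All of these properties belong in conf/nutch-site.xml, which overrides the defaults shipped in conf/nutch-default.xml; a minimal skeleton (paste the property blocks inside the configuration element):
+ {{{
+ <?xml version="1.0"?>
+ <configuration>
+   <!-- property overrides from above go here -->
+ </configuration>
+ }}}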
  === Updating ===
  ==== Isn't there redundant/wasteful duplication between nutch crawldb and solr index? ====
  Nutch maintains a crawldb (and a linkdb, for that matter) of the URLs it has crawled, their fetch status, and the fetch date. This data is kept beyond the fetch so that pages can be re-crawled after the re-crawl interval has elapsed. At the same time, Solr maintains an inverted index of all the fetched pages. It might seem more efficient for Nutch to rely on that index instead of maintaining its own crawldb, so the same URL is not stored twice. The problem we face here is what Nutch would do if we wished to switch the Solr core we index to.