Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2013/02/07 03:49:59 UTC

[Nutch Wiki] Trivial Update of "FAQ" by LewisJohnMcgibbney

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "FAQ" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/FAQ?action=diff&rev1=135&rev2=136

Comment:
a

  URLs that are already in the database won't be injected.
  
  === Fetching ===
+ 
+ ==== Can I parse during the fetching process? ====
+ In short, yes; however, this is disabled by default (justification follows shortly). To enable it, configure the following property in nutch-site.xml before starting the fetch process.
+ {{{
+ <property>
+   <name>fetcher.parse</name>
+   <value>true</value>
+   <description>If true, fetcher will parse content. Default is false, which means
+   that a separate parsing step is required after fetching is finished.</description>
+ </property>
+ }}} 
+ 
+ '''N.B.''' In a parsing fetcher, outlinks are processed in the reduce phase (at least when outlinks are followed). If a fetcher's reducer stalls, you may run out of memory or disk space, usually after a very long reduce job. Behaviour similar to that described in [[http://www.mail-archive.com/user@nutch.apache.org/msg05031.html|this message]] is usually observed in this situation. 
+ 
+ In summary, users are advised '''not''' to use a parsing fetcher where they can avoid it, as it is heavy on I/O and often leads to the outcome described above.
+  
  ==== Is it possible to fetch only pages from some specific domains? ====
  Please have a look at PrefixURLFilter. Adding some regular expressions to the regex-urlfilter.txt file might work, but a list with thousands of regular expressions would slow your system down excessively.
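  
  A minimal sketch of the prefix-based approach, assuming the urlfilter-prefix plugin is enabled via plugin.includes and reads its prefixes from conf/prefix-urlfilter.txt (believed to be the default file name; check your configuration). The two domains are placeholders, not part of the original FAQ:
  {{{
  # conf/prefix-urlfilter.txt
  # One URL prefix per line; URLs that match none of the prefixes are rejected.
  http://www.example.com/
  http://example.org/docs/
  }}}
  Because each URL is checked against a set of plain string prefixes rather than evaluated against every regular expression in turn, a list like this should scale to many domains better than an equally long regex-urlfilter.txt.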