Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2013/02/07 03:49:59 UTC
[Nutch Wiki] Trivial Update of "FAQ" by LewisJohnMcgibbney
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "FAQ" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/FAQ?action=diff&rev1=135&rev2=136
Comment:
a
URLs that are already in the database won't be injected.
=== Fetching ===
+
+ ==== Can I parse during the fetching process? ====
+ In short, yes; however, this is disabled by default (justification follows shortly). To enable it, configure the following in nutch-site.xml before initiating the fetch process.
+ {{{
+ <property>
+ <name>fetcher.parse</name>
+ <value>true</value>
+ <description>If true, fetcher will parse content. Default is false, which means
+ that a separate parsing step is required after fetching is finished.</description>
+ </property>
+ }}}
+
+ '''N.B.''' In a parsing fetcher, outlinks are processed in the reducer (at least when outlinks are followed). If a fetcher's reducer stalls, you may run out of memory or disk space, usually after a very long reduce job. Behaviour similar to [[http://www.mail-archive.com/user@nutch.apache.org/msg05031.html|this]] is usually observed in this situation.
+
+ In summary, users are advised, where possible, '''not''' to use a parsing fetcher, as it is heavy on I/O and often leads to the above outcome.
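+
+ As a rough sketch of the alternative, with fetcher.parse left at its default of false, fetching and parsing run as two separate steps against the same segment. The segment path below is a hypothetical placeholder, and the exact arguments may differ between Nutch versions.
+ {{{
+ # Assumption: crawl/segments/SEGMENT_DIR stands in for a real segment directory.
+ # Fetch only; no parsing happens in this job.
+ bin/nutch fetch crawl/segments/SEGMENT_DIR
+ # Parse the fetched content as a separate job.
+ bin/nutch parse crawl/segments/SEGMENT_DIR
+ }}}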
+
==== Is it possible to fetch only pages from some specific domains? ====
Please have a look at PrefixURLFilter. Adding some regular expressions to the regex-urlfilter.txt file might work, but a list with thousands of regular expressions would slow down your system excessively.
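
As a minimal sketch, assuming the urlfilter-prefix plugin is enabled through plugin.includes and reads its prefixes from conf/prefix-urlfilter.txt, the file would list one accepted URL prefix per line. The hosts below are hypothetical examples, not part of the original FAQ.
{{{
# prefix-urlfilter.txt (hypothetical entries)
# URLs starting with one of these prefixes pass the filter.
http://www.example.org/
http://docs.example.com/
}}}
Because this is plain prefix matching rather than regular-expression evaluation, a long list of domains should be much cheaper to check than the same list written as regular expressions.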