Posted to user@nutch.apache.org by Joe Hansome <jo...@demandjump.com> on 2016/03/30 17:08:11 UTC

Question regarding fetcher.follow.outlinks.ignore.external

Hi,

I'm attempting to crawl several sites and would like the crawler to stay
within each site, while still storing all outlinks, both internal and
external.  I'm currently running version 1.11 RC2 with the following
properties set in my nutch-site.xml:

  <property>
    <name>db.ignore.internal.links</name>
    <value>false</value>
  </property>

  <property>
    <name>db.ignore.external.links</name>
    <value>false</value>
  </property>

  <property>
    <name>fetcher.follow.outlinks.ignore.external</name>
    <value>true</value>
  </property>

The problem is that external pages (judged either byHost or byDomain) are
being fetched anyway.  After taking a look at the Nutch source, it appears
the property fetcher.follow.outlinks.ignore.external is only read from
FetcherThread.java.  I'm running fetching and parsing as separate steps,
keeping the default fetcher.parse value of false.
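
For context, each cycle of my crawl looks roughly like the following (the
crawl directory, segment path, and -topN value are just placeholders for my
actual setup):

  # one crawl cycle, with fetching and parsing as separate steps
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s=$(ls -d crawl/segments/* | tail -1)   # newest segment
  bin/nutch fetch $s                      # fetch only (fetcher.parse=false)
  bin/nutch parse $s                      # parse in a separate step
  bin/nutch updatedb crawl/crawldb $s     # update crawldb with parsed outlinks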

Is the property fetcher.follow.outlinks.ignore.external disregarded in this
case?  Is it only effective when the fetching and parsing steps are combined?

Thanks in advance for your help.

Joe