Posted to dev@nutch.apache.org by "Hudson (JIRA)" <ji...@apache.org> on 2014/09/05 07:58:26 UTC

[jira] [Commented] (NUTCH-1468) Redirects that are external links not adhering to db.ignore.external.links

    [ https://issues.apache.org/jira/browse/NUTCH-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122484#comment-14122484 ] 

Hudson commented on NUTCH-1468:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #2768 (See [https://builds.apache.org/job/Nutch-trunk/2768/])
Whitespace change to tickle Github PR closer for NUTCH-1468: this closes #1. (mattmann: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1622623)
* /nutch/trunk/CHANGES.txt


> Redirects that are external links not adhering to db.ignore.external.links
> --------------------------------------------------------------------------
>
>                 Key: NUTCH-1468
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1468
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 2.1
>            Reporter: Matt MacDonald
>             Fix For: 2.1
>
>         Attachments: redirects-to-external.patch
>
>
> Patch attached for this.
> Hi,
> Likely this is a question for Ferdy, but if anyone else has input
> that'd be great. When running a crawl that I would expect to be
> contained within a single domain, I'm seeing the crawler jump out to
> other domains. I'm using the trunk of Nutch 2.x, which includes the
> following commit:
> https://github.com/apache/nutch/commit/c5e2236f36a881ee7fec97aff3baf9bb32b40200
> The goal is to perform a focused crawl against a single domain and
> restrict the crawler from expanding beyond that domain. I've set the
> db.ignore.external.links property to true. I do not want to add a
> regex to regex-urlfilter.txt, as I will be adding several thousand
> URLs. The domain that I am crawling has documents with outlinks that
> are still within the domain but then redirect to external domains.
> cat urls/seed.txt
> http://www.ci.watertown.ma.us/
> cat conf/nutch-site.xml
> ...
>   <property>
>     <name>db.ignore.external.links</name>
>     <value>true</value>
>     <description>If true, outlinks leading from a page to external hosts
>     will be ignored. This is an effective way to limit the crawl to include
>     only initially injected hosts, without creating complex URLFilters.
>     </description>
>   </property>
>   <property>
>     <name>plugin.includes</name>
>     <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
>     <description>Regular expression naming plugin directory names to
>     include.  Any plugin not matching this expression is excluded.
>     In any case you need at least include the nutch-extensionpoints plugin. By
>     default Nutch includes crawling just HTML and plain text via HTTP,
>     and basic indexing and search plugins. In order to use HTTPS please enable
>     protocol-httpclient, but be aware of possible intermittent
>     problems with the underlying commons-httpclient library.
>     </description>
>   </property>
> ...
> Running
> bin/nutch crawl urls -depth 8 -topN 100000
> results in the crawl eventually fetching and parsing documents on
> domains external to the only link in the seed.txt file.
> I would not expect to see URLs like the following in my logs and in
> the HBase webpage table:
> fetching http://www.masshome.com/tourism.html
> Parsing http://www.disabilityinfo.org/
> I'm reviewing the code changes but am still getting up to speed on the
> code base. Any ideas while I continue to dig around? Configuration
> issue or code?
> Thanks,
> Matt
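
For readers of the archive: the behavior Matt reports comes down to redirect
targets bypassing the external-host check that db.ignore.external.links
applies to ordinary outlinks. The following minimal, self-contained Java
sketch illustrates the kind of host comparison one would expect on redirect
targets; the class and method names (RedirectFilter, acceptRedirect,
isSameHost) are hypothetical illustrations, not Nutch's actual fetcher API.

import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch, not Nutch code: decide whether a redirect target may
// be followed when db.ignore.external.links is in effect.
public class RedirectFilter {

    private final boolean ignoreExternalLinks;

    public RedirectFilter(boolean ignoreExternalLinks) {
        this.ignoreExternalLinks = ignoreExternalLinks;
    }

    // True when both URLs resolve to the same host (case-insensitive).
    static boolean isSameHost(String src, String dest) throws MalformedURLException {
        return new URL(src).getHost().equalsIgnoreCase(new URL(dest).getHost());
    }

    // With ignoreExternalLinks enabled, a redirect to a different host is
    // dropped -- the behavior the reporter expected but did not observe.
    public boolean acceptRedirect(String fetchUrl, String redirectUrl) {
        if (!ignoreExternalLinks) {
            return true;
        }
        try {
            return isSameHost(fetchUrl, redirectUrl);
        } catch (MalformedURLException e) {
            return false; // unparsable redirect targets are dropped as well
        }
    }

    public static void main(String[] args) {
        RedirectFilter filter = new RedirectFilter(true);
        // An in-domain page redirecting to an external host, as in the report:
        System.out.println(filter.acceptRedirect(
            "http://www.ci.watertown.ma.us/somepage",
            "http://www.masshome.com/tourism.html"));   // false -> not fetched
        // A redirect that stays on the seed host:
        System.out.println(filter.acceptRedirect(
            "http://www.ci.watertown.ma.us/a",
            "http://www.ci.watertown.ma.us/b"));        // true -> may be followed
    }
}

With a check like this in the fetcher's redirect path, the www.masshome.com
and www.disabilityinfo.org fetches quoted above would be dropped rather than
queued.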


