Posted to dev@nutch.apache.org by "Matt MacDonald (JIRA)" <ji...@apache.org> on 2012/09/09 13:37:07 UTC
[jira] [Created] (NUTCH-1468) Redirects that are external links not
adhering to db.ignore.external.links
Matt MacDonald created NUTCH-1468:
-------------------------------------
Summary: Redirects that are external links not adhering to db.ignore.external.links
Key: NUTCH-1468
URL: https://issues.apache.org/jira/browse/NUTCH-1468
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 2.1
Reporter: Matt MacDonald
Attachments: redirects-to-external.patch
Patch attached for this.
Hi,
Likely this is a question for Ferdy but if anyone else has input
that'd be great. When running a crawl that I would expect to be
contained to a single domain I'm seeing the crawler jump out to other
domains. I'm using the trunk of Nutch 2.x which includes the following
commit: https://github.com/apache/nutch/commit/c5e2236f36a881ee7fec97aff3baf9bb32b40200
The goal is to perform a focused crawl against a single domain and
restrict the crawler from expanding beyond that domain. I've set the
db.ignore.external.links property to true. I do not want to add a
regex to regex-urlfilter.txt as I will be adding several thousand
urls. The domain that I am crawling has documents with outlinks that
are still within the domain but then redirect to external domains.
cat urls/seed.txt
http://www.ci.watertown.ma.us/
cat conf/nutch-site.xml
...
<property>
<name>db.ignore.external.links</name>
<value>true</value>
<description>If true, outlinks leading from a page to external hosts
will be ignored. This is an effective way to limit the crawl to include
only initially injected hosts, without creating complex URLFilters.
</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent
problems with the
underlying commons-httpclient library.
</description>
</property>
...
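The host comparison implied by db.ignore.external.links can be sketched as follows. This is an illustrative standalone sketch of the idea behind the property, not Nutch's actual implementation (the real check lives in Nutch's fetch/parse code); the class and method names here are made up for the example.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Sketch of the check behind db.ignore.external.links: an outlink is kept
// only when its host matches the host of the page it came from. The bug
// reported here is that a *redirect* target was not subjected to this check.
public class ExternalLinkCheck {
    /** Returns true when toUrl points at the same host as fromUrl. */
    static boolean sameHost(String fromUrl, String toUrl) {
        try {
            String fromHost = new URL(fromUrl).getHost();
            String toHost = new URL(toUrl).getHost();
            return fromHost.equalsIgnoreCase(toHost);
        } catch (MalformedURLException e) {
            return false; // unparsable URLs are dropped
        }
    }

    public static void main(String[] args) {
        // An in-domain outlink is kept:
        System.out.println(sameHost("http://www.ci.watertown.ma.us/",
                                    "http://www.ci.watertown.ma.us/about.html")); // prints "true"
        // A target on another host should be dropped, even when it is
        // reached via a redirect from an in-domain page:
        System.out.println(sameHost("http://www.ci.watertown.ma.us/links.html",
                                    "http://www.masshome.com/tourism.html")); // prints "false"
    }
}
```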
Running
bin/nutch crawl urls -depth 8 -topN 100000
results in the crawl eventually fetching and parsing documents on
domains external to the only link in the seed.txt file.
I would not expect to see urls like the following in my logs and in
the HBase webpage table:
fetching http://www.masshome.com/tourism.html
Parsing http://www.disabilityinfo.org/
I'm reviewing the code changes but am still getting up to speed on the
code base. Any ideas while I continue to dig around? Configuration
issue or code?
Thanks,
Matt
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1468) Redirects that are external links
not adhering to db.ignore.external.links
Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451580#comment-13451580 ]
Lewis John McGibbney commented on NUTCH-1468:
---------------------------------------------
Excellent, thank you Matt. Please give us a bit of time to look into this. In the meantime it would be great if we could also get a trivial JUnit test case for this.
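A trivial self-checking test for the redirect case might look like the sketch below. `isExternalRedirect` is a hypothetical helper, not an existing Nutch method; a real JUnit test would exercise the fetcher's redirect handling directly, but the host-comparison logic under test is the same.

```java
import java.net.URL;

// Self-contained sketch of a test for the reported redirect case.
// isExternalRedirect() is a hypothetical helper for illustration only.
public class RedirectFilterTest {
    /** True when a redirect's Location target is on a different host than the fetch URL. */
    static boolean isExternalRedirect(String fetchUrl, String location) throws Exception {
        return !new URL(fetchUrl).getHost()
                .equalsIgnoreCase(new URL(location).getHost());
    }

    public static void main(String[] args) throws Exception {
        // A redirect within the seed host should be followed:
        assert !isExternalRedirect("http://www.ci.watertown.ma.us/a",
                                   "http://www.ci.watertown.ma.us/b");
        // A redirect to an external host should be dropped when
        // db.ignore.external.links is true, which is what the patch enforces:
        assert isExternalRedirect("http://www.ci.watertown.ma.us/links.html",
                                  "http://www.masshome.com/tourism.html");
        System.out.println("ok");
    }
}
```

Run with assertions enabled (`java -ea RedirectFilterTest`).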
[jira] [Commented] (NUTCH-1468) Redirects that are external links
not adhering to db.ignore.external.links
Posted by "Ferdy Galema (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451897#comment-13451897 ]
Ferdy Galema commented on NUTCH-1468:
-------------------------------------
A nice catch indeed. Looks fine.
I'm +1 for committing this as is. It might be a bit difficult to cover this functionality in a unit test; if you do come up with one, that would be perfect. Otherwise I'll just commit this in a couple of days/weeks.
[jira] [Commented] (NUTCH-1468) Redirects that are external links
not adhering to db.ignore.external.links
Posted by "Matt MacDonald (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451583#comment-13451583 ]
Matt MacDonald commented on NUTCH-1468:
---------------------------------------
I'll see what I can do on the JUnit side. It's been about 7 years since I last read or wrote Java, so pulling together a JUnit test might take me a while. Likely not this weekend.
Thanks,
Matt
[jira] [Updated] (NUTCH-1468) Redirects that are external links not
adhering to db.ignore.external.links
Posted by "Matt MacDonald (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matt MacDonald updated NUTCH-1468:
----------------------------------
Attachment: redirects-to-external.patch
[jira] [Resolved] (NUTCH-1468) Redirects that are external links
not adhering to db.ignore.external.links
Posted by "Ferdy Galema (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ferdy Galema resolved NUTCH-1468.
---------------------------------
Resolution: Fixed
Fix Version/s: 2.1
Committed @ Nutch2.x ref 1386526
Thanks for the patch.