Posted to user@nutch.apache.org by Matt MacDonald <ma...@nearbyfyi.com> on 2012/09/08 19:44:35 UTC
Nutch 2.x trunk, focused domain crawl that contains links with HTTP
redirects pointing to external domains
Hi,
This is likely a question for Ferdy, but input from anyone else would
be welcome. When running a crawl that I expect to stay within a single
domain, I'm seeing the crawler jump out to other domains. I'm using
Nutch 2.x trunk, which includes the following
commit: https://github.com/apache/nutch/commit/c5e2236f36a881ee7fec97aff3baf9bb32b40200
The goal is to perform a focused crawl against a single domain and
prevent the crawler from expanding beyond that domain. I've set the
db.ignore.external.links property to true. I do not want to add a
regex to regex-urlfilter.txt, as I will be adding several thousand
URLs. The domain I am crawling has pages with outlinks that stay
within the domain but then redirect to external domains.
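For what it's worth, my mental model of that property is a plain
per-outlink host comparison. Here is a minimal sketch of the check I
mean (class, method, and example paths are illustrative, not the
actual Nutch source):

import java.net.MalformedURLException;
import java.net.URL;

// Sketch only: the host comparison implied by db.ignore.external.links.
// Names and example paths are illustrative, not the actual Nutch code.
public class ExternalLinkCheck {

    // True if toUrl points at a different host than fromUrl.
    static boolean isExternal(String fromUrl, String toUrl) {
        try {
            String fromHost = new URL(fromUrl).getHost().toLowerCase();
            String toHost = new URL(toUrl).getHost().toLowerCase();
            return !fromHost.equals(toHost);
        } catch (MalformedURLException e) {
            return true; // unparsable URLs are treated as external
        }
    }

    public static void main(String[] args) {
        // Same host: the outlink is kept.
        System.out.println(isExternal("http://www.ci.watertown.ma.us/",
                "http://www.ci.watertown.ma.us/some-page")); // false
        // Different host: dropped when db.ignore.external.links=true.
        System.out.println(isExternal("http://www.ci.watertown.ma.us/",
                "http://www.masshome.com/tourism.html")); // true
    }
}

A redirect is where I suspect this breaks down: the outlink itself
passes the check, but the Location target it resolves to is on another
host. Here's my setup: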
cat urls/seed.txt
http://www.ci.watertown.ma.us/
cat conf/nutch-site.xml
...
<property>
<name>db.ignore.external.links</name>
<value>true</value>
<description>If true, outlinks leading from a page to external hosts
will be ignored. This is an effective way to limit the crawl to include
only initially injected hosts, without creating complex URLFilters.
</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent
problems with the
underlying commons-httpclient library.
</description>
</property>
...
Running
bin/nutch crawl urls -depth 8 -topN 100000
results in the crawl eventually fetching and parsing documents on
domains external to the single URL in the seed.txt file.
I would not expect to see URLs like the following in my logs and in
the HBase webpage table:
fetching http://www.masshome.com/tourism.html
Parsing http://www.disabilityinfo.org/
I'm reviewing the code changes but am still getting up to speed on the
code base. Any ideas while I continue to dig around? Is this a
configuration issue or a code issue?
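If I had to guess, ordinary outlinks get the host check but the
fetcher's redirect handling queues the Location target without it.
Purely as a sketch of the check I'd expect on the redirect path
(illustrative names, not the actual Nutch code or the eventual patch):

import java.net.MalformedURLException;
import java.net.URL;

// Sketch of the suspected gap: apply the same host comparison to a
// redirect target before queueing it. Illustrative only.
public class RedirectHostCheck {

    // Returns the redirect target to follow, or null to drop it.
    static String filterRedirect(String srcUrl, String redirectUrl,
                                 boolean ignoreExternalLinks) {
        if (!ignoreExternalLinks) {
            return redirectUrl; // checking disabled: always follow
        }
        try {
            String srcHost = new URL(srcUrl).getHost().toLowerCase();
            String dstHost = new URL(redirectUrl).getHost().toLowerCase();
            return srcHost.equals(dstHost) ? redirectUrl : null;
        } catch (MalformedURLException e) {
            return null; // unparsable: drop rather than crawl blind
        }
    }
}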
Thanks,
Matt
Re: Nutch 2.x trunk, focused domain crawl that contains links with
HTTP redirects pointing to external domains
Posted by Matt MacDonald <ma...@nearbyfyi.com>.
Hi Lewis,
Wonderful. I didn't see the issue in JIRA, so I've created a new ticket
here: https://issues.apache.org/jira/browse/NUTCH-1468 with a patch
attached. Sorry for the pull request; I was already in GitHub, and it
was easier to send a pull request than to file a ticket.
Thanks,
Matt
Re: Nutch 2.x trunk, focused domain crawl that contains links with
HTTP redirects pointing to external domains
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Matt,
I don't know if you got my message on the GitHub mirror issue.
If you could upload the patch to a new Nutch JIRA ticket (unless one
is already open), I will be very happy to test it; some free time
means I am able to test a few patches right now.
Thanks
Lewis
Re: Nutch 2.x trunk, focused domain crawl that contains links with
HTTP redirects pointing to external domains
Posted by Matt MacDonald <ma...@nearbyfyi.com>.
Hi,
I updated my local checkout with the following change:
https://github.com/apache/nutch/pull/1/files and the crawler is now
adhering to the db.ignore.external.links property when it encounters
local redirects that point to external domains.
Thanks,
Matt