Posted to user@nutch.apache.org by Matt MacDonald <ma...@nearbyfyi.com> on 2012/09/08 19:44:35 UTC

Nutch 2.x trunk, focused domain crawl that contains links with HTTP redirects pointing to external domains

Hi,

This is likely a question for Ferdy, but if anyone else has input, that'd
be great. When running a crawl that I would expect to be contained to a
single domain, I'm seeing the crawler jump out to other domains. I'm using
the trunk of Nutch 2.x, which includes the following commit:
https://github.com/apache/nutch/commit/c5e2236f36a881ee7fec97aff3baf9bb32b40200

The goal is to perform a focused crawl against a single domain and prevent
the crawler from expanding beyond that domain. I've set the
db.ignore.external.links property to true. I do not want to add a regex to
regex-urlfilter.txt, as I will be adding several thousand URLs. The domain
that I am crawling has documents with outlinks that stay within the domain
but then redirect to external domains.
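
To make the failure mode concrete, here is a small stand-alone probe that
shows what such a page does (the path below is only a placeholder, not a
real page on the site; any link that answers with an HTTP 3xx and an
off-site Location header behaves this way):

import java.net.HttpURLConnection;
import java.net.URL;

// Stand-alone probe for a same-host link that redirects off-site.
// The path is a placeholder; substitute any page on the crawled site
// that issues an HTTP 3xx pointing at an external domain.
public class RedirectProbe {
    public static void main(String[] args) throws Exception {
        URL link = new URL("http://www.ci.watertown.ma.us/some-local-link");
        HttpURLConnection conn = (HttpURLConnection) link.openConnection();
        conn.setInstanceFollowRedirects(false); // keep the raw 3xx instead of following it
        conn.setRequestMethod("HEAD");
        int status = conn.getResponseCode();
        String location = conn.getHeaderField("Location");
        conn.disconnect();
        // Expectation here: a 3xx status and a Location header whose host
        // is not www.ci.watertown.ma.us.
        System.out.println(status + " -> " + location);
    }
}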

cat urls/seed.txt
http://www.ci.watertown.ma.us/

cat conf/nutch-site.xml
...
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <description>If true, outlinks leading from a page to external hosts
    will be ignored. This is an effective way to limit the crawl to include
    only initially injected hosts, without creating complex URLFilters.
    </description>
  </property>

  <property>
    <name>plugin.includes</name>
   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
   <description>Regular expression naming plugin directory names to
    include.  Any plugin not matching this expression is excluded.
    In any case you need at least include the nutch-extensionpoints plugin. By
    default Nutch includes crawling just HTML and plain text via HTTP,
    and basic indexing and search plugins. In order to use HTTPS please enable
    protocol-httpclient, but be aware of possible intermittent problems with the
    underlying commons-httpclient library.
    </description>
  </property>
...

Running
bin/nutch crawl urls -depth 8 -topN 100000

results in the crawl eventually fetching and parsing documents on domains
external to the only link in the seed.txt file.

I would not expect to see URLs like the following in my logs and in the
HBase webpage table:

fetching http://www.masshome.com/tourism.html
Parsing http://www.disabilityinfo.org/

I'm reviewing the code changes but am still getting up to speed on the
code base. Any ideas while I continue to dig around? Is this a
configuration issue or a code issue?

Thanks,
Matt

Re: Nutch 2.x trunk, focused domain crawl that contains links with HTTP redirects pointing to external domains

Posted by Matt MacDonald <ma...@nearbyfyi.com>.
Hi Lewis,

Wonderful. I didn't see the issue in JIRA, so I've created a new ticket
here: https://issues.apache.org/jira/browse/NUTCH-1468 with a patch
attached. Sorry for the pull request; I was just in GitHub and it was
easier to send a pull request than to file a ticket.

Thanks,
Matt


Re: Nutch 2.x trunk, focused domain crawl that contains links with HTTP redirects pointing to external domains

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Matt,

I don't know if you got my message on the GitHub mirror issue.
If you could upload the patch to a new Nutch JIRA ticket (unless one is
already open), I will be very happy to test it, as some free time means
I am now able to test a few patches.

Thanks

Lewis

-- 
Lewis

Re: Nutch 2.x trunk, focused domain crawl that contains links with HTTP redirects pointing to external domains

Posted by Matt MacDonald <ma...@nearbyfyi.com>.
Hi,

I updated my local checkout with the changes from
https://github.com/apache/nutch/pull/1/files, and the crawler is now
adhering to the db.ignore.external.links property when it encounters
local redirects that point to external domains.
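
For anyone finding this thread later: conceptually, the fix just applies
the same host comparison to a redirect's Location target that
db.ignore.external.links already applies to ordinary outlinks. A rough
sketch of that idea (the class and method names below are made up for
illustration; this is not the actual patch):

import java.net.MalformedURLException;
import java.net.URL;

// Illustration only -- not the actual NUTCH-1468 patch. Shows the idea of
// honouring db.ignore.external.links for redirect targets as well as outlinks.
class RedirectHostFilter {
    private final boolean ignoreExternalLinks; // value of db.ignore.external.links

    RedirectHostFilter(boolean ignoreExternalLinks) {
        this.ignoreExternalLinks = ignoreExternalLinks;
    }

    /** Returns the redirect target if it may be followed, or null to drop it. */
    String filter(String originalUrl, String redirectUrl) {
        if (!ignoreExternalLinks) {
            return redirectUrl;
        }
        try {
            String fromHost = new URL(originalUrl).getHost();
            String toHost = new URL(redirectUrl).getHost();
            return fromHost.equalsIgnoreCase(toHost) ? redirectUrl : null;
        } catch (MalformedURLException e) {
            return null; // unparsable target: safest to drop it
        }
    }
}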

Thanks,
Matt
