Posted to dev@nutch.apache.org by George Herlin <gh...@gmail.com> on 2009/04/01 12:25:57 UTC

Infinite loop bug in Nutch 0.9

Hello, there.

I believe I may have found an infinite loop in Nutch 0.9.

It happens when a site has a page that refers to itself through a
redirection.

The code in Fetcher.run(), around line 200 - sorry, my Fetcher has been
modified a little, so line numbers may vary - says, for that case:

output(url, new CrawlDatum(), null, null, CrawlDatum.STATUS_LINKED);

What that does is insert an extra (empty) crawl datum for the new URL,
with a re-fetch interval of 0.0.

However (see Generator.Selector.map(), particularly lines 144-145), the
non-refetch condition used seems to be last-fetch + refetch-interval > now,
which is always false if refetch-interval == 0.0!
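
In rough pseudo-Java, the test as I read it boils down to something like
this (a paraphrase with made-up names, not the actual Generator source; in
0.9 the interval is a float in days):

// Paraphrase of the non-refetch test described above (hypothetical names).
static boolean shouldSkip(long lastFetchTime, float fetchIntervalDays, long now) {
  long intervalMs = (long) (fetchIntervalDays * 24L * 60L * 60L * 1000L);
  // With fetchIntervalDays == 0.0 this only holds if lastFetchTime > now,
  // which never happens for a page that has already been fetched, so the
  // URL is generated again on every cycle.
  return lastFetchTime + intervalMs > now;
}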

Now, if there is a new link to the new URL in that page, that crawl datum is
re-used, and the whole thing loops indefinitely.

I've fixed that for myself by replacing the quoted line (in both places) with:

output(url, new CrawlDatum(CrawlDatum.STATUS_LINKED, 30f), null, null,
CrawlDatum.STATUS_LINKED);

and that works. (By the way, the 30f should really be the value of
"db.default.fetch.interval", but I haven't the time right now to work
through the issues.) In reality, if I am right in analysing the algorithm,
the default constructor and the appropriate updater method should always
enforce a positive refetch interval.
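
For illustration only, something along these lines should do it (a rough
sketch, assuming the fetcher thread keeps the job's Hadoop Configuration in
a field called conf):

// Sketch: read the default interval (in days, as in 0.9) from the job
// configuration instead of hard-coding 30 days.
float defaultInterval = conf.getFloat("db.default.fetch.interval", 30f);
output(url, new CrawlDatum(CrawlDatum.STATUS_LINKED, defaultInterval),
    null, null, CrawlDatum.STATUS_LINKED);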

Of course, another method could be used to remove this self-reference, but
that could be complicated, as the self-reference may happen through a loop
of 2 or more pages (you know what I mean).

Has that been fixed already, and by what method?

Best regards

George Herlin

Re: Infinite loop bug in Nutch 0.9

Posted by Doğacan Güney <do...@gmail.com>.
On Wed, Apr 1, 2009 at 13:29, George Herlin <gh...@gmail.com> wrote:

> Sorry, forgot to say, there is an added precondition to causing the bug:
>
> The redirection has to be fetched before the page it redirects to... if
> not, there will be a pre-existing crawl datum with a reasonable
> refetch-interval.
>

Maybe this is something that was fixed between 0.9 and 1.0, but I think
CrawlDbReducer fixes these datums, around line 147 (case
CrawlDatum.STATUS_LINKED). Have you ever actually got stuck in an infinite
loop because of it?
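
I don't have the source in front of me, but the kind of normalization I
have in mind looks roughly like this (a hypothetical sketch, not the actual
CrawlDbReducer code; "result" and "conf" stand for the merged datum and the
job configuration):

// Hypothetical sketch: when folding a LINKED datum into the crawl db,
// never let it keep a zero refetch interval.
if (result.getFetchInterval() <= 0.0f) {
  result.setFetchInterval(conf.getFloat("db.default.fetch.interval", 30f));
}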


-- 
Doğacan Güney

Re: Infinite loop bug in Nutch 0.9

Posted by George Herlin <gh...@gmail.com>.
Sorry, forgot to say, there is an added precondition to causing the bug:

The redirection has to be fetched before the page it redirects to... if not,
there will be a pre-existing crawl datum with a reasonable
refetch-interval.

