Posted to user@nutch.apache.org by Marko Asplund <ma...@gmail.com> on 2015/03/18 11:02:32 UTC

Problems with redirect handling: redirect count exceeded

Hi,

I'm a newbie having trouble getting Nutch 1.9 to crawl a site that does an
HTTP 301 redirect from http/80 to https/443.
The Nutch fetch job issues the following message:

redirect count exceeded http://www.foo.com/

and it seems that nothing actually gets fetched.
I've set the http.redirect.max parameter to 50.

I've only injected one seed URL into Nutch.
The first fetch seems to download something, but the second generate job
doesn't appear to produce a new segment,
since there's only one segment in the crawl DB after running it.

How can I debug this problem?

Is there a way to make Nutch logging more verbose? I've set
http.verbose, but that didn't help.

How can I look at the crawl db and segment data contents (esp. fetch list)?
I'm running Nutch in local mode.

marko

Re: Problems with redirect handling: redirect count exceeded

Posted by Sebastian Nagel <wa...@googlemail.com>.
See also https://issues.apache.org/jira/browse/NUTCH-1939
(it's a bug in Nutch 1.9)

On 03/19/2015 10:10 PM, Sebastian Nagel wrote:
> Hi Marko,
> 
> Even with
>   http.redirect.max == 0
> Nutch follows redirects, but the targets are recorded like ordinary links
> and fetched in the next round(s).
> 
>> The first fetch seems to download something, but the second generate job
>> doesn't appear to produce a new segment,
> Are the redirect targets accepted by the URL filter patterns?
> 
>> How can I look at the crawl db and segment data contents (esp. fetch list)?
>> I'm running Nutch in local mode.
> % bin/nutch readdb ...
> % bin/nutch readseg ...
> Both commands print usage help when called without arguments.
> 
> Best,
> Sebastian
> 


Re: Problems with redirect handling: redirect count exceeded

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Marko,

Even with
  http.redirect.max == 0
Nutch follows redirects, but the targets are recorded like ordinary links
and fetched in the next round(s).

> The first fetch seems to download something, but the second generate job
> doesn't appear to produce a new segment,
Are the redirect targets accepted by the URL filter patterns?
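
A minimal sketch of one way to check this, under some assumptions: Nutch 1.x ships a URLFilterChecker class that reads URLs from standard input and echoes each one with a leading + (accepted) or - (rejected). The https URL below is only a guess at the redirect target, and the TARGET and NUTCH variables and the dry-run fallback are illustrative additions, not part of Nutch.

```shell
#!/bin/sh
# Hypothetical redirect target: the https variant of the seed URL.
TARGET="${TARGET:-https://www.foo.com/}"
# Assumed location of the Nutch launcher, relative to the install directory.
NUTCH="${NUTCH:-bin/nutch}"

check_filters() {
  if [ -x "$NUTCH" ]; then
    # Pass the URL through all configured URL filters combined.
    echo "$1" | "$NUTCH" org.apache.nutch.net.URLFilterChecker -allCombined
  else
    # Launcher not found here: only show what would be executed.
    echo "would run: echo $1 | $NUTCH org.apache.nutch.net.URLFilterChecker -allCombined"
  fi
}

check_filters "$TARGET"
```

If the URL comes back with a leading -, a rule in conf/regex-urlfilter.txt that only admits http:// URLs would be one plausible cause.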

> How can I look at the crawl db and segment data contents (esp. fetch list)?
> I'm running Nutch in local mode.
% bin/nutch readdb ...
% bin/nutch readseg ...
Both commands print usage help when called without arguments.
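
As a rough illustration of the two commands above, the following sketch shows some commonly used invocations. The crawl/crawldb and crawl/segments paths, the segdump output directory, and the wrapper function are assumptions for this example and do not come from the thread.

```shell
#!/bin/sh
# Assumed layout (not from the thread): crawl/crawldb and crawl/segments.
CRAWLDB="${CRAWLDB:-crawl/crawldb}"
SEGMENTS="${SEGMENTS:-crawl/segments}"
NUTCH="${NUTCH:-bin/nutch}"

# Run a Nutch command, or just print it if the launcher is not available here.
nutch() {
  if [ -x "$NUTCH" ]; then "$NUTCH" "$@"; else echo "would run: $NUTCH $*"; fi
}

# Status counts for the whole crawl DB (unfetched, fetched, redirects, ...).
nutch readdb "$CRAWLDB" -stats

# Crawl DB entry for a single URL, here the seed.
nutch readdb "$CRAWLDB" -url http://www.foo.com/

# One-line summary of every segment under the segments directory.
nutch readseg -list -dir "$SEGMENTS"

# Segment names are timestamps, so the last one in sort order is the newest.
SEGMENT="$SEGMENTS/$(ls "$SEGMENTS" 2>/dev/null | sort | tail -n 1)"

# Dump only the fetch list and fetch status, skipping content and parse output.
nutch readseg -dump "$SEGMENT" segdump -nocontent -noparse -noparsedata -noparsetext
```

The -stats output should show whether the redirect target made it into the crawl DB at all, which helps narrow down why the second generate job produces no segment.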

Best,
Sebastian
