Posted to user@nutch.apache.org by devang pandey <de...@gmail.com> on 2013/07/11 11:48:40 UTC

nutch redirection issue

Hello,

I am a bit new to Nutch. I am crawling a URL which redirects to another URL.
When analysing my crawl results, I get the content of the first URL along
with the status code: temp redirected to (second url name). My question is:
why am I not getting the content and details of the second URL? Please help.

Re: nutch redirection issue

Posted by Rerngvit Yanggratoke <re...@gmail.com>.

Could you be a bit more specific? For example:
What version of Nutch are you running?
On which system? Linux, Windows, which version?
How do you run the crawl? Which class or command do you use to execute it?
Did you specify the depth of the crawl?

Re: nutch redirection issue

Posted by Sebastian Nagel <wa...@googlemail.com>.

> If I remember correctly, there used to be a setting that would have Nutch
> follow the redirect instead of storing it as a new url, but I can't seem to
> find it at the moment.

The property is:

<property>
  <name>http.redirect.max</name>
  <value>0</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>
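
As a rough sketch of how this could be changed: assuming the usual Nutch 1.x
layout, where local overrides of nutch-default.xml go into
conf/nutch-site.xml, a positive value should make the fetcher follow the
redirect right away instead of recording it for a later round. The value 3
below is only an illustrative choice, not something taken from this thread.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides of nutch-default.xml (assumed standard layout) -->
<configuration>
  <property>
    <name>http.redirect.max</name>
    <!-- 3 is an arbitrary example value; any positive number enables immediate following -->
    <value>3</value>
  </property>
</configuration>
```

With the default of 0, the redirect target is only fetched in a subsequent
generate/fetch round, which matches the behaviour described in the question.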

> Have you done another crawl?  By default, Nutch puts the redirect into the
> database as a new url to be crawled.  So you will find the content under
> the location of the redirect.

Sometimes you'll find the content of the redirect target indexed under the
source URL. In general, if the source (e.g. www.asdf.net) is clearly simpler
than the target (e.g. www.asdf.net/page/index.asp?page=main), the source is
given precedence. For details, see URLUtil.chooseRepr().

Re: nutch redirection issue

Posted by Bai Shen <ba...@gmail.com>.

Have you done another crawl?  By default, Nutch puts the redirect into the
database as a new url to be crawled.  So you will find the content under
the location of the redirect.

If I remember correctly, there used to be a setting that would have Nutch
follow the redirect instead of storing it as a new url, but I can't seem to
find it at the moment.
