Posted to user@nutch.apache.org by Larsson85 <kr...@hotmail.com> on 2009/06/12 12:35:05 UTC

Make nutch follow redirections

When I dump my segments, I often find entries that look like the
following:

<HTML><HEAD>
<TITLE>301 Moved Permanently</TITLE>
</HEAD><BODY>
<H1>Moved Permanently</H1> 

I suppose this means that the page wants to redirect. How can I make
Nutch follow the redirect and crawl the target page instead?

It's not just one or two pages that look like this; it happens very frequently.


Re: Make nutch follow redirections

Posted by Dennis Kubes <ku...@apache.org>.
Set the http.redirect.max property in nutch-site.xml to a value greater
than 0, usually around 3. The default is 0, so the fetcher won't follow
redirects immediately; it records them for later fetching instead.

Dennis

<property>
   <name>http.redirect.max</name>
   <value>0</value>
   <description>The maximum number of redirects the fetcher will follow when
   trying to fetch a page. If set to negative or 0, fetcher won't immediately
   follow redirected URLs, instead it will record them for later fetching.
   </description>
</property>
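
For illustration, the override suggested above might look like the following in conf/nutch-site.xml. This is only a sketch: it assumes the usual <configuration> root element that Nutch config files use, and the value 3 is just the "around 3" suggestion from this reply, not a required setting.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides of the shipped defaults -->
<configuration>
  <property>
    <name>http.redirect.max</name>
    <!-- Assumed value: follow up to 3 redirects per fetch; tune as needed -->
    <value>3</value>
  </property>
</configuration>
```

Pages that were already fetched with the default setting would presumably need to be fetched again before the redirect targets show up in a segment dump.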
