Posted to user@nutch.apache.org by Larsson85 <kr...@hotmail.com> on 2009/06/12 12:35:05 UTC
Make nutch follow redirections
When I dump my segments, I often find entries that look like the
following:
<HTML><HEAD>
<TITLE>301 Moved Permanently</TITLE>
</HEAD><BODY>
<H1>Moved Permanently</H1>
I suppose this means that the page wants to redirect. How can I make
Nutch follow that redirection and crawl the target page instead?
It's not just one or two pages that look like this; it happens very frequently.
Re: Make nutch follow redirections
Posted by Dennis Kubes <ku...@apache.org>.
Set the http.redirect.max property in nutch-site.xml to a value greater than 0,
usually around 3. The default is 0, so the fetcher won't follow redirects
immediately. This is the entry from the default configuration:
Dennis
<property>
  <name>http.redirect.max</name>
  <value>0</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>
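As a rough sketch of the override, assuming the usual conf/nutch-site.xml layout with a <configuration> root element, the entry might look like this. The value 3 is only the ballpark suggested above, not a required setting, and the description text is illustrative wording.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: site-specific overrides of the default settings -->
<configuration>
  <property>
    <name>http.redirect.max</name>
    <!-- Assumed value: follow up to 3 redirects immediately during a fetch -->
    <value>3</value>
    <description>Follow up to 3 redirects immediately instead of recording
    them for later fetching.
    </description>
  </property>
</configuration>
```

With a positive value the fetcher should follow a 301 to its target within the same fetch. With the default of 0 the redirect target is only recorded for a later fetch.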
Larsson85 wrote:
> When I dump my segments, I often find entries that look like the
> following:
>
> <HTML><HEAD>
> <TITLE>301 Moved Permanently</TITLE>
> </HEAD><BODY>
> <H1>Moved Permanently</H1>
>
> I suppose this means that the page wants to redirect. How can I make
> Nutch follow that redirection and crawl the target page instead?
>
> It's not just one or two pages that look like this; it happens very frequently.