Posted to user@nutch.apache.org by Oleg Mürk <ol...@gmail.com> on 2011/09/21 17:21:43 UTC

Nutch redirect handling problem

Hello,

When I fetch the following links with nutch 1.3:
  http://blog.mises.org/archives/010450.asp
  http://feedproxy.google.com/~r/readwriteweb/~3/frC1ndi7-V8/google_docs_goes_back_to_schoo.php
and
  http.redirect.max = 2
The first of these links is fetched OK, including the two redirects:
  http://blog.mises.org/?p=010450
  http://blog.mises.org/10450/what-the-bubble-did-to-technology/
However, for the second link (feedproxy.google.com), the redirects are
not followed during the fetch. Both redirects are "301 Moved Permanently".

Could somebody suggest what is causing this behavior? I am
using the default settings plus http.agent.name and http.robots.agents.
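
For reference, these properties sit in my nutch-site.xml roughly as below
(the agent values here are placeholders, not my real ones):

  <property>
    <name>http.redirect.max</name>
    <value>2</value>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>MyCrawler</value>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>MyCrawler,*</value>
  </property>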

Further, if I update the crawldb with the results of the fetch and
then generate a new segment, the link
   http://www.readwriteweb.com/archives/google_docs_goes_back_to_schoo.php?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+readwriteweb+%28ReadWriteWeb%29
which is redirected from
   http://feedproxy.google.com/~r/readwriteweb/~3/frC1ndi7-V8/google_docs_goes_back_to_schoo.php
is never added to the new segment.
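
The update/generate steps I run look roughly like this (the paths are just
an example of my layout):

  bin/nutch updatedb crawl/crawldb crawl/segments/20110921123456
  bin/nutch generate crawl/crawldb crawl/segments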

What am I doing wrong? :)

Thank You!
Oleg Mürk

Re: Nutch redirect handling problem

Posted by Oleg Mürk <ol...@gmail.com>.
Hi Markus,

On Wed, Sep 21, 2011 at 6:28 PM, Markus Jelsma
<ma...@openindex.io> wrote:
> Check your URL filters. It's most likely thrown away.

Thanks, the problem was solved by editing "regex-urlfilter.txt".
Apparently the fetcher also applies it to redirect targets.
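
For anyone hitting the same thing: the usual suspect (and a likely culprit
here, given the utm_ parameters on the readwriteweb.com URL) is the stock
rule that skips probable query URLs, something along these lines:

  # skip URLs containing certain characters as probable queries, etc.
  -[?*!@=]

Relaxing that rule, or adding an explicit accept line above it such as

  +^http://www\.readwriteweb\.com/

lets such redirect targets through.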

On a related note, the fetcher also filters redirects according to the
"db.ignore.external.links" parameter, which is not obvious from its name.
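
As far as I can tell, anyone who wants cross-host redirects (like
feedproxy.google.com -> readwriteweb.com) to be followed needs that
property left at its default:

  <property>
    <name>db.ignore.external.links</name>
    <value>false</value>
  </property>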

Thanks again!
Oleg

Re: Nutch redirect handling problem

Posted by Markus Jelsma <ma...@openindex.io>.

On Wednesday 21 September 2011 17:21:43 Oleg Mürk wrote:
> Hello,
> 
> When I fetch the following links with nutch 1.3:
>   http://blog.mises.org/archives/010450.asp
>   http://feedproxy.google.com/~r/readwriteweb/~3/frC1ndi7-V8/google_docs_goes_back_to_schoo.php
> and
>   http.redirect.max = 2
> The first of these links is fetched OK, including the two redirects:
>   http://blog.mises.org/?p=010450
>   http://blog.mises.org/10450/what-the-bubble-did-to-technology/
> However, for the second link (feedproxy.google.com), the redirects are
> not followed during the fetch. Both redirects are "301 Moved Permanently".
> 
> Could somebody suggest what is causing this behavior? I am
> using the default settings plus http.agent.name and http.robots.agents.
> 
> Further, if I update the crawldb with the results of the fetch and
> then generate a new segment, the link
>    http://www.readwriteweb.com/archives/google_docs_goes_back_to_schoo.php?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+readwriteweb+%28ReadWriteWeb%29
> which is redirected from
>    http://feedproxy.google.com/~r/readwriteweb/~3/frC1ndi7-V8/google_docs_goes_back_to_schoo.php
> is never added to the new segment.

Check your URL filters. It's most likely thrown away.
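
One way to see which URLs survive the configured filters is to pipe them
through the filter checker; something along these lines should work (the
exact class name and flags may differ between versions):

  echo "http://www.readwriteweb.com/archives/google_docs_goes_back_to_schoo.php?utm_source=feedburner" \
    | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined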

> 
> What am I doing wrong? :)
> 
> Thank You!
> Oleg Mürk

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350