Posted to user@nutch.apache.org by "Insurance Squared Inc." <gc...@insurancesquared.com> on 2006/04/15 15:01:15 UTC

redirect treatment

How are redirects listed in version 0.7?  If the crawler finds a link like:
www.domain.com/?code.aspx&redirect=445454
and that link redirects through to www.another-domain.com, which of 
those two links will show up in nutch?

(I'm wondering if I can use nutch to crawl sites with a lot of 
redirects, and still end up with the correct redirected domain in the 
listings).


Re: redirect treatment

Posted by Andrzej Bialecki <ab...@getopt.org>.
Dennis Kubes wrote:
> The meta-refresh was working in 0.7.2 but is broken in 0.8.  Andrzej 
> Bialecki said he was looking into fixing it.  Hope this helps you 
> understand what is happening with the fetch.

It should work again, please test revisions later than r393297. (Note: 
I'm away till 24th, so I may not respond before that).

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: redirect treatment

Posted by Dennis Kubes <nu...@dragonflymc.com>.
There are three kinds of "redirects".  One is where the server, behind 
the scenes, forwards to a different page and returns its output.  This is 
usually called a forward.  Two is where the server sends a redirect code 
(a status in the 300 range).  The browser then requests the page it was 
redirected to.  This is usually called a protocol redirect, or just a 
redirect in JSP and ASP terms.  Three is where the page has a 
meta-refresh tag in its HTML head.  This is known as a content redirect 
or a meta redirect.  Here the client doesn't get a redirect code in the 
response header, but after a certain amount of time it will request the 
page named in the url section of the meta-refresh tag.
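
To make the three cases concrete, here is a rough sketch of how a client 
could tell them apart from a single response.  It is illustration only, 
not Nutch code: the function name, the plain dict of headers with a 
"Location" key, and the simplified regular expressions are all 
assumptions made for the example.  A forward cannot be detected at all, 
because the client only ever sees a normal response.

```python
import re
from typing import Mapping, Optional, Tuple

# Hypothetical, simplified patterns; real HTML parsing is more forgiving.
_META_REFRESH = re.compile(
    r'<meta[^>]*http-equiv\s*=\s*["\']?refresh["\']?[^>]*>',
    re.IGNORECASE,
)
_CONTENT_URL = re.compile(
    r'content\s*=\s*["\']\s*\d+\s*;\s*url\s*=\s*([^"\'>\s]+)',
    re.IGNORECASE,
)


def classify_redirect(
    status: int, headers: Mapping[str, str], body: str
) -> Tuple[str, Optional[str]]:
    """Return (kind, target) where kind is 'protocol', 'meta' or 'none'."""
    # Protocol redirect: a 3xx status plus a Location header.
    if 300 <= status < 400 and "Location" in headers:
        return ("protocol", headers["Location"])
    # Meta redirect: a normal response whose HTML carries a refresh tag.
    tag = _META_REFRESH.search(body)
    if tag:
        target = _CONTENT_URL.search(tag.group(0))
        if target:
            return ("meta", target.group(1))
    # Server-side forward or ordinary page: indistinguishable to the client.
    return ("none", None)
```

Called with a 302 status and a Location header it reports a protocol 
redirect; called with a 200 status and a body containing a meta-refresh 
tag it reports a meta redirect; anything else, including a forward, 
comes back as "none".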

If (www.domain.com/?code.asp&redirect=444) sends a forward then nutch 
doesn't know anything about it and will just index the content returned 
under the original url.  If it sends a protocol redirect, then nutch 
goes and requests the new page and will index the new page under the new 
url.  Nutch will follow redirects up to http.redirect.max times.  So if 
the redirect page redirects again Nutch will follow that one as well up 
to the max times.  If the url variable "redirect" is used to populate a 
meta-refresh tag then, as of right now, Nutch won't follow the redirect.  
I think it currently fails with a NullPointerException.
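
The "follow up to http.redirect.max times" behaviour for protocol 
redirects can be sketched roughly as below.  This is an assumption-laden 
illustration rather than Nutch's actual fetcher: the resolve function, 
the fetch callback returning a (status, location) pair, and the sample 
URLs are all made up for the example.

```python
from typing import Callable, Optional, Tuple

# Hypothetical fetch callback: takes a URL, returns (status, location).
Fetch = Callable[[str], Tuple[int, Optional[str]]]


def resolve(url: str, fetch: Fetch, redirect_max: int) -> Optional[str]:
    """Follow protocol redirects and return the final URL.

    Returns None if more than redirect_max redirects would be needed.
    """
    hops = 0
    while True:
        status, location = fetch(url)
        # Not a redirect: this is the URL the content belongs to.
        if not (300 <= status < 400 and location):
            return url
        # Another hop would exceed the limit, so give up.
        if hops >= redirect_max:
            return None
        url = location
        hops += 1


# Simulated responses standing in for real HTTP requests.
pages = {
    "http://www.domain.com/?code.asp&redirect=444":
        (302, "http://www.another-domain.com/"),
    "http://www.another-domain.com/": (200, None),
}
final_url = resolve(
    "http://www.domain.com/?code.asp&redirect=444", pages.__getitem__, 3
)
```

With the simulated responses above, final_url ends up as the 
another-domain.com address, which matches the description of the page 
being indexed under the new url.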

The meta-refresh was working in 0.7.2 but is broken in 0.8.  Andrzej 
Bialecki said he was looking into fixing it.  Hope this helps you 
understand what is happening with the fetch.

Dennis


Re: redirect treatment

Posted by "Insurance Squared Inc." <gc...@insurancesquared.com>.
Perhaps a point of clarification - I'm assuming that the 
www.domain.com/?code.asp&redirect=444 actually sends a redirect header 
to the new page.  In that case (I don't know enough about protocols 
personally to be sure) it seems that nutch would have to recognize that 
it's being redirected and refetch at the new location.  Am I correct?  
And if so, wouldn't nutch then index and display the new, redirected page? 

I'm using version .7 btw.

thanks,
Glenn



Re: redirect treatment

Posted by Dennis Kubes <nu...@dragonflymc.com>.
Protocol level redirects (asp redirects), meaning the server sends a 
redirect response 3xx code, work correctly in Nutch 0.8 dev.  It 
processes it as a completely new page.  If you are doing asp forwards I 
believe that the original page 
(www.domain.com/?code.aspx&redirect=445454) would be the URL that shows 
up in the search because Nutch doesn't know what is going on behind the 
scenes in the ASP code.  It only knows the URL and the content received.

As of right now in 0.8 dev meta level redirects (meta refresh tags) don't 
work correctly.  They did in 0.7 but I don't think that functionality 
has been ported.

Dennis
