Posted to dev@nutch.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2006/05/26 22:28:31 UTC

[jira] Commented: (NUTCH-273) When a page is redirected, the original url is NOT updated.

    [ http://issues.apache.org/jira/browse/NUTCH-273?page=comments#action_12413528 ] 

Doug Cutting commented on NUTCH-273:
------------------------------------

Redirects should really not be followed immediately anyway.  We should instead note that it was redirected and to which URL in the fetcher output.  Then, when the crawl db is updated with the fetcher output, the target of the redirect should be added, with the full OPIC score of the original URL.  This will enable proper politeness guarantees.
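
The flow described above can be sketched in plain Java. This is only an illustration under stated assumptions: FetchOutput, Entry, updateDb, and the status strings are hypothetical names, not Nutch classes, and the crawl db is modeled as an in-memory map. What to do when the redirect target is already in the crawl db is not covered by the comment; this sketch simply leaves such an entry untouched.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; none of these types are Nutch classes. */
final class DeferredRedirectSketch {

    /** One fetcher output record: the URL fetched and, if redirected, the target. */
    static final class FetchOutput {
        final String url;
        final String redirectTarget; // null when the fetch was not redirected

        FetchOutput(String url, String redirectTarget) {
            this.url = url;
            this.redirectTarget = redirectTarget;
        }
    }

    /** One crawl db entry: a status string and a score. */
    static final class Entry {
        String status;
        double score;

        Entry(String status, double score) {
            this.status = status;
            this.score = score;
        }
    }

    /**
     * Applies fetcher output to the crawl db. A redirect is only noted here;
     * its target is added as an unfetched entry so that it is fetched in a
     * later round instead of being followed immediately.
     */
    static void updateDb(Map<String, Entry> crawlDb, List<FetchOutput> fetched) {
        for (FetchOutput out : fetched) {
            Entry source = crawlDb.get(out.url);
            if (source == null) {
                continue; // URL unknown to the db; ignored in this sketch
            }
            if (out.redirectTarget == null) {
                source.status = "fetched";
                continue;
            }
            source.status = "redirected";
            // Assumption: an existing target entry is left as it is.
            if (!crawlDb.containsKey(out.redirectTarget)) {
                // The new target gets the full score of the original URL.
                crawlDb.put(out.redirectTarget, new Entry("unfetched", source.score));
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Entry> crawlDb = new HashMap<>();
        crawlDb.put("http://a.example/old", new Entry("unfetched", 2.5));

        List<FetchOutput> fetched = new ArrayList<>();
        fetched.add(new FetchOutput("http://a.example/old", "http://b.example/new"));

        updateDb(crawlDb, fetched);
        for (Map.Entry<String, Entry> e : crawlDb.entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue().status + " " + e.getValue().score);
        }
    }
}
```

Because the target only enters the crawl db as an unfetched entry, it would be scheduled like any other URL in the next fetch round.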

It would be nice to still associate the original URL with the content of the redirect URL when indexing.  Perhaps a list of URLs that redirected to each page could be kept in the CrawlDatum metadata?  Can anyone think of a better way to implement this?
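
One possible shape for the metadata idea, again as a hedged sketch rather than Nutch code: Datum stands in for a crawl db entry with a metadata map, and the key "redirected.from" plus the methods recordRedirect and urlsForIndexing are made-up names for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; Datum is a stand-in, not the Nutch CrawlDatum class. */
final class RedirectSourcesSketch {

    /** Assumed metadata key for the list of URLs that redirected to a page. */
    static final String REDIRECTED_FROM = "redirected.from";

    /** Minimal crawl db entry carrying only a metadata map. */
    static final class Datum {
        final Map<String, Set<String>> metadata = new HashMap<>();
    }

    /** Notes on the target's entry that 'from' redirected to 'to'. */
    static void recordRedirect(Map<String, Datum> crawlDb, String from, String to) {
        Datum target = crawlDb.computeIfAbsent(to, k -> new Datum());
        // A LinkedHashSet keeps insertion order and drops duplicate sources.
        target.metadata
              .computeIfAbsent(REDIRECTED_FROM, k -> new LinkedHashSet<>())
              .add(from);
    }

    /** Returns the page's own URL followed by every URL that redirected to it. */
    static List<String> urlsForIndexing(Map<String, Datum> crawlDb, String url) {
        List<String> urls = new ArrayList<>();
        urls.add(url);
        Datum datum = crawlDb.get(url);
        if (datum != null) {
            urls.addAll(datum.metadata.getOrDefault(REDIRECTED_FROM, Collections.emptySet()));
        }
        return urls;
    }

    public static void main(String[] args) {
        Map<String, Datum> crawlDb = new HashMap<>();
        recordRedirect(crawlDb, "http://a.example/old", "http://b.example/new");
        recordRedirect(crawlDb, "http://a.example/older", "http://b.example/new");
        System.out.println(urlsForIndexing(crawlDb, "http://b.example/new"));
    }
}
```

At indexing time, the content of the target page could then be associated with every URL returned by urlsForIndexing.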


> When a page is redirected, the original url is NOT updated.
> -----------------------------------------------------------
>
>          Key: NUTCH-273
>          URL: http://issues.apache.org/jira/browse/NUTCH-273
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.8-dev
>  Environment: n/a
>     Reporter: Lukas Vlcek
>
> [Excerpt from mailing list, sender: Andrzej Bialecki]
> When a page is redirected, the original url is NOT updated - so, CrawlDB will never know that a redirect occurred, it won't even know that a fetch occurred... This looks like a bug.
> In 0.7 this was recorded in the segment, and then it would affect the Page status during updatedb. It should do so in 0.8, too...
