Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2012/07/06 16:05:34 UTC
[jira] [Created] (NUTCH-1422) reset signature for redirects
Sebastian Nagel created NUTCH-1422:
--------------------------------------
Summary: reset signature for redirects
Key: NUTCH-1422
URL: https://issues.apache.org/jira/browse/NUTCH-1422
Project: Nutch
Issue Type: Bug
Components: crawldb, fetcher
Affects Versions: 1.4
Reporter: Sebastian Nagel
Fix For: 1.6
In a long-running continuous crawl with Nutch 1.4, URLs with an HTTP redirect (http.redirect.max = 0) are kept as not-modified in the CrawlDb. Brief timeline (cf. attached dumped segment / CrawlDb data):
2012-02-23 : injected
2012-02-24 : fetched
2012-03-30 : re-fetched, signature changed
2012-04-20 : re-fetched, redirected
2012-04-24 : in CrawlDb as db_notmodified, still indexed with old content!
The signature of a previously fetched document is not reset when the URL/doc changes to a redirect at a later time. CrawlDbReducer.reduce then sets the status to db_notmodified because the signature that comes in with the fetch status is identical to the old one.
Possible fixes (??):
* reset the signature in Fetcher
* handle this case in CrawlDbReducer.reduce
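The mechanism and the two candidate fixes can be illustrated with a small, self-contained sketch. This is not Nutch code: the class SignatureResetSketch, its enums, the FetchResult record and all method names are hypothetical stand-ins, and the status names only loosely mirror those in the CrawlDb. It assumes, as described above, that the redirect's fetch entry still carries the signature of the previously fetched document.

```java
import java.util.Arrays;

public class SignatureResetSketch {

    // Simplified stand-ins for CrawlDb statuses and fetch outcomes (hypothetical, not Nutch constants).
    enum Status { DB_FETCHED, DB_NOTMODIFIED, DB_REDIR_TEMP }
    enum FetchOutcome { SUCCESS, REDIRECT }

    /** Hypothetical, stripped-down fetch entry: an outcome plus the signature it carries. */
    static final class FetchResult {
        final FetchOutcome outcome;
        final byte[] signature; // null means "no signature"

        FetchResult(FetchOutcome outcome, byte[] signature) {
            this.outcome = outcome;
            this.signature = signature;
        }
    }

    /** Models the reported behaviour: signatures are compared regardless of the fetch outcome. */
    static Status reduceAsReported(byte[] oldSignature, FetchResult fetch) {
        if (oldSignature != null && Arrays.equals(oldSignature, fetch.signature)) {
            return Status.DB_NOTMODIFIED;
        }
        return fetch.outcome == FetchOutcome.REDIRECT ? Status.DB_REDIR_TEMP : Status.DB_FETCHED;
    }

    /** Candidate fix 1: the fetcher drops the stale signature when the outcome is a redirect. */
    static FetchResult resetSignatureOnRedirect(FetchResult fetch) {
        return fetch.outcome == FetchOutcome.REDIRECT
                ? new FetchResult(fetch.outcome, null)
                : fetch;
    }

    /** Candidate fix 2: the reducer checks for a redirect before comparing signatures. */
    static Status reduceRedirectAware(byte[] oldSignature, FetchResult fetch) {
        if (fetch.outcome == FetchOutcome.REDIRECT) {
            return Status.DB_REDIR_TEMP;
        }
        return reduceAsReported(oldSignature, fetch);
    }

    public static void main(String[] args) {
        byte[] oldSignature = {1, 2, 3};
        // The redirect entry still carries the old document's signature.
        FetchResult redirect = new FetchResult(FetchOutcome.REDIRECT, oldSignature.clone());

        System.out.println(reduceAsReported(oldSignature, redirect));                           // DB_NOTMODIFIED
        System.out.println(reduceAsReported(oldSignature, resetSignatureOnRedirect(redirect))); // DB_REDIR_TEMP
        System.out.println(reduceRedirectAware(oldSignature, redirect));                        // DB_REDIR_TEMP
    }
}
```

In this toy model both fixes give the same result. Which one fits the real code better depends on details of Fetcher and CrawlDbReducer that the sketch does not capture.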
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1422) reset signature for redirects
Posted by "Sebastian Nagel (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-1422:
-----------------------------------
Attachment: NUTCH-1422_redir_notmodified_log.txt
[jira] [Commented] (NUTCH-1422) reset signature for redirects
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482224#comment-13482224 ]
Markus Jelsma commented on NUTCH-1422:
--------------------------------------
I think we can reset the signature in the fetcher/parser.
[jira] [Commented] (NUTCH-1422) reset signature for redirects
Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482308#comment-13482308 ]
Lewis John McGibbney commented on NUTCH-1422:
---------------------------------------------
I feel that this should be done within the fetcher, as there can be a significant delay between fetching and parsing (assuming, of course, that parsing is not executed within the fetching phase) for large crawls, as in the case submitted by Seb.