Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2014/07/04 00:22:34 UTC

[jira] [Comment Edited] (NUTCH-1502) Test for CrawlDatum state transitions

    [ https://issues.apache.org/jira/browse/NUTCH-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051968#comment-14051968 ] 

Sebastian Nagel edited comment on NUTCH-1502 at 7/3/14 10:22 PM:
-----------------------------------------------------------------

Patch which adds the following unit tests:
* test matrix of state transitions with
** CrawlDbReducer and InjectReducer
** Default and AdaptiveFetchSchedule
* fetch_gone -> db_gone (NUTCH-1245)
* not modified time (cf. NUTCH-933)
* fetch_retry -> db_gone after max retries (NUTCH-578)
* immediate refetch by sync_delta of AdaptiveFetchSchedule (NUTCH-1564)
* signature reset / erroneous db_notmodified (NUTCH-1422)

The last four points are open issues; the corresponding tests are in a separate TODO test class or marked as such. The tests should make it easier to find solutions for these issues, since they are now reproducible. That's the main improvement: the tests log a lot of information, which makes it possible to understand what's going wrong. Because these problems only show up after a long time, they are hard to investigate in real crawls (one needs to check dozens of segments).
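
To illustrate the idea of a transition matrix test, here is a minimal, self-contained sketch. It is not the code from the patch and does not use the Nutch classes: the enums, the next() function and the maxRetries parameter are hypothetical stand-ins for the reducer under test and its retry limit. The sketch enumerates every combination of old state, fetch outcome and failure count, logs each transition, and checks two of the rules listed above (fetch_gone -> db_gone, fetch_retry -> db_gone after max retries).

```java
public class TransitionMatrixSketch {

    /** Simplified subset of CrawlDb states (hypothetical stand-in). */
    enum DbStatus { DB_UNFETCHED, DB_FETCHED, DB_NOTMODIFIED, DB_GONE }

    /** Simplified subset of fetch outcomes (hypothetical stand-in). */
    enum FetchStatus { FETCH_SUCCESS, FETCH_NOTMODIFIED, FETCH_RETRY, FETCH_GONE }

    /**
     * Hypothetical transition function standing in for the reducer under test.
     * failedAttempts counts consecutive failed fetches, including the current one.
     * In this simplified model the old state does not influence the result.
     */
    static DbStatus next(DbStatus old, FetchStatus fetch, int failedAttempts, int maxRetries) {
        switch (fetch) {
            case FETCH_SUCCESS:
                return DbStatus.DB_FETCHED;
            case FETCH_NOTMODIFIED:
                return DbStatus.DB_NOTMODIFIED;
            case FETCH_GONE:
                return DbStatus.DB_GONE;
            case FETCH_RETRY:
                // Stay unfetched until the retry limit is reached, then give up.
                return failedAttempts < maxRetries ? DbStatus.DB_UNFETCHED : DbStatus.DB_GONE;
            default:
                throw new IllegalArgumentException("unknown fetch status: " + fetch);
        }
    }

    public static void main(String[] args) {
        int maxRetries = 3; // assumed value, chosen for illustration only
        // Walk the full matrix: old state x fetch outcome x failure count.
        for (DbStatus old : DbStatus.values()) {
            for (FetchStatus fetch : FetchStatus.values()) {
                for (int failed = 1; failed <= maxRetries; failed++) {
                    DbStatus result = next(old, fetch, failed, maxRetries);
                    // Log every transition so a failure can be traced afterwards.
                    System.out.printf("%s + %s (failed=%d) -> %s%n", old, fetch, failed, result);
                    if (fetch == FetchStatus.FETCH_GONE && result != DbStatus.DB_GONE) {
                        throw new AssertionError("fetch_gone must lead to db_gone");
                    }
                    if (fetch == FetchStatus.FETCH_RETRY && failed == maxRetries
                            && result != DbStatus.DB_GONE) {
                        throw new AssertionError("fetch_retry must lead to db_gone after max retries");
                    }
                }
            }
        }
    }
}
```

In a real test the next() call would be replaced by a run of the actual reducer with the configured fetch schedule, so that the same loop covers each reducer and schedule combination.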



> Test for CrawlDatum state transitions
> -------------------------------------
>
>                 Key: NUTCH-1502
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1502
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb
>    Affects Versions: 1.7, 2.2
>            Reporter: Sebastian Nagel
>             Fix For: 2.4, 1.9
>
>         Attachments: NUTCH-1502-trunk-v1.patch
>
>
> An exhaustive test to check the matrix of CrawlDatum state transitions (CrawlStatus in 2.x) would be useful to detect errors, especially for continuous crawls where the number of possible transitions is quite large. Additional factors with impact on state transitions (retry counters, static and dynamic intervals) are also tested.
> The tests will help to address NUTCH-578 and NUTCH-1245. See the latter for a first sketchy patch.


