Posted to dev@nutch.apache.org by "Doug Cook (JIRA)" <ji...@apache.org> on 2006/12/20 23:40:22 UTC

[jira] Commented: (NUTCH-416) CrawlDatum status and CrawlDbReducer refactoring

    [ http://issues.apache.org/jira/browse/NUTCH-416?page=comments#action_12460080 ] 
            
Doug Cook commented on NUTCH-416:
---------------------------------

You may also want to make the status codes ORed values, so that, for example, every kind of failure has a FAILED code ORed in. That makes it clean and easy to check for "any failure case" while still allowing distinct failure codes. At the lowest level, the values might be things like FAILED, FETCHED, and UNFETCHED, while REDIRECT might be (FETCHED | something), specific redirect codes would be (REDIRECT | something), specific failure codes would be (FAILED | something), etc. This way we can keep all of the specific failure codes, all of the specific redirect codes, etc. while making the code cleaner and more reliable. We won't have to worry about keeping range checks or switch statements in sync if we add new codes; a statement like
   if ((code & FAILED) != 0) {
   }
will always tell us whether a URL fetch failed, regardless of how many codes we add. As the code stands now, adding a status code is likely to break things unless one carefully goes through every place where status codes are examined and makes sure the new code is properly accounted for.
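
A minimal sketch of the ORed scheme, assuming made-up constant names and bit values (none of these are the actual CrawlDatum constants). It also shows one wrinkle: because a composite code such as REDIRECT carries the FETCHED bit, testing for it needs an equality check against the full mask rather than a non-zero check.

```java
// Hypothetical sketch: names and bit values are illustrative only,
// not the real CrawlDatum status constants.
final class StatusCodes {
    // Base categories, one bit each.
    static final int UNFETCHED = 0x01;
    static final int FETCHED   = 0x02;
    static final int FAILED    = 0x04;

    // A redirect is a kind of fetch: FETCHED plus its own bit.
    static final int REDIRECT = FETCHED | 0x10;

    // Specific codes: category mask plus a distinguishing bit.
    static final int REDIRECT_PERMANENT = REDIRECT | 0x100;
    static final int REDIRECT_TEMPORARY = REDIRECT | 0x200;
    static final int FAILED_TIMEOUT     = FAILED | 0x100;
    static final int FAILED_NOT_FOUND   = FAILED | 0x200;

    private StatusCodes() {}

    // Single-bit categories: a non-zero test is enough.
    static boolean isFailed(int code)  { return (code & FAILED) != 0; }
    static boolean isFetched(int code) { return (code & FETCHED) != 0; }

    // Multi-bit category: all bits of the mask must be present,
    // otherwise a plain FETCHED code would also count as a redirect.
    static boolean isRedirect(int code) { return (code & REDIRECT) == REDIRECT; }
}
```

With this layout, adding a new code such as FAILED | 0x400 needs no change to isFailed or to any caller that only cares about the category.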

While you're changing the CrawlDatum, it might also make sense to store a second URL, e.g. that of the redirect target. I have a hunch this will be very useful.

Just some thoughts. Thanks for making this happen.

Doug



> CrawlDatum status and CrawlDbReducer refactoring
> ------------------------------------------------
>
>                 Key: NUTCH-416
>                 URL: http://issues.apache.org/jira/browse/NUTCH-416
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>             Fix For: 0.9.0
>
>
> CrawlDatum needs more status codes, e.g. to reflect redirected pages. However, current values of status codes are linear, which prevents us from adding new codes in proper places. This is also related to the logic in CrawlDbReducer, which makes decisions based on arithmetic ordering of status code values.
> I propose to change the codes so that they are grouped into related values, with significant gaps between groups for adding new codes without causing significant reordering. I also propose to change the logic in CrawlDbReducer so that its operation is not so dependent on actual code values.
> A mapping should also be added between old and new codes to facilitate backward-compatibility of existing data. This mapping should be applied on the fly, without requiring explicit data conversion.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira