
Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2006/12/15 13:47:20 UTC

[jira] Created: (NUTCH-416) CrawlDatum status and CrawlDbReducer refactoring

CrawlDatum status and CrawlDbReducer refactoring
------------------------------------------------

                 Key: NUTCH-416
                 URL: http://issues.apache.org/jira/browse/NUTCH-416
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 0.9.0
            Reporter: Andrzej Bialecki 
         Assigned To: Andrzej Bialecki 
             Fix For: 0.9.0


CrawlDatum needs more status codes, e.g. to reflect redirected pages. However, the current status code values are numbered consecutively, which leaves no room to add new codes in their proper places. This is also related to the logic in CrawlDbReducer, which makes decisions based on the arithmetic ordering of status code values.

I propose to change the codes so that they are grouped into related values, with significant gaps between groups for adding new codes without causing significant reordering. I also propose to change the logic in CrawlDbReducer so that its operation is not so dependent on actual code values.

A mapping should also be added between old and new codes to facilitate backward-compatibility of existing data. This mapping should be applied on the fly, without requiring explicit data conversion.
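
A minimal sketch of such an on-the-fly mapping, in Java. The new-layout values are the ones quoted from the patch later in this thread; the legacy values, the version constant, and the class and method names are assumptions made up for illustration and are not taken from the Nutch source. Because the old and new numberings can overlap, the sketch assumes each stored record carries a format version that tells the reader which table applies.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: translate legacy status bytes to the new layout while a record is read. */
final class StatusCompat {
  // New-layout values, as quoted from the patch later in this thread.
  static final byte STATUS_DB_UNFETCHED  = 0x01;
  static final byte STATUS_DB_FETCHED    = 0x02;
  static final byte STATUS_DB_GONE       = 0x03;
  static final byte STATUS_FETCH_SUCCESS = 0x21;
  static final byte STATUS_FETCH_RETRY   = 0x22;
  static final byte STATUS_FETCH_GONE    = 0x25;
  static final byte STATUS_SIGNATURE     = 0x41;
  static final byte STATUS_LINKED        = 0x43;

  // ASSUMPTION: a consecutive legacy numbering, used here only as an example.
  static final byte OLD_STATUS_SIGNATURE     = 0;
  static final byte OLD_STATUS_DB_UNFETCHED  = 1;
  static final byte OLD_STATUS_DB_FETCHED    = 2;
  static final byte OLD_STATUS_DB_GONE       = 3;
  static final byte OLD_STATUS_LINKED        = 4;
  static final byte OLD_STATUS_FETCH_SUCCESS = 5;
  static final byte OLD_STATUS_FETCH_RETRY   = 6;
  static final byte OLD_STATUS_FETCH_GONE    = 7;

  // ASSUMPTION: hypothetical format version at which the new layout starts.
  static final byte NEW_LAYOUT_VERSION = 5;

  private static final Map<Byte, Byte> OLD_TO_NEW = new HashMap<>();
  static {
    OLD_TO_NEW.put(OLD_STATUS_SIGNATURE, STATUS_SIGNATURE);
    OLD_TO_NEW.put(OLD_STATUS_DB_UNFETCHED, STATUS_DB_UNFETCHED);
    OLD_TO_NEW.put(OLD_STATUS_DB_FETCHED, STATUS_DB_FETCHED);
    OLD_TO_NEW.put(OLD_STATUS_DB_GONE, STATUS_DB_GONE);
    OLD_TO_NEW.put(OLD_STATUS_LINKED, STATUS_LINKED);
    OLD_TO_NEW.put(OLD_STATUS_FETCH_SUCCESS, STATUS_FETCH_SUCCESS);
    OLD_TO_NEW.put(OLD_STATUS_FETCH_RETRY, STATUS_FETCH_RETRY);
    OLD_TO_NEW.put(OLD_STATUS_FETCH_GONE, STATUS_FETCH_GONE);
  }

  /** Returns the status in the new layout, converting legacy records as they are read. */
  static byte readStatus(byte recordVersion, byte rawStatus) {
    if (recordVersion >= NEW_LAYOUT_VERSION) {
      return rawStatus; // already written with the new codes
    }
    Byte mapped = OLD_TO_NEW.get(rawStatus);
    if (mapped == null) {
      throw new IllegalArgumentException("Unknown legacy status: " + rawStatus);
    }
    return mapped;
  }
}
```

With a lookup like this applied during deserialization, existing data would not need a separate conversion pass; old records would simply be rewritten with the new codes the next time they are stored.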

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (NUTCH-416) CrawlDatum status and CrawlDbReducer refactoring

Posted by "Doug Cook (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-416?page=comments#action_12460080 ] 
            
Doug Cook commented on NUTCH-416:
---------------------------------

You may also want to make the status codes ORed values, so that, for example, every kind of failure has a FAILED code ORed in, making it clean and easy in the code to check for "any failure case" while still allowing different failure codes. At the lowest level, the values might be things like FAILED, FETCHED, and UNFETCHED, while REDIRECT might be (FETCHED | something), specific redirect codes would be (REDIRECT | something), specific failure codes would be (FAILED | something), etc. This way we can keep all of the specific failure codes, all of the specific redirect codes, etc., while making the code cleaner and more reliable. We won't have to worry about keeping range checks or switch statements in sync if we add new codes; a statement like
   if ((code & FAILED) != 0) {
   }
will always tell us whether a URL fetch failed, regardless of how many codes we add. As the code stands, adding a status code is likely to break things unless one goes through every place where status codes are examined and makes sure the new code is properly accounted for.
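
A minimal sketch of the ORed-flag idea, in Java. The bit assignments and every name below are hypothetical and chosen only to illustrate the suggestion; nothing here is taken from the Nutch source. Plain ints are used for readability. Note that this small layout already occupies all eight bits of a byte.

```java
/** Sketch: status codes built from ORed flags (hypothetical bit layout). */
final class StatusFlags {
  // Base categories: one bit each.
  static final int UNFETCHED = 0x01;
  static final int FETCHED   = 0x02;
  static final int FAILED    = 0x04;

  // A redirect is a kind of fetched page, so it carries the FETCHED bit.
  static final int REDIRECT      = FETCHED | 0x08;
  static final int REDIRECT_TEMP = REDIRECT | 0x10;
  static final int REDIRECT_PERM = REDIRECT | 0x20;

  // Specific failures all carry the FAILED bit.
  static final int FAILED_GONE  = FAILED | 0x40;
  static final int FAILED_RETRY = FAILED | 0x80;

  /** True for any failure code, however many specific ones are added later. */
  static boolean isFailed(int code) {
    // Parentheses are required: in Java, != binds tighter than &.
    return (code & FAILED) != 0;
  }

  /** True for any code that carries the FETCHED bit, redirects included. */
  static boolean isFetched(int code) {
    return (code & FETCHED) != 0;
  }

  /** True only when every bit of the composite REDIRECT mask is set. */
  static boolean isRedirect(int code) {
    return (code & REDIRECT) == REDIRECT;
  }
}
```

A composite mask such as REDIRECT needs the `== REDIRECT` comparison rather than `!= 0`; otherwise a plain FETCHED code, which shares one bit with it, would also match.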

While you're changing the CrawlDatum, it might also make sense to store a second URL, e.g. that of the redirect target. I have a hunch this will be very useful.

Just some thoughts. Thanks for making this happen.

Doug





[jira] Commented: (NUTCH-416) CrawlDatum status and CrawlDbReducer refactoring

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-416?page=comments#action_12460091 ] 
            
Andrzej Bialecki  commented on NUTCH-416:
-----------------------------------------

There are two main distinct groups of status codes, but not along the lines of success/failure - these are DB and Fetch status codes. Additionally, the number of available bits for a bitmask is very small, because the status needs to fit in a byte.

My patch in progress currently contains the following:

  public static final byte STATUS_DB_UNFETCHED      = 0x01;
  public static final byte STATUS_DB_FETCHED        = 0x02;
  public static final byte STATUS_DB_GONE           = 0x03;
  public static final byte STATUS_DB_REDIR_TEMP     = 0x04;
  public static final byte STATUS_DB_REDIR_PERM     = 0x05;
  
  /** Maximum value of DB-related status. */
  public static final byte STATUS_DB_MAX            = 0x1f;
  
  public static final byte STATUS_FETCH_SUCCESS     = 0x21;
  public static final byte STATUS_FETCH_RETRY       = 0x22;
  public static final byte STATUS_FETCH_REDIR_TEMP  = 0x23;
  public static final byte STATUS_FETCH_REDIR_PERM  = 0x24;
  public static final byte STATUS_FETCH_GONE        = 0x25;
  
  /** Maximum value of fetch-related status. */
  public static final byte STATUS_FETCH_MAX         = 0x3f;
  
  public static final byte STATUS_SIGNATURE         = 0x41;
  public static final byte STATUS_INJECTED          = 0x42;
  public static final byte STATUS_LINKED            = 0x43;
  
  /** Returns true if the datum carries a DB-related status. */
  public static boolean hasDbStatus(CrawlDatum datum) {
    return datum.status <= STATUS_DB_MAX;
  }

  /** Returns true if the datum carries a fetch-related status. */
  public static boolean hasFetchStatus(CrawlDatum datum) {
    return datum.status > STATUS_DB_MAX && datum.status <= STATUS_FETCH_MAX;
  }

... so, I went with ranges of values. The most unwieldy switch() statements in the current code were the ones distinguishing DB status from Fetch status; the two static methods above handle this and simplify the code.
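
A minimal, self-contained sketch of how the range checks could replace a switch over individual codes. The two MAX constants are the values quoted above; the class name, the describe() helper, and the idea of passing a bare byte instead of a CrawlDatum are assumptions made to keep the example standalone.

```java
/** Sketch: classify a status byte by range instead of switching on each code. */
final class StatusRanges {
  // Range boundaries, as quoted from the patch above.
  static final byte STATUS_DB_MAX    = 0x1f;
  static final byte STATUS_FETCH_MAX = 0x3f;

  static boolean hasDbStatus(byte status) {
    // Java bytes are signed, so a negative value would also pass this test;
    // all codes in the patch are positive, so that case does not arise here.
    return status <= STATUS_DB_MAX;
  }

  static boolean hasFetchStatus(byte status) {
    return status > STATUS_DB_MAX && status <= STATUS_FETCH_MAX;
  }

  /** Hypothetical helper: names the group a status belongs to. */
  static String describe(byte status) {
    if (hasDbStatus(status)) {
      return "db";
    }
    if (hasFetchStatus(status)) {
      return "fetch";
    }
    return "other";
  }
}
```

A code added later inside one of the gaps, say a hypothetical 0x06 in the DB range or 0x26 in the fetch range, would be classified correctly without touching these helpers, which is the point of leaving room between the groups.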

Regarding the redirect URL - because of space constraints I'd rather use Metadata for this. We already handle metadata efficiently, so that performance doesn't suffer if we don't have any metadata to keep. It would make sense, though, to have a predefined key for this URL.
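
A minimal sketch of what a predefined metadata key for the redirect target might look like. It uses a plain java.util.Map as a stand-in for the real metadata container, and the key string, class name, and method names are all hypothetical. The map is allocated only when something is stored, so records without metadata pay nothing extra.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: keep the redirect target under a well-known metadata key. */
final class DatumMetaSketch {
  // ASSUMPTION: hypothetical key name, not taken from the Nutch source.
  static final String REDIRECT_TARGET_KEY = "redirect.target";

  // Stays null until the first entry is stored.
  private Map<String, String> metaData;

  void setRedirectTarget(String url) {
    if (metaData == null) {
      metaData = new HashMap<>();
    }
    metaData.put(REDIRECT_TARGET_KEY, url);
  }

  /** Returns the stored redirect target, or null if none was recorded. */
  String getRedirectTarget() {
    return metaData == null ? null : metaData.get(REDIRECT_TARGET_KEY);
  }

  boolean hasMetaData() {
    return metaData != null && !metaData.isEmpty();
  }
}
```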



[jira] Closed: (NUTCH-416) CrawlDatum status and CrawlDbReducer refactoring

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-416?page=all ]

Andrzej Bialecki  closed NUTCH-416.
-----------------------------------

    Resolution: Fixed

Fixed in trunk, rev. 490607. As a side effect it is now possible to correctly update CrawlDB from multiple segments, even if they contain duplicate pages - the code in CrawlDbReducer will correctly apply only the latest version of CrawlDatum.
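
A minimal sketch of the "apply only the latest version" idea. It assumes that "latest" means the entry with the greatest fetch time; the message above does not say how the committed code decides, so that criterion, along with the Entry class and the method name, is an assumption for illustration only.

```java
import java.util.List;

/** Sketch: among several entries for one URL, keep only the most recent. */
final class LatestDatum {
  /** Hypothetical stand-in for the fields of a CrawlDatum that matter here. */
  static final class Entry {
    final byte status;
    final long fetchTime;

    Entry(byte status, long fetchTime) {
      this.status = status;
      this.fetchTime = fetchTime;
    }
  }

  /** Returns the entry with the greatest fetch time, or null for an empty list. */
  static Entry latest(List<Entry> entries) {
    Entry best = null;
    for (Entry e : entries) {
      // Strict comparison: on a tie, the entry seen first is kept.
      if (best == null || e.fetchTime > best.fetchTime) {
        best = e;
      }
    }
    return best;
  }
}
```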

