Posted to commits@cassandra.apache.org by "Sylvain Lebresne (JIRA)" <ji...@apache.org> on 2011/06/10 18:20:58 UTC

[jira] [Created] (CASSANDRA-2759) Scrub could lose increments and replicate that loss

Scrub could lose increments and replicate that loss
---------------------------------------------------

                 Key: CASSANDRA-2759
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2759
             Project: Cassandra
          Issue Type: Bug
          Components: Core
    Affects Versions: 0.8.0
            Reporter: Sylvain Lebresne
             Fix For: 0.8.1


If scrub cannot 'repair' a corrupted row, it will skip it. On node A, if the row contains some sub-counts for A's id, those will be lost forever, since A is the source of truth for its current id. We should thus renew node A's id when that happens to avoid this (not unlike what we do in cleanup).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2759) Scrub could lose increments and replicate that loss

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13047306#comment-13047306 ] 

Sylvain Lebresne commented on CASSANDRA-2759:
---------------------------------------------

It may be that the best short-term fix here is to make scrub *not* skip rows on counter column families (though CASSANDRA-2614 would change that to 'never ever skipping row') and just throw a RuntimeException.

> Scrub could lose increments and replicate that loss
> ---------------------------------------------------
>
>                 Key: CASSANDRA-2759
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2759
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.8.0
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>              Labels: counters
>             Fix For: 0.8.1
>
>         Attachments: 0001-Renew-nodeId-in-scrub-when-skipping-rows.patch
>
>
> If scrub cannot 'repair' a corrupted row, it will skip it. On node A, if the row contains some sub-counts for A's id, those will be lost forever, since A is the source of truth for its current id. We should thus renew node A's id when that happens to avoid this (not unlike what we do in cleanup).


[jira] [Updated] (CASSANDRA-2759) Scrub could lose increments and replicate that loss

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2759:
----------------------------------------

    Attachment: 0001-Don-t-skip-rows-on-scrub-for-counter-CFs.patch

Attaching patch to simply re-throw the exception instead of skipping the row for counter column families.

bq. Only if you actually did have a counter in the column_metadata, right?

right.
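
The change described in this comment (re-throwing instead of skipping a corrupted row when the column family holds counters) can be illustrated with a minimal Python sketch. It is not the attached patch: `scrub`, `read_row`, `parse`, `CorruptRowError` and the `is_counter_cf` flag are hypothetical names chosen for illustration, and rows are modelled as plain Python values.

```python
class CorruptRowError(Exception):
    """Raised by the hypothetical row reader when a row cannot be parsed."""


def scrub(rows, read_row, is_counter_cf):
    """Return the rows that could be read.

    rows          -- iterable of raw rows (hypothetical representation)
    read_row      -- callable that parses a raw row or raises CorruptRowError
    is_counter_cf -- True when the column family holds counters
    """
    good = []
    for raw in rows:
        try:
            good.append(read_row(raw))
        except CorruptRowError as e:
            if is_counter_cf:
                # Skipping could silently drop this node's own sub-counts,
                # so abort the scrub instead of continuing.
                raise RuntimeError("corrupt row in counter column family") from e
            # Other column families keep the skip-and-continue behaviour.
            continue
    return good


def parse(raw):
    """Toy reader: None stands in for an unreadable row."""
    if raw is None:
        raise CorruptRowError("unreadable row")
    return raw
```

In this sketch, `scrub(["a", None, "b"], parse, is_counter_cf=False)` returns the two readable rows, while the same call with `is_counter_cf=True` raises a `RuntimeError`.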

[jira] [Commented] (CASSANDRA-2759) Scrub could lose increments and replicate that loss

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13047422#comment-13047422 ] 

Jonathan Ellis commented on CASSANDRA-2759:
-------------------------------------------

bq. make scrub not skip rows on counter column families

+1

bq. CASSANDRA-2614 would change that to 'never ever skipping row'

Only if you actually did have a counter in the column_metadata, right?

[jira] [Updated] (CASSANDRA-2759) Scrub could lose increments and replicate that loss

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2759:
----------------------------------------

    Attachment: 0001-Renew-nodeId-in-scrub-when-skipping-rows.patch

Attached patch against 0.8.

The patch also adds a new startup option to renew the node id on startup. This could be useful if someone loses one of their sstables (because of a bad disk, for instance) and doesn't want to fully decommission that node.

This could arguably be split into another ticket though.

[jira] [Commented] (CASSANDRA-2759) Scrub could lose increments and replicate that loss

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13047285#comment-13047285 ] 

Sylvain Lebresne commented on CASSANDRA-2759:
---------------------------------------------

It's picking a new UUID for the current node to use for new counter increments.

The problem is that on a given node we store deltas for its current nodeId (to avoid a synchronized read-before-write, but I'm starting to wonder if that was the smartest idea ever). Anyway, if scrub skips a row, it may skip some of those deltas. Let's say that at first no increments come in for this row with A as the 'first distinguished replica'. So far we are still kind of good, because on a read (with CL > ONE) the result coming from A will have a 'version' for its own sub-count smaller than the one on the other replicas, so we will use the sub-count from those replicas and return the correct value.

However, as soon as A acknowledges new increments for this row, it will start inserting new deltas while it is not intrinsically up to date, which will result in a definitive undercount.

The goal of renewing the node id of A is to make sure that second part never happens (because after the renewal A will add new deltas as A', not as A anymore).

Anyway, now that I've engaged my brain, this patch doesn't really work, because A will never be repaired by the other nodes for its now-inconsistent value.

So I have no clue how to actually fix that.
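
The undercount scenario described in this comment can be traced with a small toy model in Python. This is a simplified sketch, not Cassandra's counter implementation: it assumes each replica keeps, per node id, a `(clock, count)` pair, and that a merge keeps the pair with the highest clock for each node id. The names `merge`, `value` and `increment` are hypothetical.

```python
def merge(*replicas):
    """Per node id, keep the (clock, count) shard with the highest clock."""
    merged = {}
    for state in replicas:
        for node_id, shard in state.items():
            if node_id not in merged or shard[0] > merged[node_id][0]:
                merged[node_id] = shard
    return merged


def value(state):
    """Counter value: the sum of every node id's sub-count."""
    return sum(count for _clock, count in state.values())


def increment(state, node_id, delta):
    """Apply a delta locally under the given node id, bumping its clock."""
    clock, count = state.get(node_id, (0, 0))
    state[node_id] = (clock + 1, count + delta)


a, b = {}, {}
true_total = 0

# Node A acknowledges three increments, which are then replicated to B.
for _ in range(3):
    increment(a, "A", 1)
    true_total += 1
b = merge(b, a)                    # b == {"A": (3, 3)}

# Scrub on A skips the corrupted row: A loses its own sub-count.
a.clear()

# A read that merges A and B is still correct, since B's shard has the
# higher clock for id "A".
read_before = value(merge(a, b))   # 3

# A acknowledges a new increment under the same id, starting from nothing.
increment(a, "A", 1)               # a == {"A": (1, 1)}
true_total += 1
b = merge(b, a)                    # B keeps (3, 3): the new delta is masked
read_after = value(merge(a, b))    # 3, while true_total is 4
```

In this model the read before the new increment returns 3, which is correct, but the read after it still returns 3 even though four increments were acknowledged, which is the undercount the comment describes.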

[jira] [Commented] (CASSANDRA-2759) Scrub could lose increments and replicate that loss

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049276#comment-13049276 ] 

Hudson commented on CASSANDRA-2759:
-----------------------------------

Integrated in Cassandra-0.8 #170 (See [https://builds.apache.org/job/Cassandra-0.8/170/])
    

[jira] [Assigned] (CASSANDRA-2759) Scrub could lose increments and replicate that loss

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne reassigned CASSANDRA-2759:
-------------------------------------------

    Assignee: Sylvain Lebresne

[jira] [Commented] (CASSANDRA-2759) Scrub could lose increments and replicate that loss

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049065#comment-13049065 ] 

Jonathan Ellis commented on CASSANDRA-2759:
-------------------------------------------

+1

can you add a link to this issue in the "dangerous" comment?

[jira] [Commented] (CASSANDRA-2759) Scrub could lose increments and replicate that loss

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13047273#comment-13047273 ] 

Jonathan Ellis commented on CASSANDRA-2759:
-------------------------------------------

what is "renewing a node id?"
