You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Senthilvel Rangaswamy (JIRA)" <ji...@apache.org> on 2012/07/06 09:01:36 UTC

[jira] [Created] (CASSANDRA-4417) invalid counter shard detected

Senthilvel Rangaswamy created CASSANDRA-4417:
------------------------------------------------

             Summary: invalid counter shard detected 
                 Key: CASSANDRA-4417
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
             Project: Cassandra
          Issue Type: Bug
          Components: Core
    Affects Versions: 1.1.1
         Environment: Amazon Linux
            Reporter: Senthilvel Rangaswamy


Seeing errors like these:

2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard

What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Comment Edited] (CASSANDRA-4417) invalid counter shard detected

Posted by "Charles Brophy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450766#comment-13450766 ] 

Charles Brophy edited comment on CASSANDRA-4417 at 9/8/12 3:39 AM:
-------------------------------------------------------------------

We have a six node cluster [1.1.3, jdk 1.6.33, CentOs 6] with even key range balance, random partitioner, and with replication factor=2. I get these errors immediately following running "nodetool repair" but ONLY if a streaming repair happens as a result. We are serving live updates to our counters from our clickstream. My guess is that the sstable being streamed between the servers winds up becoming "out of date" for the duration of the streaming process and ends up containing these "duplicates" that are vetted during the subsequent compaction. In any case, for us it is 100% reproducible via: nodetool repair -> streaming repair -> subsequent compaction. Let me know if you need more details. Hope this helps!
                
      was (Author: charlesb_zulily):
    We have a six node cluster with even key range balance, random partitioner, and with replication factor=2. I get these errors immediately following running "nodetool repair" but ONLY if a streaming repair happens as a result. We are serving live updates to our counters from our clickstream. My guess is that the sstable being streamed between the servers winds up becoming "out of date" for the duration of the streaming process and ends up containing these "duplicates" that are vetted during the subsequent compaction. In any case, for us it is 100% reproducible via: nodetool repair -> streaming repair -> subsequent compaction. Let me know if you need more details. Hope this helps!
                  
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Chris Herron (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477556#comment-13477556 ] 

Chris Herron commented on CASSANDRA-4417:
-----------------------------------------

We are seeing large volumes of this error on all nodes when running a load test while also running upgradesstables on multiple CF's on each node.

After reading Sylvain's comments above, tried running the same test with commitlog_sync: batch - we get a similar volume of the same errors.

(Running a build from branch cassandra-1.1 at commit 4d2e5e73b127dc0b335176ddc1dec1f0244e7f6d, with Java 6u35 on Amazon Linux 2.6.35)
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Omid Aladini (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13453537#comment-13453537 ] 

Omid Aladini commented on CASSANDRA-4417:
-----------------------------------------

{quote}
A simple "workaround" is to use batch commit log, but that has a potentially important performance impact.
{quote}

I'm a bit confused why batch commit would solve the problem. If cassandra crashes before the batch is fsynced, the counter mutations which it was the leader for will still be lost although they might have been applied on other replicas. The difference would be that the mutations won't be acknowledged to the client, and since counters aren't idempotent, the client won't know weather to retry or not. Am I missing something?
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486232#comment-13486232 ] 

Jonathan Ellis commented on CASSANDRA-4417:
-------------------------------------------

"On a bootstrap" sounds more like CASSANDRA-4071.
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Eric Lubow (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485333#comment-13485333 ] 

Eric Lubow commented on CASSANDRA-4417:
---------------------------------------

We are getting this on DSE 2.2 (C* 1.1.5) on a new node during bootstrap.  We upgraded the cluster from C* 1.0.10 about 10 days ago and upgradesstables was run on every node and we repaired the entire cluster.  We ran We've been getting this error sporadically on various nodes at various points but it's not consistent.  I've double and triple checked every node looking for sstable files named "*-hd-*" and I don't see any (assuming that's enough to tell that the sstable has been upgraded.  If this error is an effect of requiring one to run upgradesstables, then how would it happen during a bootstrap? All nodes involved in this cluster are 1.1.5.
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478255#comment-13478255 ] 

Sylvain Lebresne commented on CASSANDRA-4417:
---------------------------------------------

bq. Is that equivalent?

Probably, how was it deleted/recreated. Did you drop and recreate?

bq. This does reproduce for every test run.

Interesting, you may on something. Does it make sense for you to perform the same test without the upgradesstables part (i.e. only the writes and reads). If so, does that change something? Also, during that test, is there anything involving streaming going on (a repair, a node bootstrapping/moving/decommissioning)? Trying to narrow down what's involved in you test.
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Comment Edited] (CASSANDRA-4417) invalid counter shard detected

Posted by "Eric Lubow (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485333#comment-13485333 ] 

Eric Lubow edited comment on CASSANDRA-4417 at 10/27/12 2:22 AM:
-----------------------------------------------------------------

We are getting this on DSE 2.2 (C* 1.1.5) on a new node during bootstrap.  We upgraded the cluster from C* 1.0.10 about 10 days ago and upgradesstables was run on every node and we repaired the entire cluster.  We ran We've been getting this error sporadically on various nodes at various points but it's not consistent.  I've double and triple checked every node looking for sstable files named "*- hd -*" and I don't see any (assuming that's enough to tell that the sstable has been upgraded.  If this error is an effect of requiring one to run upgradesstables, then how would it happen during a bootstrap? All nodes involved in this cluster are 1.1.5.
                
      was (Author: elubow):
    We are getting this on DSE 2.2 (C* 1.1.5) on a new node during bootstrap.  We upgraded the cluster from C* 1.0.10 about 10 days ago and upgradesstables was run on every node and we repaired the entire cluster.  We ran We've been getting this error sporadically on various nodes at various points but it's not consistent.  I've double and triple checked every node looking for sstable files named "*-hd-*" and I don't see any (assuming that's enough to tell that the sstable has been upgraded.  If this error is an effect of requiring one to run upgradesstables, then how would it happen during a bootstrap? All nodes involved in this cluster are 1.1.5.
                  
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Peter Schuller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452689#comment-13452689 ] 

Peter Schuller commented on CASSANDRA-4417:
-------------------------------------------

@Fabien That sounds plausible to me upon first read at least. I cannot confirm or deny whether it's possible that I've only seen this under circumstances where a non-clean shutdown has taken place. In particular, I cannot (at the top of my head anyway) think of a way in which the situation you describe is ever *prevented* from happening. So regardless of whether or not this is *the* explanation for this problem, it seems to me to at least be *an* explanation for it.
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Fabien Rousseau (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451823#comment-13451823 ] 

Fabien Rousseau commented on CASSANDRA-4417:
--------------------------------------------

Could it be that counter increment is written in commitlog and sent to other replicas, BUT : commitlog is not yet flushed on disk AND cassandra is stopped/killed.

Ex :
 - node A receives an increment of 1 for c1
 - it stores (A, 1, 1)
 - it sends the increment to the replicas (A, 1, 1)
 - node is killed (without commitog being flushed to disk)
 - on restart, node A receives an increment of 3 for c1
 - it stores (A, 1, 3) (because it has no way of knowing the clock "1" was already attributed)

                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Charles Brophy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452016#comment-13452016 ] 

Charles Brophy commented on CASSANDRA-4417:
-------------------------------------------

Yes, we operate under a heavy write load and we do have frequent compaction as a result, but under normal conditions I never see this exception - or at least it doesn't happen often enough for me to catch. Following a repair, however, it's a guarantee for us. Could it be as simple as:

* the two servers are participants in the same key-range replicant and the sstables contain the same key/row/column references
* the process of streaming repair is sending a set of key/row/column references to the requestor in the same sstable as the out-of-sync data it's already aware of via repair
* Compaction finds the duplicate references in the recently received sstables - they're basically the other replicant's copies of that data

It seems that the act of sending the sstables from one server to the other when both are replicants of the same key range would be expected to result in duplicate references. I'm probably way off.
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Bartłomiej Romański (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457528#comment-13457528 ] 

Bartłomiej Romański commented on CASSANDRA-4417:
------------------------------------------------

And the last comment. Could this be related to: CASSANDRA-4071? If I understand the description correctly any topology changes (adding a node, moving a node) when the counter is spread across more than one sstable can result in the "invalid counter shard detected" error message during reads. Am I right?
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Michael Kjellman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500026#comment-13500026 ] 

Michael Kjellman commented on CASSANDRA-4417:
---------------------------------------------

unfortunately, hitting this as well. We increment by different values as well. RF=3 on 1.1.6. Happened to me after i did a nodetool drain and restarted a node. When it came back up started seeing it being logged.
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>         Attachments: cassandra-mck.log.bz2, err.txt
>
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Chris Herron (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13480121#comment-13480121 ] 

Chris Herron commented on CASSANDRA-4417:
-----------------------------------------

Another observation since: in previous runs with key cache disabled we were not seeing any errors. However I've since found some invalid counter shard errors that are occurring during normal compaction. 

{code}
ERROR [CompactionExecutor:6] 2012-10-19 15:43:50,920 org.apache.cassandra.db.context.CounterContext invalid counter shard detected; (15b843e0-ff7c-11e0-0000-07f4b18563ff, 1, 1) and (15b843e0-ff7c-11e0-0000-07f4b18563ff, 1, 2) differ only
 in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
{code}

So to be clear, this particular scenario is:
* C* 1.1.6 with key cache disabled. 
* Load test ran earlier against this same setup; but no upgradesstables during that run; no errors under load during that test run.
* Later, some nightly jobs ran that read from Super CF counters, write to other CFs.
* Compaction activity occurs later after load test and nightly jobs complete. Invalid counter shard errors are seen for some CFs. Gleaning from the log output order, the affected CF's:
** *Did* have upgradesstables run upon them in previous configurations (1.1.6, key cache on)
** Have not been written to at all for the purpose of the load test I've been mentioning.
** Have been read from for these nightly jobs.








                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Peter Schuller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451240#comment-13451240 ] 

Peter Schuller commented on CASSANDRA-4417:
-------------------------------------------

I am not 100% certain, but I am fairly certain, that we've seen this on nodes that haven't done any streaming whatsoever.

With respect to duplicates: It's certainly not *supposed* to happen. A given counter shard, from a given node id, and with a given clock, should only ever be produced exactly once by exactly one node. Obviously the bug isn't supposed to happen to begin with, so that doesn't mean the bug isn't related to streaming.

Hmmm.

Do you have a lot of writes normally? Is it possible that the correlation with streaming is because of the fact that it initiates significant amounts of compaction?
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Comment Edited] (CASSANDRA-4417) invalid counter shard detected

Posted by "Omid Aladini (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13453537#comment-13453537 ] 

Omid Aladini edited comment on CASSANDRA-4417 at 9/12/12 10:18 AM:
-------------------------------------------------------------------

{quote}
A simple "workaround" is to use batch commit log, but that has a potentially important performance impact.
{quote}

I'm a bit confused why batch commit would solve the problem. If cassandra crashes before the batch is fsynced, the counter mutations in the batch which it was the leader for will still be lost although they might have been applied on other replicas. The difference would be that the mutations won't be acknowledged to the client, and since counters aren't idempotent, the client won't know weather to retry or not. Am I missing something?
                
      was (Author: omid):
    {quote}
A simple "workaround" is to use batch commit log, but that has a potentially important performance impact.
{quote}

I'm a bit confused why batch commit would solve the problem. If cassandra crashes before the batch is fsynced, the counter mutations which it was the leader for will still be lost although they might have been applied on other replicas. The difference would be that the mutations won't be acknowledged to the client, and since counters aren't idempotent, the client won't know weather to retry or not. Am I missing something?
                  
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481539#comment-13481539 ] 

Sylvain Lebresne commented on CASSANDRA-4417:
---------------------------------------------

bq. Could this be related to: CASSANDRA-4071?

I missed that earlier on, but yes, Bartłomiej is correct, CASSANDRA-4071 will totally trigger 'invalid counter shard detected' messages. As described in the ticket, if you don't use RF=1, this shouldn't actually create data loss, but it would still trigger the log until things get compacted away.
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Chris Herron (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481729#comment-13481729 ] 

Chris Herron commented on CASSANDRA-4417:
-----------------------------------------

bq. Quick question: do you always increment by the same value by any chance?

No, sorry just happened to pick that example. We have many other log entries where both values are higher and don't differ by 1.
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482370#comment-13482370 ] 

Sylvain Lebresne commented on CASSANDRA-4417:
---------------------------------------------

bq. No, sorry just happened to pick that example.

That's ok, thanks nonetheless.

I'm really starting to think that CASSANDRA-4071 is likely the main cause for this and is very easy to reproduce in that case. The commit log we've discussed earlier can also trigger that error, but it's probably much harder to trigger. The good news is that as explained on the ticket, CASSANDRA-4071 won't corrupt counters unless you are at RF=1 (in which case that's not a good news). The bad one is I'm not really sure how to fix it. 
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Ivan Sobolev (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486872#comment-13486872 ] 

Ivan Sobolev commented on CASSANDRA-4417:
-----------------------------------------

{quote}
[~slebresne]
Quick question: do you always increment by the same value by any chance? {quote}

Attached a log has not only +1 increments(though, think not any log would help you there :) )

We run 1.1.5, no upgradesstables, most probably unclean shutdown too.
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>         Attachments: err.txt
>
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Comment Edited] (CASSANDRA-4417) invalid counter shard detected

Posted by "Mck SembWever (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492245#comment-13492245 ] 

Mck SembWever edited comment on CASSANDRA-4417 at 11/7/12 10:20 AM:
--------------------------------------------------------------------

Sylvain, here's log from one node. For most of the log we were running 1.0.8. And then at line 2883399 we upgraded (and this was the first node to upgrade) to 1.1.6.

The error msg comes every few seconds.
Our counters are sub-columns inside supercolumns.
We completed the upgrade on all nodes. Then restarted again (because jna was missing).

We are now running upgradesstables but that's not in this logfile. The error msgs still appear.

An operational problem we've had recently is that we had one node down for ~one month (faulty raid controller) and when we finally brought the node back into the cluster nightly repairs would never finish. In the end we just disabled nightly repairs (we don't have tombstones) with the plan that an upgrade and upgradesstables would bring us back to a state where repairs would work again. I have no idea if this can be related. 
                
      was (Author: michaelsembwever):
    Sylvain, here's log from one node. For most of the log we were running 1.0.8. And then at line 2883399 we upgraded (and this was the first node to upgrade) to 1.1.6.

The error msg comes every few seconds.
Our counters are sub-columns inside supercolumns.
We completed the upgrade on all nodes. Then restarted again (because jna was missing).

We are now running upgradesstables but that's not in this logfile. The error msgs still appear.

An operational problem we're had recently is that we had one node down for ~one month (faulty raid controller) and when we finally brought the node back into the cluster nightly repairs would never finish. In the end we just disabled nightly repairs (we don't have tombstones) with the plan that an upgrade and upgradesstables would bring us back to a state where repairs would work again. I have no idea if this can be related. 
                  
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>         Attachments: cassandra-mck.log.bz2, err.txt
>
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13453123#comment-13453123 ] 

Jonathan Ellis commented on CASSANDRA-4417:
-------------------------------------------

You would do more fsyncs, but only the CFs in actual batch mode would have to block for them.  Periodic mode just queues the CL op and moves on to memtable append immediately.
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Michael Theroux (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13421432#comment-13421432 ] 

Michael Theroux commented on CASSANDRA-4417:
--------------------------------------------

I just hit this myself on 1.1.2 on two nodes of a six node cluster.  The cluster has been stable for a couple of weeks.

If it makes any difference, we recently enabled row-caching.

...
INFO [AntiEntropyStage:1] 2012-07-24 11:05:55,537 AntiEntropyService.java (line 206) [repair #b9355020-d57e-11e1-0000-7c4549350fdf] Received merkle tree for caches from /10.29.214.111
ERROR [CompactionExecutor:183] 2012-07-24 11:05:58,532 CounterContext.java (line 381) invalid counter shard detected; (6be74ab0-6cc6-11e1-0000-242d50cf1fd7, 1, -1) and (6be74ab0-6cc6-11e1-0000-242d50cf1fd7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
ERROR [CompactionExecutor:183] 2012-07-24 11:05:58,533 CounterContext.java (line 381) invalid counter shard detected; (6be74ab0-6cc6-11e1-0000-242d50cf1fd7, 1, 1) and (6be74ab0-6cc6-11e1-0000-242d50cf1fd7, 1, -1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
ERROR [CompactionExecutor:183] 2012-07-24 11:05:58,534 CounterContext.java (line 381) invalid counter shard detected; (6be74ab0-6cc6-11e1-0000-242d50cf1fd7, 1, -1) and (6be74ab0-6cc6-11e1-0000-242d50cf1fd7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
ERROR [CompactionExecutor:183] 2012-07-24 11:05:58,534 CounterContext.java (line 381) invalid counter shard detected; (6be74ab0-6cc6-11e1-0000-242d50cf1fd7, 1, 1) and (6be74ab0-6cc6-11e1-0000-242d50cf1fd7, 1, -1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
ERROR [CompactionExecutor:183] 2012-07-24 11:05:58,534 CounterContext.java (line 381) invalid counter shard detected; (6be74ab0-6cc6-11e1-0000-242d50cf1fd7, 1, -1) and (6be74ab0-6cc6-11e1-0000-242d50cf1fd7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
ERROR [CompactionExecutor:183] 2012-07-24 11:05:58,535 CounterContext.java (line 381) invalid counter shard detected; (6be74ab0-6cc6-11e1-0000-242d50cf1fd7, 1, 1) and (6be74ab0-6cc6-11e1-0000-242d50cf1fd7, 1, -1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
 INFO [AntiEntropyStage:1] 2012-07-24 11:06:05,541 AntiEntropyService.java (line 206) [repair #b9355020-d57e-11e1-0000-7c4549350fdf] Received merkle tree for caches from /10.144.15.6
...
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Chris Herron (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478505#comment-13478505 ] 

Chris Herron commented on CASSANDRA-4417:
-----------------------------------------

bq. Probably, how was it deleted/recreated. Did you drop and recreate?

Yes, dropped (the schema migration flavor) and recreated a CF of the same name.

bq. Perform the same test without the upgradesstables part (i.e. only the writes and reads). If so, does that change something?

Have already tested that scenario. Running this load test without the concurrent upgradesstables compaction activity, the problem does not exhibit.

bq. during that test, is there anything involving streaming going on (a repair, a node bootstrapping/moving/decommissioning)?

Not that I know of. I can test again and monitor for streaming activity to see.

By the way, as we've been testing in preparation for a 1.1.x upgrade, we were seeing symptoms of CASSANDRA-4571, CASSANDRA-4687 as well as this issue on C* 1.1.6. In investigating CASSANDRA-4687 we disabled key cache, repeated the load+upgradesstables test and these invalid counter shard warnings did not appear.
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Michael Theroux (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13421440#comment-13421440 ] 

Michael Theroux commented on CASSANDRA-4417:
--------------------------------------------

Ignore that comment about row-caching.  I see these errors in the log dating back the the 11th of July (long before we enabled rowcaching)
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CASSANDRA-4417) invalid counter shard detected

Posted by "Mck SembWever (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mck SembWever updated CASSANDRA-4417:
-------------------------------------

    Attachment: cassandra-mck.log.bz2

Sylvain, here's log from one node. For most of the log we were running 1.0.8. And then at line 2883399 we upgraded (and this was the first node to upgrade) to 1.1.6.

The error msg comes every few seconds.
Our counters are sub-columns inside supercolumns.
We completed the upgrade on all nodes. Then restarted again (because jna was missing).

We are now running upgradesstables but that's not in this logfile. The error msgs still appear.

An operational problem we're had recently is that we had one node down for ~one month (faulty raid controller) and when we finally brought the node back into the cluster nightly repairs would never finish. In the end we just disabled nightly repairs (we don't have tombstones) with the plan that an upgrade and upgradesstables would bring us back to a state where repairs would work again. I have no idea if this can be related. 
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>         Attachments: cassandra-mck.log.bz2, err.txt
>
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Chris Herron (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13479561#comment-13479561 ] 

Chris Herron commented on CASSANDRA-4417:
-----------------------------------------

bq. Also, during that test, is there anything involving streaming going on (a repair, a node bootstrapping/moving/decommissioning)?

There are definitely no repairs or node bootstrapping/moving/decommissioning happening during the test.
Re-ran the test and the JMX stats for StreamStage indicated zero tasks on all nodes after the test completed.
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13453058#comment-13453058 ] 

Sylvain Lebresne commented on CASSANDRA-4417:
---------------------------------------------

That's a very good point. Counters do rely on the fact that nodes do not lose the increments they are "leader" for (or that they don't reuse the same nodeId if they do), but unless the commit log uses batch mode, this can happen. And that will lead to exactly the exception seen here, so I'd say there's a very good chance this is the problem.

I'll note that if that is indeed a problem, it's very possible that the error logged happens only much later (after the "unclean" shutdown) and on some other node than the one having died. So not being able to correlate the error to an unclean shutdown doesn't really indicate that it's not related.

The consequence of this happening is that the increments that have been lost with un-synced commit log are lost. Meaning that with the default configuration, one could lose up to 10 seconds of the increments (for which the dying node is leader). However, I think it is also possible to have results from read to miss slightly more than that, though that last part should fix itself if the counter is incremented again.

As for the error message logged, it's possible that lots of them are logged even though only a small number of counters are affected since it's print during column reconciliation and thus could be logged many time for the same counter.

A simple "workaround" is to use batch commit log, but that has a potentially important performance impact.

Another solution I've though of would be to try to detect unclean shutdown (by marking something during clean shutdown and checking for that at startup) and if we detect one, to renew the nodeId. The problem with that is that this potentially mean renewing the nodeId pretty often. And each time we do that, the internal representation of counter grow and I'm really afraid it will be a problem in that case. And while we have some mechanism to shrink back counter by merging sub-counts when the nodeId is renewed too often, that mechanism assumes that the node owning the nodeId has the more up-to-date value for this sub-count, which is exactly the problem here. So overall I don't have any good idea to fix this. Other ideas?

                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477670#comment-13477670 ] 

Sylvain Lebresne commented on CASSANDRA-4417:
---------------------------------------------

bq. After reading Sylvain's comments above, tried running the same test with commitlog_sync: batch - we get a similar volume of the same errors

Just to clarify, using batch commit log should only avoid the initial problem to reproduce (assuming the analysis of the problem is correct of course). However, contrarily to what the error message pretends, the existing "invalid counter shards" don't "heal" themselves as soon a the message is logged. In fact, the message is logged each time we merge counter columns that have conflicting shards and when that merge is triggered by a compaction, it will indeed "heal" the shard. But we also merge each time we read for instance. In other words, even if batch commit log fixes the problem, one will need to compact everything/wait for everything to be compacted to have all logged messages disappear. Unless you've been able to reproduce on a brand new cluster where the commit log was set to batch from the beginning (in which case, if you have an easy way to reproduce, that would be interesting to know).

                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Ed Solovey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13501249#comment-13501249 ] 

Ed Solovey commented on CASSANDRA-4417:
---------------------------------------

We are on 1.1.6 and are seeing this on a three node cluster with replication factor of 2.  Is there a workaround for this?  Corrupted counters are a showstopper for us and we'll have to move off Cassandra if we can't resolve this.
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>         Attachments: cassandra-mck.log.bz2, err.txt
>
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481526#comment-13481526 ] 

Sylvain Lebresne commented on CASSANDRA-4417:
---------------------------------------------

Quick question: do you always increment by the same value by any chance? I'm asking because the last log you've pasted indicates the conflicting information found correspond to 1 increment only, and in the first case the value is 1, on the second the value is 2. If you always increment by say 1, that would tell us which one is wrong (I'm not yet sure which conclusion I would draw from that but more info can't hurt :)).
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Chris Herron (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477943#comment-13477943 ] 

Chris Herron commented on CASSANDRA-4417:
-----------------------------------------

bq. Unless you've been able to reproduce on a brand new cluster where the commit log was set to batch from the beginning (in which case, if you have an easy way to reproduce, that would be interesting to know)

In our test the affected Super CF is completely deleted and recreated - so in that sense the commit log was set to batch from the beginning. Is that equivalent?

This does reproduce for every test run. Unfortunately our test is non-trivial to share. It involves heavy writes and moderate reads to counters, while simultaneously running upgradesstables on all nodes upon multiple CF's (including the affected one). Interestingly, the symptom does appear even before compaction reaches the Super CF that's active during the test.
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13480069#comment-13480069 ] 

Sylvain Lebresne commented on CASSANDRA-4417:
---------------------------------------------

Ok. The fact that you only reproduce when using upgradesstables is definitively interesting. I'll check if I can see something causing that in upgradesstables. I'll keep you posted.
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Chris Herron (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482552#comment-13482552 ] 

Chris Herron commented on CASSANDRA-4417:
-----------------------------------------

bq. I'm really starting to think that CASSANDRA-4071 is likely the main cause for this and is very easy to reproduce in that case. The commit log we've discussed earlier can also trigger that error, but it's probably much harder to trigger.

In our case:
* We haven't made any topology changes
* Our test drops and recreates the affected CFs. No nodes die during the test (w.r.t. unclean shutdown and commit log)
* After previous load test runs under different configuration (see below), no nodes die, and we use nodetool drain before restarting with updated configs.

Note my earlier comment above I said:

bq. In investigating CASSANDRA-4687 we disabled key cache, repeated the load+upgradesstables test and these invalid counter shard warnings did not appear.

Given that we don't have a topology change, can you think of a scenario where a commitlog issue is still contributing?


                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Charles Brophy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450766#comment-13450766 ] 

Charles Brophy commented on CASSANDRA-4417:
-------------------------------------------

We have a six node cluster with even key range balance, random partitioner, and with relication factor=2. I get these errors immediately following running "nodetool repair" but ONLY if a streaming repair happens as a result. We are serving live updates to our counters from our clickstream. My guess is that the sstable being streamed between the servers winds up becoming "out of date" for the duration of the streaming process and ends up containing these "duplicates" that are vetted during the subsequent compaction. In any case, for us it is 100% reproducible via: nodetool repair -> streaming repair -> subsequent compaction. Let me know if you need more details. Hope this helps!
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Bartłomiej Romański (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457504#comment-13457504 ] 

Bartłomiej Romański commented on CASSANDRA-4417:
------------------------------------------------

Is it possible to predict how dangerous this bug could be? We are already experiencing very serious problems with CASSANDRA-4639. Our counter values suddenly became a few times higher than expected. As you can imagine this is a disaster from the business point of view. We are already seriously thinking about going back to SQL databases :/ I wonder how (if) this bug (and possible other counter related bugs) can affect us. We rely heavily on counters.

Can this bug possibly lead to incorrect counter values? Temporarily or permanently - will running repair fix it? 

How incorrect counter values could be? Loosing a couple increments immediately preceding a node failure is probably acceptable in most cases. Is it possible to loose more increments? Or end up in completely incorrect counter values as in CASSANDRA-4639?

What would exactly happen after hitting this bug. Running repair should fix it? The "self-healing" mechanism would actually make counter consistent again? Or we will get this error messages over and over?

Sorry for writing a comment full of questions, but I've got very limited knowledge of cassandra internals. I'll be very thankful if someone could refer to the questions above.
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Comment Edited] (CASSANDRA-4417) invalid counter shard detected

Posted by "Charles Brophy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450766#comment-13450766 ] 

Charles Brophy edited comment on CASSANDRA-4417 at 9/8/12 3:35 AM:
-------------------------------------------------------------------

We have a six node cluster with even key range balance, random partitioner, and with replication factor=2. I get these errors immediately following running "nodetool repair" but ONLY if a streaming repair happens as a result. We are serving live updates to our counters from our clickstream. My guess is that the sstable being streamed between the servers winds up becoming "out of date" for the duration of the streaming process and ends up containing these "duplicates" that are vetted during the subsequent compaction. In any case, for us it is 100% reproducible via: nodetool repair -> streaming repair -> subsequent compaction. Let me know if you need more details. Hope this helps!
                
      was (Author: charlesb_zulily):
    We have a six node cluster with even key range balance, random partitioner, and with relication factor=2. I get these errors immediately following running "nodetool repair" but ONLY if a streaming repair happens as a result. We are serving live updates to our counters from our clickstream. My guess is that the sstable being streamed between the servers winds up becoming "out of date" for the duration of the streaming process and ends up containing these "duplicates" that are vetted during the subsequent compaction. In any case, for us it is 100% reproducible via: nodetool repair -> streaming repair -> subsequent compaction. Let me know if you need more details. Hope this helps!
                  
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13453117#comment-13453117 ] 

Brandon Williams commented on CASSANDRA-4417:
---------------------------------------------

Under multiple-CF concurrency, wouldn't you effectively end up with batch mode?
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Peter Schuller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13453798#comment-13453798 ] 

Peter Schuller commented on CASSANDRA-4417:
-------------------------------------------

@Sylvain I know it wouldn't be correlated with the *same* node; I was referring to uncontrolled shutdowns in general in the cluster.

@Omid: Presumably the premise was that the mutation goes through the commit log on the leader prior to replication. I'm not sure if this is the case, but if it is, then it should work.

@jbellis FWIW, our counter use-cases are such that going commit log synch is probably not feasable due to very high write throughput. Doesn't mean other people's use-cases are the same, and of course I *fully* support the idea of being correct by default (as opposed to performant by default).

@Sylvain again: I agree about refreshing nodeid on every unclean restart being potentially dangerous. Counters are already huge due to the size of counter shards, and refreshing nodeids in any situation which might result in en-masse refreshment can definitely be dangerous both from a CPU usage perspective as well as a disk space one.
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Bartłomiej Romański (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457511#comment-13457511 ] 

Bartłomiej Romański commented on CASSANDRA-4417:
------------------------------------------------

In the previous comment I wanted to point directly to CASSANDRA-4436 - I've mixed up numbers.

One more thing: could hinted-handoff be possible somehow related to this issue? We've got a problem with it (CASSANDRA-4673) which was discovered in (more or less) in the same time that our counters problems. Is there a possibility that sending hinted-handoff a few times ends up with incrementing counters a few time?

                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13453105#comment-13453105 ] 

Jonathan Ellis commented on CASSANDRA-4417:
-------------------------------------------

Maybe it's time to make commitlog mode (off/periodic/batch) per-CF instead of mix of global and per-KS.  Then we could automatically force counter CF to batch.
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-4417) invalid counter shard detected

Posted by "Ivan Sobolev (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Sobolev updated CASSANDRA-4417:
------------------------------------

    Attachment: err.txt

We have examples where increment is not always 1.
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>         Attachments: err.txt
>
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Peter Schuller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13407747#comment-13407747 ] 

Peter Schuller commented on CASSANDRA-4417:
-------------------------------------------

You are triggering a self-healing mechanism introduced in CASSANDRA-3641. I can confirm we're seeing it on 1.1 still too, even on sstables that haven't been upgraded from older versions. I don't think anyone knows exactly why it's happening.

The condition being checked for is never supposed to happen. Prior to the CASSANDRA-3641 the result would be counters with values that never converge even with read-repair; post-CASSANDRA-3641 the values converge. But depending on the root cause, it's unclear whether there is a danger of incorrect counter values.

                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13453132#comment-13453132 ] 

Sylvain Lebresne commented on CASSANDRA-4417:
---------------------------------------------

That's an option. Though not an exactly short term one (I suspect mixing periodic and batch cf on the same commit log might require a bit of care; unless you were thinking of having multiple commit logs, but I'm not sure that would be a good thing).

But hey, I don't have a much better solution so far, so looking at that option is definitively worth it (since it's generally useful anyway).
                
> invalid counter shard detected 
> -------------------------------
>
>                 Key: CASSANDRA-4417
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>         Environment: Amazon Linux
>            Reporter: Senthilvel Rangaswamy
>
> Seeing errors like these:
> 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 13) and (17bfd850-ac52-11e1-0000-6ecd0b5b61e7, 1, 1) differ only in count; will pick highest to self-heal; this indicates a bug or corruption generated a bad counter shard
> What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira