Posted to user@cassandra.apache.org by Oskar Kjellin <os...@gmail.com> on 2016/06/21 15:47:58 UTC

Cluster not working after upgrade from 2.1.12 to 3.5.0

Hi,

We've done this upgrade in both dev and stage before and we did not see
similar issues.
After upgrading production today we have a lot of issues, though.

The main issue is that the DataStax client quite often does not get the
data back (even though it's the same query). I see similar flakiness by
simply running cqlsh: although it does return something, it returns broken
data.

We are running a 3 node cluster with RF 3.

I have this table

CREATE TABLE keyspace.table (
    a text,
    b text,
    c text,
    d list<text>,
    e text,
    f timestamp,
    g list<text>,
    h timestamp,
    PRIMARY KEY (a, b, c)
)


Every other time I query (not exactly every other time, it's random) I get:


SELECT * from table where a = 'xxx' and b = 'xxx'

 a   | b   | c   | d    | e    | f                               | g       | h
-----+-----+-----+------+------+---------------------------------+---------+---------------------------------
 xxx | xxx | ccc | null | null | 2089-11-30 23:00:00.000000+0000 | ['fff'] | 2014-12-31 23:00:00.000000+0000
 xxx | xxx | ddd | null | null | 2099-01-01 00:00:00.000000+0000 | ['fff'] | 2016-06-17 13:29:36.000000+0000


Which is the expected output.


But I also get:

 a   | b   | c   | d    | e    | f                               | g       | h
-----+-----+-----+------+------+---------------------------------+---------+---------------------------------
 xxx | xxx | ccc | null | null |                            null |    null |                            null
 xxx | xxx | ccc | null | null | 2089-11-30 23:00:00.000000+0000 | ['fff'] |                            null
 xxx | xxx | ccc | null | null |                            null |    null | 2014-12-31 23:00:00.000000+0000
 xxx | xxx | ddd | null | null |                            null |    null |                            null
 xxx | xxx | ddd | null | null | 2099-01-01 00:00:00.000000+0000 | ['fff'] |                            null
 xxx | xxx | ddd | null | null |                            null |    null | 2016-06-17 13:29:36.000000+0000


Notice that the same primary key is returned 3 times, each time with different
parts of the data. I believe this is what's currently killing our production
environment.

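For reference, here is a minimal sketch of the kind of check that shows the
problem, using the DataStax Python driver (the contact point and the 'xxx'
values are placeholders, not our real client code):

# Rough sketch: run the same query repeatedly and report how often a
# primary key (a, b, c) comes back more than once in a single result set.
from collections import Counter

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])  # placeholder contact point
session = cluster.connect('keyspace')

query = SimpleStatement(
    "SELECT a, b, c FROM table WHERE a = %s AND b = %s",
    consistency_level=ConsistencyLevel.QUORUM)

for attempt in range(10):
    rows = list(session.execute(query, ('xxx', 'xxx')))
    duplicated = {pk: n for pk, n in
                  Counter((r.a, r.b, r.c) for r in rows).items() if n > 1}
    print("attempt %d: %d rows, duplicated keys: %s"
          % (attempt, len(rows), duplicated))

cluster.shutdown()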

I'm running upgradesstables as I write this; it isn't finished yet. I
started a repair before, but nothing happened. upgradesstables has now
finished on 2 out of 3 nodes, but production is still down :/


We also see these in the logs, over and over again:

DEBUG [ReadRepairStage:4] 2016-06-21 15:44:01,119 ReadCallback.java:235 - Digest mismatch:
org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(-1566729966326640413, 336b35356c49537731797a4a5f64627a797236) (b3dcfcbeed6676eae7ff88cc1bd251fb vs 6e7e9225871374d68a7cdb54ae70726d)
    at org.apache.cassandra.service.DigestResolver.resolve(DigestResolver.java:85) ~[apache-cassandra-3.5.0.jar:3.5.0]
    at org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:226) ~[apache-cassandra-3.5.0.jar:3.5.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_72]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_72]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_72]


Any help is much appreciated

Re: Cluster not working after upgrade from 2.1.12 to 3.5.0

Posted by Bryan Cheng <br...@blockcypher.com>.
Hi Oskar,

I know this won't help you as quickly as you would like, but please consider
updating the JIRA issue with details of your environment, as it may help
move the investigation along.

Good luck!

On Tue, Jun 21, 2016 at 12:21 PM, Julien Anguenot <ju...@anguenot.org>
wrote:

> You could try to sstabledump that one corrupted table, write some
> (Python) code that processes the sstabledump output to get rid of the
> duplicates (might not be bulletproof depending on the data, I agree),
> then truncate and re-insert the rows back into that table without duplicates.

Re: Cluster not working after upgrade from 2.1.12 to 3.5.0

Posted by Julien Anguenot <ju...@anguenot.org>.
You could try to sstabledump that one corrupted table, write some
(Python) code that processes the sstabledump output to get rid of the
duplicates (might not be bulletproof depending on the data, I agree),
then truncate and re-insert the rows back into that table without duplicates.

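To sketch the idea (rough and untested against your data; the JSON field
names are what sstabledump prints in 3.x as far as I can tell, so verify
them against your own dump first), something like this would collapse
duplicate rows from one dumped sstable:

# Rough sketch only: merge duplicate rows out of `sstabledump <sstable> > dump.json`.
# Field names ("partition", "key", "rows", "clustering", "cells", "name",
# "value") should be checked against your own dump before trusting this.
import json
import sys
from collections import OrderedDict

def merged_rows(dump_path):
    with open(dump_path) as f:
        partitions = json.load(f)         # sstabledump emits a JSON array of partitions
    merged = OrderedDict()
    for part in partitions:
        pkey = tuple(part["partition"]["key"])
        for row in part.get("rows", []):
            if row.get("type") != "row":  # skip range tombstone markers etc.
                continue
            ckey = tuple(row.get("clustering", []))
            cells = merged.setdefault((pkey, ckey), {})
            for cell in row.get("cells", []):
                value = cell.get("value")
                if value is not None:     # naive "last non-null wins" merge
                    cells[cell["name"]] = value
    return merged

if __name__ == "__main__":
    for (pkey, ckey), cells in merged_rows(sys.argv[1]).items():
        print(pkey, ckey, cells)

From the merged view you could then regenerate the INSERT statements to
reload after truncating; a real version should also compare cell timestamps
instead of blindly letting the last non-null value win.
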
On Tue, Jun 21, 2016 at 11:52 AM, Oskar Kjellin <os...@gmail.com> wrote:
> Hmm, no way we can do that in prod :/



-- 
Julien Anguenot (@anguenot)

Re: Cluster not working after upgrade from 2.1.12 to 3.5.0

Posted by Oskar Kjellin <os...@gmail.com>.
Hmm, no way we can do that in prod :/

Sent from my iPhone

> On 21 juni 2016, at 18:50, Julien Anguenot <ju...@anguenot.org> wrote:
> 
> See my comments on the issue: I had to truncate and reinsert data in
> these corrupted tables.
> 
> AFAIK, there is no evidence that UDTs are responsible for this bad behavior.

Re: Cluster not working after upgrade from 2.1.12 to 3.5.0

Posted by Julien Anguenot <ju...@anguenot.org>.
AFAICT, the issue does not seem to be driver-related, as the duplicates
were showing up both in cqlsh and via the Java driver. In addition,
the sstabledump output contained the actual duplicates (see the Jira
issue).

On Tue, Jun 21, 2016 at 12:04 PM, Oskar Kjellin <os...@gmail.com> wrote:
> Did you see similar issues when querying using a driver? Because we get no results in the driver whatsoever.



-- 
Julien Anguenot (@anguenot)

Re: Cluster not working after upgrade from 2.1.12 to 3.5.0

Posted by Oskar Kjellin <os...@gmail.com>.
Did you see similar issues when querying using a driver? Because we get no results in the driver whatsoever.

Sent from my iPhone

> On 21 juni 2016, at 18:50, Julien Anguenot <ju...@anguenot.org> wrote:
> 
> See my comments on the issue: I had to truncate and reinsert data in
> these corrupted tables.
> 
> AFAIK, there is no evidence that UDTs are responsible for this bad behavior.

Re: Cluster not working after upgrade from 2.1.12 to 3.5.0

Posted by Julien Anguenot <ju...@anguenot.org>.
See my comments on the issue: I had to truncate and reinsert data in
these corrupted tables.

AFAIK, there is no evidence that UDTs are responsible for this bad behavior.

On Tue, Jun 21, 2016 at 11:45 AM, Oskar Kjellin <os...@gmail.com> wrote:
> Yeah, I saw that one. We're not using UDTs in the affected tables, though.
>
> Did you resolve it?



-- 
Julien Anguenot (@anguenot)

Re: Cluster not working after upgrade from 2.1.12 to 3.5.0

Posted by Oskar Kjellin <os...@gmail.com>.
Yeah, I saw that one. We're not using UDTs in the affected tables, though.

Did you resolve it?

Sent from my iPhone

> On 21 juni 2016, at 18:27, Julien Anguenot <ju...@anguenot.org> wrote:
> 
> I have experienced similar duplicate primary key behavior with a couple
> of tables after upgrading from 2.2.x to 3.0.x.
> 
> See comments on the Jira issue I opened at the time over there:
> https://issues.apache.org/jira/browse/CASSANDRA-11887

Re: Cluster not working after upgrade from 2.1.12 to 3.5.0

Posted by Julien Anguenot <ju...@anguenot.org>.
I have experienced similar duplicate primary key behavior with a couple
of tables after upgrading from 2.2.x to 3.0.x.

See comments on the Jira issue I opened at the time over there:
https://issues.apache.org/jira/browse/CASSANDRA-11887





-- 
Julien Anguenot (@anguenot)