Posted to user@cassandra.apache.org by André Cruz <an...@co.sapo.pt> on 2014/11/14 17:36:15 UTC

Repair completes successfully but data is still inconsistent

Hello.

So, I have detected a data inconsistency between my nodes:

(Consistency level is ONE)
[disco@Disco] get NamespaceFile2['45fc8996-41bc-429b-a382-5da9294eb59c:/XXXDIRXXX']['XXXFILEXXX'];
Value was not found
Elapsed time: 48 msec(s).
[disco@Disco] get NamespaceFile2['45fc8996-41bc-429b-a382-5da9294eb59c:/XXXDIRXXX']['XXXFILEXXX'];
=> (name=XXXFILEXXX, value=a96a7f54-49d4-11e4-8185-e0db55018fa4, timestamp=1412213845215797)
Elapsed time: 7.45 msec(s).
[disco@Disco]
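
For reference, the QUORUM read mentioned below was done in the same cassandra-cli session, roughly like this (the CONSISTENCYLEVEL statement is shown from memory, so treat the exact syntax as illustrative):

[disco@Disco] consistencylevel as QUORUM;
[disco@Disco] get NamespaceFile2['45fc8996-41bc-429b-a382-5da9294eb59c:/XXXDIRXXX']['XXXFILEXXX'];
=> (name=XXXFILEXXX, value=a96a7f54-49d4-11e4-8185-e0db55018fa4, timestamp=1412213845215797)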

I have a replication factor of 3, and if I read with a consistency level of QUORUM, the result converges to the column being present. The strange thing is that I deleted this column on 2014-11-01 03:39:25, and my writes use QUORUM. My gc_grace_period is 10 days; if repairs had not been running during that window this could be explained, but repairs have in fact been running every day with no sign of problems. First of all, some data:

Cassandra version: 1.2.19 (we upgraded from 1.2.16 on 2014-10-22)
NamespaceFile2 compaction strategy: LeveledCompactionStrategy
CFStats:
	Read Count: 3376397
	Read Latency: 0.24978254956392865 ms.
	Write Count: 21254817
	Write Latency: 0.04713691710448507 ms.
	Pending Tasks: 0
		Column Family: NamespaceFile2
		SSTable count: 28
		SSTables in each level: [1, 10, 17, 0, 0, 0, 0, 0, 0]
		Space used (live): 4961751613
		Space used (total): 5021967044
		SSTable Compression Ratio: 0.5773464014899713
		Number of Keys (estimate): 9369856
		Memtable Columns Count: 82901
		Memtable Data Size: 19352636
		Memtable Switch Count: 283
		Read Count: 3376397
		Read Latency: NaN ms.
		Write Count: 21254817
		Write Latency: 0.049 ms.
		Pending Tasks: 0
		Bloom Filter False Positives: 21904
		Bloom Filter False Ratio: 0.00000
		Bloom Filter Space Used: 7405728
		Compacted row minimum size: 61
		Compacted row maximum size: 74975550
		Compacted row mean size: 795
		Average live cells per slice (last five minutes): 1025.0
		Average tombstones per slice (last five minutes): 0.0

This particular row has 1.1M columns, 70934390 bytes, and the offending key encoded by the partitioner is 001045fc899641bc429ba3825da9294eb59c0000072f646f7267656d00.
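
For reference, the per-SSTable JSON dumps further down come from sstable2json; restricted to just this row, the invocation is roughly the following (a sketch, assuming -k takes the hex-encoded key exactly as printed above):

$ sstable2json /servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5748-Data.db \
    -k 001045fc899641bc429ba3825da9294eb59c0000072f646f7267656d00 > Disco-NamespaceFile2-ic-5748-Data.db.json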

First I checked which nodes were responsible for this particular row:

$ nodetool -h XXX.XXX.XXX.112 getendpoints Disco NamespaceFile2 '45fc8996-41bc-429b-a382-5da9294eb59c:/XXXDIRXXX'
XXX.XXX.XXX.17
XXX.XXX.XXX.14
XXX.XXX.XXX.18


This is the repair log of these particular nodes:

----------XXX.XXX.XXX.14----------
INFO [AntiEntropySessions:366] 2014-11-14 04:34:55,740 AntiEntropyService.java (line 651) [repair #956244b0-6bb7-11e4-8eec-f5962c02982e] new session: will sync /XXX.XXX.XXX.14, /XXX.XXX.XXX.12, /XXX.XXX.XXX.18 on range (14178431955039102644307275309657008810,28356863910078205288614550619314017621] for Disco.[NamespaceFile2]
...
INFO [AntiEntropyStage:1] 2014-11-14 04:36:51,125 AntiEntropyService.java (line 764) [repair #956244b0-6bb7-11e4-8eec-f5962c02982e] NamespaceFile2 is fully synced
INFO [AntiEntropySessions:366] 2014-11-14 04:36:51,126 AntiEntropyService.java (line 698) [repair #956244b0-6bb7-11e4-8eec-f5962c02982e] session completed successfully
...
INFO [AntiEntropySessions:367] 2014-11-14 04:36:51,103 AntiEntropyService.java (line 651) [repair #da2543e0-6bb7-11e4-8eec-f5962c02982e] new session: will sync /XXX.XXX.XXX.14, /XXX.XXX.XXX.11, /XXX.XXX.XXX.17 on range (155962751505430129087380028406227096910,0] for Disco.[NamespaceFile2]
...
INFO [AntiEntropyStage:1] 2014-11-14 04:38:20,949 AntiEntropyService.java (line 764) [repair #da2543e0-6bb7-11e4-8eec-f5962c02982e] NamespaceFile2 is fully synced
INFO [AntiEntropySessions:367] 2014-11-14 04:38:20,949 AntiEntropyService.java (line 698) [repair #da2543e0-6bb7-11e4-8eec-f5962c02982e] session completed successfully
...
INFO [AntiEntropySessions:368] 2014-11-14 04:38:20,930 AntiEntropyService.java (line 651) [repair #0fafc710-6bb8-11e4-8eec-f5962c02982e] new session: will sync /XXX.XXX.XXX.14, /XXX.XXX.XXX.17, /XXX.XXX.XXX.18 on range (0,14178431955039102644307275309657008810] for Disco.[NamespaceFile2]
...
INFO [AntiEntropyStage:1] 2014-11-14 04:40:34,237 AntiEntropyService.java (line 764) [repair #0fafc710-6bb8-11e4-8eec-f5962c02982e] NamespaceFile2 is fully synced
INFO [AntiEntropySessions:368] 2014-11-14 04:40:34,237 AntiEntropyService.java (line 698) [repair #0fafc710-6bb8-11e4-8eec-f5962c02982e] session completed successfully


----------XXX.XXX.XXX.17----------
INFO [AntiEntropySessions:362] 2014-11-13 04:33:01,974 AntiEntropyService.java (line 651) [repair #27293450-6aee-11e4-aabc-9b69569c95c3] new session: will sync /XXX.XXX.XXX.17, /XXX.XXX.XXX.11, /XXX.XXX.XXX.14 on range (155962751505430129087380028406227096910,0] for Disco.[NamespaceFile2]
...
INFO [AntiEntropyStage:1] 2014-11-13 04:34:35,810 AntiEntropyService.java (line 764) [repair #27293450-6aee-11e4-aabc-9b69569c95c3] NamespaceFile2 is fully synced
INFO [AntiEntropySessions:362] 2014-11-13 04:34:35,810 AntiEntropyService.java (line 698) [repair #27293450-6aee-11e4-aabc-9b69569c95c3] session completed successfully
...
INFO [AntiEntropySessions:363] 2014-11-13 04:34:35,790 AntiEntropyService.java (line 651) [repair #5f1463d0-6aee-11e4-aabc-9b69569c95c3] new session: will sync /XXX.XXX.XXX.17, /XXX.XXX.XXX.11, /XXX.XXX.XXX.112 on range (141784319550391026443072753096570088106,155962751505430129087380028406227096910] for Disco.[NamespaceFile2]
 ...
INFO [AntiEntropyStage:1] 2014-11-13 04:36:44,042 AntiEntropyService.java (line 764) [repair #5f1463d0-6aee-11e4-aabc-9b69569c95c3] NamespaceFile2 is fully synced
INFO [AntiEntropySessions:363] 2014-11-13 04:36:44,042 AntiEntropyService.java (line 698) [repair #5f1463d0-6aee-11e4-aabc-9b69569c95c3] session completed successfully
...
INFO [AntiEntropySessions:364] 2014-11-13 04:36:43,996 AntiEntropyService.java (line 651) [repair #ab7f12b0-6aee-11e4-aabc-9b69569c95c3] new session: will sync /XXX.XXX.XXX.17, /XXX.XXX.XXX.14, /XXX.XXX.XXX.18 on range (0,14178431955039102644307275309657008810] for Disco.[NamespaceFile2]
 ...
INFO [Thread-209997] 2014-11-13 04:38:55,122 AntiEntropyService.java (line 764) [repair #ab7f12b0-6aee-11e4-aabc-9b69569c95c3] NamespaceFile2 is fully synced
INFO [AntiEntropySessions:364] 2014-11-13 04:38:55,122 AntiEntropyService.java (line 698) [repair #ab7f12b0-6aee-11e4-aabc-9b69569c95c3] session completed successfully


----------XXX.XXX.XXX.18----------
INFO [AntiEntropySessions:364] 2014-11-12 04:49:00,962 AntiEntropyService.java (line 651) [repair #38594200-6a27-11e4-b912-e1a9f145c444] new session: will sync /XXX.XXX.XXX.18, /XXX.XXX.XXX.12, /XXX.XXX.XXX.19 on range (28356863910078205288614550619314017621,42535295865117307932921825928971026430] for Disco.[NamespaceFile2]
...
INFO [AntiEntropyStage:1] 2014-11-12 04:50:50,095 AntiEntropyService.java (line 764) [repair #38594200-6a27-11e4-b912-e1a9f145c444] NamespaceFile2 is fully synced
INFO [AntiEntropySessions:364] 2014-11-12 04:50:50,095 AntiEntropyService.java (line 698) [repair #38594200-6a27-11e4-b912-e1a9f145c444] session completed successfully
...
 INFO [AntiEntropySessions:365] 2014-11-12 04:50:50,040 AntiEntropyService.java (line 651) [repair #795d3d60-6a27-11e4-b912-e1a9f145c444] new session: will sync /XXX.XXX.XXX.18, /XXX.XXX.XXX.12, /XXX.XXX.XXX.14 on range (14178431955039102644307275309657008810,28356863910078205288614550619314017621] for Disco.[NamespaceFile2]
...
INFO [AntiEntropyStage:1] 2014-11-12 04:52:31,657 AntiEntropyService.java (line 764) [repair #795d3d60-6a27-11e4-b912-e1a9f145c444] NamespaceFile2 is fully synced
INFO [AntiEntropySessions:365] 2014-11-12 04:52:31,657 AntiEntropyService.java (line 698) [repair #795d3d60-6a27-11e4-b912-e1a9f145c444] session completed successfully
...
INFO [AntiEntropySessions:366] 2014-11-12 04:52:31,627 AntiEntropyService.java (line 651) [repair #b5ea2f90-6a27-11e4-b912-e1a9f145c444] new session: will sync /XXX.XXX.XXX.18, /XXX.XXX.XXX.14, /XXX.XXX.XXX.17 on range (0,14178431955039102644307275309657008810] for Disco.[NamespaceFile2]
...
INFO [Thread-202021] 2014-11-12 04:54:36,914 AntiEntropyService.java (line 764) [repair #b5ea2f90-6a27-11e4-b912-e1a9f145c444] NamespaceFile2 is fully synced
INFO [AntiEntropySessions:366] 2014-11-12 04:54:36,915 AntiEntropyService.java (line 698) [repair #b5ea2f90-6a27-11e4-b912-e1a9f145c444] session completed successfully


So repairs have been running. I dumped the sstables to JSON and this is what I found:

----------XXX.XXX.XXX.14----------
Disco-NamespaceFile2-ic-4710-Data.db.json - Oct 29 04:48
      [
        "XXXFILEXXX",
        "a96a7f54-49d4-11e4-8185-e0db55018fa4",
        1412213845215797
      ],

Disco-NamespaceFile2-ic-5144-Data.db.json - Nov 12 04:58
NOT FOUND

Disco-NamespaceFile2-ic-5201-Data.db.json
NOT FOUND


----------XXX.XXX.XXX.17----------
Disco-NamespaceFile2-ic-5055-Data.db.json - Oct 30 04:36
      [
        "XXXFILEXXX",
        "a96a7f54-49d4-11e4-8185-e0db55018fa4",
        1412213845215797
      ],

Disco-NamespaceFile2-ic-5472-Data.db.json
NOT FOUND

Disco-NamespaceFile2-ic-5532-Data.db.json
NOT FOUND


----------XXX.XXX.XXX.18----------
Disco-NamespaceFile2-ic-5337-Data.db.json - Nov  2 04:55
      [
        "XXXFILEXXX",
        "545455ed",
        1414813165914138,
        "d"
      ],

Disco-NamespaceFile2-ic-5719-Data.db.json - Nov 13 04:43
NOT FOUND

Disco-NamespaceFile2-ic-5748-Data.db.json
      [
        "XXXFILEXXX",
        "a96a7f54-49d4-11e4-8185-e0db55018fa4",
        1412213845215797
      ],


So it seems both XXX.XXX.XXX.14 and XXX.XXX.XXX.17 had the column, and in the more recent sstables it is completely gone; presumably the tombstone has already been reclaimed. On XXX.XXX.XXX.18, however, we can still see the tombstone in the first sstable, the column is gone in the second, and then the undeleted value magically reappears in the last one. Since repairs have been running, is there another possible explanation for this? On another note, why haven't repairs propagated this zombie column to the other nodes?

Any help or pointers where to go next would be appreciated.

Best regards,
André Cruz


Re: Repair completes successfully but data is still inconsistent

Posted by Robert Coli <rc...@eventbrite.com>.
On Thu, Nov 27, 2014 at 2:38 AM, André Cruz <an...@co.sapo.pt> wrote:

> On 26 Nov 2014, at 19:07, Robert Coli <rc...@eventbrite.com> wrote:
> >
> > Yes. Do you know if 5748 was created as a result of compaction or via a
> flush from a memtable?
>
> It was the result of a compaction:
>

Ok, so in theory if you had the input SSTables to that compaction, you
could use them to diagnose this bug, by seeing what values they had for
that row before compaction. This is why something like tablesnap can be
useful in practice...
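
Even a plain periodic snapshot would preserve those inputs, since snapshots are just hard links to the then-live SSTables; a minimal sketch, assuming the usual nodetool snapshot flags:

$ nodetool -h XXX.XXX.XXX.18 snapshot Disco -t daily-$(date +%F)

Old generations that later get compacted away would then remain reachable under the snapshots directory.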

=Rob

Re: Repair completes successfully but data is still inconsistent

Posted by André Cruz <an...@co.sapo.pt>.
On 26 Nov 2014, at 19:07, Robert Coli <rc...@eventbrite.com> wrote:
> 
> Yes. Do you know if 5748 was created as a result of compaction or via a flush from a memtable?

It was the result of a compaction:

 INFO [CompactionExecutor:22422] 2014-11-13 13:08:41,926 CompactionTask.java (line 262) Compacted 2 sstables to [/servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5744,].  10,732,071 bytes to 10,824,798 (~100% of original) in 2,950ms = 3.499435MB/s.  36,632 total rows, 35,545 unique.  Row merge counts were {1:34460, 2:1086, }
 INFO [CompactionExecutor:22528] 2014-11-13 14:55:09,944 CompactionTask.java (line 262) Compacted 2 sstables to [/servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5746,].  15,790,121 bytes to 14,913,599 (~94% of original) in 3,491ms = 4.074110MB/s.  46,167 total rows, 43,745 unique.  Row merge counts were {1:41323, 2:2422, }
 INFO [CompactionExecutor:22590] 2014-11-13 15:26:50,087 CompactionTask.java (line 262) Compacted 2 sstables to [/servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5748,].  16,088,983 bytes to 16,110,649 (~100% of original) in 2,741ms = 5.605367MB/s.  48,332 total rows, 47,392 unique.  Row merge counts were {1:46452, 2:940, } <---------------------------
 INFO [CompactionExecutor:22718] 2014-11-13 18:05:36,326 CompactionTask.java (line 262) Compacted 2 sstables to [/servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5750,].  21,508,530 bytes to 21,342,786 (~99% of original) in 3,461ms = 5.880979MB/s.  66,361 total rows, 63,219 unique.  Row merge counts were {1:60077, 2:3142, }
 INFO [CompactionExecutor:22817] 2014-11-13 19:06:04,564 CompactionTask.java (line 262) Compacted 2 sstables to [/servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5752,].  23,232,660 bytes to 23,087,822 (~99% of original) in 3,144ms = 7.003264MB/s.  68,968 total rows, 67,602 unique.  Row merge counts were {1:66236, 2:1366, }


Regarding flushes I have:

INFO [FlushWriter:3079] 2014-11-13 13:08:38,972 Memtable.java (line 436) Completed flushing /servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5743-Data.db (4698473 bytes) for commitlog position ReplayPosition(segmentId=1413900469571, position=6240)
 INFO [FlushWriter:3093] 2014-11-13 14:55:06,436 Memtable.java (line 436) Completed flushing /servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5745-Data.db (4965323 bytes) for commitlog position ReplayPosition(segmentId=1413900469603, position=2518)
 INFO [FlushWriter:3101] 2014-11-13 15:26:47,336 Memtable.java (line 436) Completed flushing /servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5747-Data.db (1175384 bytes) for commitlog position ReplayPosition(segmentId=1413900469635, position=9984)
 INFO [FlushWriter:3121] 2014-11-13 18:05:32,853 Memtable.java (line 436) Completed flushing /servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5749-Data.db (5397881 bytes) for commitlog position ReplayPosition(segmentId=1413900469667, position=8533)
 INFO [FlushWriter:3134] 2014-11-13 19:06:01,416 Memtable.java (line 436) Completed flushing /servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5751-Data.db (1889874 bytes) for commitlog position ReplayPosition(segmentId=1413900469699, position=108)
 INFO [FlushWriter:3147] 2014-11-13 21:20:58,312 Memtable.java (line 436) Completed flushing /servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-5753-Data.db (3283519 bytes) for commitlog position ReplayPosition(segmentId=1413900469731, position=6848)


André

Re: Repair completes successfully but data is still inconsistent

Posted by Robert Coli <rc...@eventbrite.com>.
On Wed, Nov 26, 2014 at 10:17 AM, André Cruz <an...@co.sapo.pt> wrote:

> Of these, the row in question was present on:
> Disco-NamespaceFile2-ic-5337-Data.db - tombstone column
> Disco-NamespaceFile2-ic-5719-Data.db - no trace of that column
> Disco-NamespaceFile2-ic-5748-Data.db - live column with original timestamp
>
> Is this the information you requested?
>

Yes. Do you know if 5748 was created as a result of compaction or via a
flush from a memtable?

=Rob

Re: Repair completes successfully but data is still inconsistent

Posted by André Cruz <an...@co.sapo.pt>.
On 24 Nov 2014, at 18:54, Robert Coli <rc...@eventbrite.com> wrote:
> 
> But for any given value on any given node, you can verify the value it has in 100% of SSTables... that's what both the normal read path and repair should do when reconciling row fragments into the materialized row. It's hard to understand a case where repair fails here, so I would provide that set of SSTables, attached to an Apache JIRA.

These were the sstables present on node 3 on the 13th:

-rw-r--r--  1 andre  staff  167794724 Nov 13 17:57 Disco-NamespaceFile2-ic-232-Data.db
-rw-r--r--  1 andre  staff  167809485 Nov 13 17:58 Disco-NamespaceFile2-ic-3608-Data.db
-rw-r--r--  1 andre  staff  167800404 Nov 13 17:58 Disco-NamespaceFile2-ic-3609-Data.db
-rw-r--r--  1 andre  staff   59773136 Nov 13 17:58 Disco-NamespaceFile2-ic-3610-Data.db
-rw-r--r--  1 andre  staff  167804631 Nov 13 17:59 Disco-NamespaceFile2-ic-4022-Data.db
-rw-r--r--  1 andre  staff  167795369 Nov 13 17:59 Disco-NamespaceFile2-ic-4023-Data.db
-rw-r--r--  1 andre  staff   30930871 Nov 13 18:00 Disco-NamespaceFile2-ic-4024-Data.db
-rw-r--r--  1 andre  staff  167806410 Nov 13 18:00 Disco-NamespaceFile2-ic-5334-Data.db
-rw-r--r--  1 andre  staff  167789187 Nov 13 18:01 Disco-NamespaceFile2-ic-5336-Data.db
-rw-r--r--  1 andre  staff  167786564 Nov 13 18:02 Disco-NamespaceFile2-ic-5337-Data.db
-rw-r--r--  1 andre  staff  167794911 Nov 13 18:02 Disco-NamespaceFile2-ic-5338-Data.db
-rw-r--r--  1 andre  staff  167813170 Nov 13 18:03 Disco-NamespaceFile2-ic-5339-Data.db
-rw-r--r--  1 andre  staff   64246179 Nov 13 18:03 Disco-NamespaceFile2-ic-5340-Data.db
-rw-r--r--  1 andre  staff  167794716 Nov 13 18:04 Disco-NamespaceFile2-ic-5719-Data.db
-rw-r--r--  1 andre  staff  167782343 Nov 13 18:04 Disco-NamespaceFile2-ic-5721-Data.db
-rw-r--r--  1 andre  staff  167789502 Nov 13 18:05 Disco-NamespaceFile2-ic-5722-Data.db
-rw-r--r--  1 andre  staff  167800931 Nov 13 18:05 Disco-NamespaceFile2-ic-5723-Data.db
-rw-r--r--  1 andre  staff  167810143 Nov 13 18:06 Disco-NamespaceFile2-ic-5724-Data.db
-rw-r--r--  1 andre  staff  167800713 Nov 13 18:07 Disco-NamespaceFile2-ic-5725-Data.db
-rw-r--r--  1 andre  staff  167812647 Nov 13 18:07 Disco-NamespaceFile2-ic-5726-Data.db
-rw-r--r--  1 andre  staff  167782180 Nov 13 18:08 Disco-NamespaceFile2-ic-5727-Data.db
-rw-r--r--  1 andre  staff  167797488 Nov 13 18:09 Disco-NamespaceFile2-ic-5728-Data.db
-rw-r--r--  1 andre  staff   12950043 Nov 13 18:09 Disco-NamespaceFile2-ic-5729-Data.db
-rw-r--r--  1 andre  staff  167798810 Nov 13 18:09 Disco-NamespaceFile2-ic-5730-Data.db
-rw-r--r--  1 andre  staff  167805918 Nov 13 18:10 Disco-NamespaceFile2-ic-5731-Data.db
-rw-r--r--  1 andre  staff  167805189 Nov 13 18:10 Disco-NamespaceFile2-ic-5732-Data.db
-rw-r--r--  1 andre  staff   12563937 Nov 13 18:10 Disco-NamespaceFile2-ic-5733-Data.db
-rw-r--r--  1 andre  staff   16110649 Nov 13 18:11 Disco-NamespaceFile2-ic-5748-Data.db


Of these, the row in question was present on:
Disco-NamespaceFile2-ic-5337-Data.db - tombstone column
Disco-NamespaceFile2-ic-5719-Data.db - no trace of that column
Disco-NamespaceFile2-ic-5748-Data.db - live column with original timestamp

Is this the information you requested?

Best regards,
André

Re: Repair completes successfully but data is still inconsistent

Posted by Robert Coli <rc...@eventbrite.com>.
On Mon, Nov 24, 2014 at 10:39 AM, André Cruz <an...@co.sapo.pt> wrote:

> This data does not use TTLs. What other reason could there be for a mask?
> If I connect using cassandra-cli to that specific node, which becomes the
> coordinator, is it guaranteed to not ask another node when CL is ONE and it
> contains that row?
>

Other than rare cases of writing "doomstones" (deletes with timestamps in
the future), I'm not sure in what case this might occur.

But for any given value on any given node, you can verify the value it has
in 100% of SSTables... that's what both the normal read path and repair
should do when reconciling row fragments into the materialized row. It's
hard to understand a case where repair fails here, so I would provide that
set of SSTables, attached to an Apache JIRA.
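
A quick way to do that verification is to dump just this row from every data file and see which copies contain it; a rough sketch, reusing the hex key and data path quoted earlier in the thread:

$ for f in /servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-*-Data.db; do
      echo "== $f"
      sstable2json "$f" -k 001045fc899641bc429ba3825da9294eb59c0000072f646f7267656d00 | grep XXXFILEXXX
  done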

=Rob

Re: Repair completes successfully but data is still inconsistent

Posted by André Cruz <an...@co.sapo.pt>.
On 21 Nov 2014, at 19:01, Robert Coli <rc...@eventbrite.com> wrote:
> 
> 2- Why won’t repair propagate this column value to the other nodes? Repairs have run every day and the value is still missing on the other nodes.
> 
> No idea. Are you sure it's not expired via TTL or masked in some other way? When you ask that node for it at CL.ONE, do you get this value?

This data does not use TTLs. What other reason could there be for a mask? If I connect using cassandra-cli to that specific node, which becomes the coordinator, is it guaranteed not to ask another node when the CL is ONE and it contains that row?
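
For what it’s worth, the check I do is to point cassandra-cli straight at that node and set the consistency level before reading, roughly (cli options from memory):

$ cassandra-cli -h XXX.XXX.XXX.18 -p 9160 -k Disco
consistencylevel as ONE;
get NamespaceFile2['45fc8996-41bc-429b-a382-5da9294eb59c:/XXXDIRXXX']['XXXFILEXXX'];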

>  
> Cassandra has never stored data consistently except by fortunate accident.”

I wish I had read that a few years back. :)

Thank you,
André

Re: Repair completes successfully but data is still inconsistent

Posted by Robert Coli <rc...@eventbrite.com>.
On Fri, Nov 21, 2014 at 3:11 AM, André Cruz <an...@co.sapo.pt> wrote:

> Can it be that they were all in the middle of a compaction (Leveled
> compaction) and the new sstables were written but the old ones were not
> deleted? Will Cassandra blindly pick up old and new sstables when it
> restarts?
>

Yes.

https://issues.apache.org/jira/browse/CASSANDRA-6756

Also note linked tickets there, like:

https://issues.apache.org/jira/browse/CASSANDRA-6503

6503 is fix version 1.2.14 ...


> 1- What is the correct sequence of commands to bring down a node safely? I
> know that “drain” was used here, because it is in the log. I’ve read
> somewhere that drain should not be used, and that “disablethrift”,
> “disablegossip”, “flush” and then waiting a while is the correct way.
>

"drain" is the canonical answer, but "drain" has historically not always
worked. In general, when it hasn't worked, it has flushed properly but not
marked the commitlog clean properly.

https://issues.apache.org/jira/browse/CASSANDRA-4446

https://issues.apache.org/jira/browse/CASSANDRA-5911

5911 is "too late for 1.2" and will not be merged there.
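
For the record, the conservative sequence people describe is roughly this (a sketch, not gospel):

$ nodetool -h <node> disablethrift   # stop serving client requests
$ nodetool -h <node> disablegossip   # peers mark the node down and stop routing to it
$ nodetool -h <node> drain           # flush memtables; wait for DRAINED in the log
# then stop the Cassandra process

But as the tickets above show, even that has not always left things in a clean state on 1.2.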


> 2- Why won’t repair propagate this column value to the other nodes?
> Repairs have run every day and the value is still missing on the other nodes.
>

No idea. Are you sure it's not expired via TTL or masked in some other way?
When you ask that node for it at CL.ONE, do you get this value?


> 3- If I have these rogue sstables loaded this seems like a time bomb. Down
> the line I will again delete columns that will reappear after some time. Is
> there a way I can find these sstables that should not be there? I thought
> the timestamp of the file would help but this zombie column is present on
> one of the latest sstables.
>

Unfortunately, not really. You'd need the patch from CASSANDRA-6568, and
you'd need to have been continuously capturing your live SSTable set to
determine when obsolete SSTables re-entered it.

https://issues.apache.org/jira/browse/CASSANDRA-6568
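
Capturing the live set doesn't have to be fancy; even a cron entry that records a timestamped listing of the data directory gives you something to diff later. A sketch along these lines (path taken from earlier in the thread):

$ ls /servers/storage/cassandra-data/Disco/NamespaceFile2/*-Data.db > /var/tmp/NamespaceFile2-sstables-$(date +%Y%m%d%H%M).txt

It only approximates the truly live set, but diffing successive listings shows when generations appear or disappear unexpectedly.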

The experience you're having is, AFAICT, irrecoverably fatal to
consistency. This is why, at the summit this year, when Apple announced it
had encountered and fixed like 5 bugs of this type, I summarized their talk
in chats as "the talk where Apple showed that, despite what you've heard
about quorum reads and writes, Cassandra has never stored data consistently
except by fortunate accident."

=Rob
http://twitter.com/rcolidba

Re: Repair completes successfully but data is still inconsistent

Posted by André Cruz <an...@co.sapo.pt>.
On 19 Nov 2014, at 19:53, Robert Coli <rc...@eventbrite.com> wrote:
> 
> My hunch is that you originally triggered this by picking up some obsolete SSTables during the 1.2 era. Probably if you clean up the existing zombies you will not encounter them again, unless you encounter another "obsolete sstables marked live" bug. I agree that your compaction exceptions in the 1.2 era are likely implicated.

I’m still in the “1.2 era”; I upgraded from 1.2.16 to 1.2.19. You mentioned that obsolete sstables may have been picked up, and each node logged 3-9 exceptions like this as it was brought down prior to being upgraded:

ERROR [CompactionExecutor:15173] 2014-10-21 15:04:01,923 CassandraDaemon.java (line 191) Exception in thread Thread[CompactionExecutor:15173,1,main]
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@1dad30c0 rejected from org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor@555b9c78[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 14052]
       at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(Unknown Source)
       at java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source)
       at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(Unknown Source)
       at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(Unknown Source)
       at java.util.concurrent.ScheduledThreadPoolExecutor.submit(Unknown Source)
       at org.apache.cassandra.io.sstable.SSTableDeletingTask.schedule(SSTableDeletingTask.java:65)

Can it be that they were all in the middle of a compaction (Leveled compaction) and the new sstables were written but the old ones were not deleted? Will Cassandra blindly pick up old and new sstables when it restarts?

If so, I have a few questions:

1- What is the correct sequence of commands to bring down a node safely? I know that “drain” was used here, because it is in the log. I’ve read somewhere that drain should not be used, and that “disablethrift”, “disablegossip”, “flush” and then waiting a while is the correct way.

2- Why won’t repair propagate this column value to the other nodes? Repairs have run every day and the value is still missing on the other nodes.

3- If I have these rogue sstables loaded this seems like a time bomb. Down the line I will again delete columns that will reappear after some time. Is there a way I can find these sstables that should not be there? I thought the timestamp of the file would help but this zombie column is present on one of the latest sstables.

Thanks,
André

Re: Repair completes successfully but data is still inconsistent

Posted by Robert Coli <rc...@eventbrite.com>.
On Wed, Nov 19, 2014 at 5:18 AM, André Cruz <an...@co.sapo.pt> wrote:

> Each node has 4-9 of these exceptions as it is going down after being
> drained. It seems Cassandra was trying to delete an sstable. Can this be
> related?
>

That seems plausible, though the fact that the files you indicate now hold
those versions of the data suggests that the file was eventually deleted
successfully.

My hunch is that you originally triggered this by picking up some obsolete
SSTables during the 1.2 era. Probably if you clean up the existing zombies
you will not encounter them again, unless you encounter another "obsolete
sstables marked live" bug. I agree that your compaction exceptions in the
1.2 era are likely implicated.

=Rob

Re: Repair completes successfully but data is still inconsistent

Posted by André Cruz <an...@co.sapo.pt>.
On 19 Nov 2014, at 11:37, André Cruz <an...@co.sapo.pt> wrote:
> 
> All the nodes were restarted on 21-23 October, for the upgrade (1.2.16 -> 1.2.19) I mentioned. The delete happened after. I should also point out that we were experiencing problems related to CASSANDRA-4206 and CASSANDRA-7808.

Another possible cause is this kind of exception, which I found in the log as the nodes were shut down and brought back up with the new version:

INFO [RMI TCP Connection(270364)-10.134.101.18] 2014-10-21 15:04:00,867 StorageService.java (line 939) DRAINED
ERROR [CompactionExecutor:15173] 2014-10-21 15:04:01,923 CassandraDaemon.java (line 191) Exception in thread Thread[CompactionExecutor:15173,1,main]
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@1dad30c0 rejected from org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor@555b9c78[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 14052]
        at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source)
        at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(Unknown Source)
        at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(Unknown Source)
        at java.util.concurrent.ScheduledThreadPoolExecutor.submit(Unknown Source)
        at org.apache.cassandra.io.sstable.SSTableDeletingTask.schedule(SSTableDeletingTask.java:65)
        at org.apache.cassandra.io.sstable.SSTableReader.releaseReference(SSTableReader.java:976)
        at org.apache.cassandra.db.DataTracker.removeOldSSTablesSize(DataTracker.java:370)
        at org.apache.cassandra.db.DataTracker.postReplace(DataTracker.java:335)
        at org.apache.cassandra.db.DataTracker.replace(DataTracker.java:329)
        at org.apache.cassandra.db.DataTracker.replaceCompactedSSTables(DataTracker.java:232)
        at org.apache.cassandra.db.ColumnFamilyStore.replaceCompactedSSTables(ColumnFamilyStore.java:995)
        at org.apache.cassandra.db.compaction.CompactionTask.replaceCompactedSSTables(CompactionTask.java:270)
        at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:230)
        at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
        at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:58)
        at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:60)
        at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:208)
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)


Each node has 4-9 of these exceptions as it is going down after being drained. It seems Cassandra was trying to delete an sstable. Can this be related?

Best regards,
André Cruz

Re: Repair completes successfully but data is still inconsistent

Posted by André Cruz <an...@co.sapo.pt>.
On 19 Nov 2014, at 00:43, Robert Coli <rc...@eventbrite.com> wrote:
> 
> @OP : can you repro if you run a major compaction between the deletion and the tombstone collection?

This happened in production and, AFAIK, for the first time in a system that has been running for 2 years. We upgraded the Cassandra version last month, so there’s that difference, but the upgrade happened before the original delete of this column.

I have found more examples of zombie columns like this (approx. 30k columns out of a 1.2M total), and they are all in this same row of this CF. I should point out that we have a sister CF where we do similar inserts/deletes, but it uses STCS and it doesn’t exhibit this problem.

I don’t think I can reproduce this easily in a test environment.

> 
> Basically, I am conjecturing that a compaction bug or one of the handful of "unmask previously deleted data" bugs is resulting in the unmasking of a non-tombstone row which is sitting in an SSTable.
> 
> OP could also support this conjecture by running sstablekeys on other SSTables on "3rd replica" and determining what masked values there are for the row prior to deletion. If the data is sitting in an old SStable, this is suggestive.

There are 3 sstables that have this row on the 3rd replica:

Disco-NamespaceFile2-ic-5337-Data.db.json - Has the column tombstone
Disco-NamespaceFile2-ic-5719-Data.db.json - Has no value for this column
Disco-NamespaceFile2-ic-5748-Data.db.json - Has the original value

> 
> One last question for OP would be whether the nodes were restarted during the time period this bug was observed. An assortment of the "unmask previously deleted data" bugs come from "dead" sstables in the data directory being marked "live" on a restart.

All the nodes were restarted on 21-23 October, for the upgrade (1.2.16 -> 1.2.19) I mentioned. The delete happened after. I should also point out that we were experiencing problems related to CASSANDRA-4206 and CASSANDRA-7808.

ERROR 15:01:51,885 Exception in thread Thread[CompactionExecutor:15172,1,main]
java.lang.AssertionError: originally calculated column size of 78041151 but now it is 78041303
        at org.apache.cassandra.db.compaction.LazilyCompactedRow.write(LazilyCompactedRow.java:135)
        at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:160)
        at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:162)
        at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
        at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:58)
        at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:60)
        at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:208)
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)

We saw that line multiple times in the logs, always for the same row (the 78041151 and 78041303 values were always the same), even though the data seemed fine. Could that row be the one experiencing problems now? Maybe with the upgrade the new Cassandra correctly compacted this row and all hell broke loose?

If so, is there an easy way to fix this? Shouldn’t repair also propagate this zombie column to the other nodes?

Thank you and best regards,
André Cruz



Re: Repair completes successfully but data is still inconsistent

Posted by Robert Coli <rc...@eventbrite.com>.
On Tue, Nov 18, 2014 at 12:46 PM, Michael Shuler <mi...@pbandjelly.org>
wrote:

> `nodetool cleanup` also looks interesting as an option.


I don't understand why cleanup or scrub would help with a case where data
is being un-tombstoned.

"
1 November - column is deleted - gc_grace_period is 10 days
8 November - all 3 replicas have tombstone
13/14 November - column/tombstone is gone on 2 replicas, 3rd replica has
the original value (!), with the original timestamp…
"

To me the smoking gun here is that the 3rd replica has the original value.

@OP : can you repro if you run a major compaction between the deletion and
the tombstone collection?

Basically, I am conjecturing that a compaction bug or one of the handful of
"unmask previously deleted data" bugs is resulting in the unmasking of a
non-tombstone row which is sitting in an SSTable.

OP could also support this conjecture by running sstablekeys on other
SSTables on "3rd replica" and determining what masked values there are for
the row prior to deletion. If the data is sitting in an old SStable, this
is suggestive.
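
sstablekeys prints each data file's keys in hex, so locating every SSTable that still contains the row is a one-liner; a sketch, reusing the hex key and path quoted earlier:

$ for f in /servers/storage/cassandra-data/Disco/NamespaceFile2/Disco-NamespaceFile2-ic-*-Data.db; do
      sstablekeys "$f" | grep -qi 001045fc899641bc429ba3825da9294eb59c0000072f646f7267656d00 && echo "$f"
  done

Running sstable2json -k on the files it prints then shows whether each copy is the live column or the tombstone.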

One last question for OP would be whether the nodes were restarted during
the time period this bug was observed. An assortment of the "unmask
previously deleted data" bugs come from "dead" sstables in the data
directory being marked "live" on a restart.

=Rob
http://twitter.com/rcolidba

Re: Repair completes successfully but data is still inconsistent

Posted by Michael Shuler <mi...@pbandjelly.org>.
`nodetool cleanup` also looks interesting as an option.

-- 
Michael


Re: Repair completes successfully but data is still inconsistent

Posted by André Cruz <an...@co.sapo.pt>.
On 18 Nov 2014, at 01:08, Michael Shuler <mi...@pbandjelly.org> wrote:
> 
> André, does `nodetool gossipinfo` show all the nodes in schema agreement?
> 

Yes:

$ nodetool -h XXX.XXX.XXX.XXX gossipinfo |grep -i schema
  SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
  SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
  SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
  SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
  SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
  SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
  SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
  SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
  SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
  SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
  SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e
  SCHEMA:8ef63726-c845-3565-9851-91c0074a9b5e


Best regards,
André

Re: Repair completes successfully but data is still inconsistent

Posted by Michael Shuler <mi...@pbandjelly.org>.
On 11/17/2014 05:22 AM, André Cruz wrote:
> I have checked the logs of the 3 replicas for that period and nothing
> really jumps out. Still, repairs have been running daily, the log
> reports that the CF is synced, and as of this moment one of the
> replicas still returns the zombie column so they don’t agree on the
> data.

I'm at a bit of a loss. Readers, is `nodetool scrub` going to be helpful 
here, or any other suggestions of things to look for?

André, does `nodetool gossipinfo` show all the nodes in schema agreement?

-- 
Michael

Re: Repair completes successfully but data is still inconsistent

Posted by André Cruz <an...@co.sapo.pt>.
On 14 Nov 2014, at 18:44, André Cruz <an...@co.sapo.pt> wrote:
> 
> On 14 Nov 2014, at 18:29, Michael Shuler <mi...@pbandjelly.org> wrote:
>> 
>> On 11/14/2014 12:12 PM, André Cruz wrote:
>>> Some extra info. I checked the backups and on the 8th of November, all 3 replicas had the tombstone of the deleted column. So:
>>> 
>>> 1 November - column is deleted - gc_grace_period is 10 days
>>> 8 November - all 3 replicas have tombstone
>>> 13/14 November - column/tombstone is gone on 2 replicas, 3rd replica has the original value (!), with the original timestamp…
>> 
>> After seeing your first post, this is helpful info. I'm curious what the logs show between the 8th-13th, particularly around the 10th-11th :)
> 
> Which logs in particular, just the ones from the 3rd machine which has the zombie column? What should I be looking for? :)

I have checked the logs of the 3 replicas for that period and nothing really jumps out. Still, repairs have been running daily, the log reports that the CF is synced, and as of this moment one of the replicas still returns the zombie column so they don’t agree on the data.

André

Re: Repair completes successfully but data is still inconsistent

Posted by André Cruz <an...@co.sapo.pt>.
On 14 Nov 2014, at 18:29, Michael Shuler <mi...@pbandjelly.org> wrote:
> 
> On 11/14/2014 12:12 PM, André Cruz wrote:
>> Some extra info. I checked the backups and on the 8th of November, all 3 replicas had the tombstone of the deleted column. So:
>> 
>> 1 November - column is deleted - gc_grace_period is 10 days
>> 8 November - all 3 replicas have tombstone
>> 13/14 November - column/tombstone is gone on 2 replicas, 3rd replica has the original value (!), with the original timestamp…
> 
> After seeing your first post, this is helpful info. I'm curious what the logs show between the 8th-13th, particularly around the 10th-11th :)

Which logs in particular, just the ones from the 3rd machine which has the zombie column? What should I be looking for? :)

Thanks for the help,
André

Re: Repair completes successfully but data is still inconsistent

Posted by Michael Shuler <mi...@pbandjelly.org>.
On 11/14/2014 12:12 PM, André Cruz wrote:
> Some extra info. I checked the backups and on the 8th of November, all 3 replicas had the tombstone of the deleted column. So:
>
> 1 November - column is deleted - gc_grace_period is 10 days
> 8 November - all 3 replicas have tombstone
> 13/14 November - column/tombstone is gone on 2 replicas, 3rd replica has the original value (!), with the original timestamp…

After seeing your first post, this is helpful info. I'm curious what the 
logs show between the 8th-13th, particularly around the 10th-11th :)

> Is there a logical explanation for this behaviour?

Not yet!

-- 
Kind regards,
Michael

Re: Repair completes successfully but data is still inconsistent

Posted by André Cruz <an...@co.sapo.pt>.
Some extra info. I checked the backups and on the 8th of November, all 3 replicas had the tombstone of the deleted column. So:

1 November - column is deleted - gc_grace_period is 10 days
8 November - all 3 replicas have tombstone
13/14 November - column/tombstone is gone on 2 replicas, 3rd replica has the original value (!), with the original timestamp…

Is there a logical explanation for this behaviour?

Thank you,
André Cruz