You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by Martin Mačura <m....@gmail.com> on 2018/06/05 16:06:00 UTC

Repair slow, "Percent repaired" never updated

Hi,
we're on cassandra 3.11.2, and we're having some issues with repairs.
They take ages to complete, and some time ago the incremental repair
stopped working - that is, SSTables are not being marked as repaired,
even though the repair reports success.

Running a full or incremental repair does not make any difference.

Here's a log of a typical repair (omitted a lot of 'Maximum memory
usage' messages):

INFO  [Repair-Task-12] 2018-06-04 06:29:50,396 RepairRunnable.java:139
- Starting repair command #11 (af1aefc0-67c0-11e8-b07c-c365701888e8),
repairing keyspace prod with repair options (parallelism: parallel,
primary range: false, incremental: true, job threads: 1,
ColumnFamilies: [event], dataCenters: [DC1], hosts: [], # of ranges:
1280, pull repair: false)
INFO  [Repair-Task-12] 2018-06-04 06:29:51,497 RepairSession.java:228
- [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] new session: will
sync /14.0.53.234, /14.0.52.115 on range [...] for asm_log.[event]
INFO  [Repair#11:1] 2018-06-04 06:29:51,776 RepairJob.java:169 -
[repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Requesting merkle trees
for event (to [/14.0.52.115, /14.0.53.234])
INFO  [ValidationExecutor:10] 2018-06-04 06:31:13,859
NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB),
cannot allocate chunk of 1.000MiB
WARN  [PERIODIC-COMMIT-LOG-SYNCER] 2018-06-04 06:32:01,385
NoSpamLogger.java:94 - Out of 14 commit log syncs over the past
134.02s with average duration of 34.90ms, 2 have exceeded the
configured commit interval by an average of 60.66ms
...
INFO  [ValidationExecutor:10] 2018-06-04 13:31:19,011
NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB),
cannot allocate chunk of 1.000MiB
INFO  [AntiEntropyStage:1] 2018-06-04 13:37:17,357
RepairSession.java:180 - [repair
#afc2ef90-67c0-11e8-b07c-c365701888e8] Received merkle tree for event
from /14.0.52.115
INFO  [ValidationExecutor:10] 2018-06-04 13:46:19,281
NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB),
cannot allocate chunk of 1.000MiB
INFO  [IndexSummaryManager:1] 2018-06-04 13:57:18,772
IndexSummaryRedistribution.java:76 - Redistributing index summaries
INFO  [AntiEntropyStage:1] 2018-06-04 13:58:21,971
RepairSession.java:180 - [repair
#afc2ef90-67c0-11e8-b07c-c365701888e8] Received merkle tree for event
from /14.0.53.234
INFO  [RepairJobTask:4] 2018-06-04 13:58:39,780 SyncTask.java:73 -
[repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Endpoints /14.0.52.115
and /14.0.53.234 have 15406 range(s) out of sync for event
INFO  [RepairJobTask:4] 2018-06-04 13:58:39,781 LocalSyncTask.java:71
- [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Performing streaming
repair of 15406 ranges with /14.0.52.115
INFO  [RepairJobTask:4] 2018-06-04 13:59:49,075
StreamResultFuture.java:90 - [Stream
#6244fd50-67ff-11e8-b07c-c365701888e8] Executing streaming plan for
Repair
INFO  [StreamConnectionEstablisher:3] 2018-06-04 13:59:49,076
StreamSession.java:266 - [Stream
#6244fd50-67ff-11e8-b07c-c365701888e8] Starting streaming to
/14.0.52.115
INFO  [StreamConnectionEstablisher:3] 2018-06-04 13:59:49,089
StreamCoordinator.java:264 - [Stream
#6244fd50-67ff-11e8-b07c-c365701888e8, ID#0] Beginning stream session
with /14.0.52.115
INFO  [STREAM-IN-/14.0.52.115:7000] 2018-06-04 14:01:14,423
StreamResultFuture.java:173 - [Stream
#6244fd50-67ff-11e8-b07c-c365701888e8 ID#0] Prepare completed.
Receiving 321 files(6.238GiB), sending 318 files(6.209GiB)
WARN  [Service Thread] 2018-06-04 14:12:15,578 GCInspector.java:282 -
ConcurrentMarkSweep GC in 4095ms.  CMS Old Gen: 4086661264 ->
1107272664; Par Eden Space: 503316480 -> 0; Par Survivor Space:
21541464 -> 0
...
WARN  [GossipTasks:1] 2018-06-04 14:12:15,677 FailureDetector.java:288
- Not marking nodes down due to local pause of 5123793157 > 5000000000
INFO  [ScheduledTasks:1] 2018-06-04 14:12:20,611 NoSpamLogger.java:91
- Some operations were slow, details available at debug level
(debug.log)
INFO  [STREAM-IN-/14.0.52.115:7000] 2018-06-04 14:14:29,188
StreamResultFuture.java:187 - [Stream
#6244fd50-67ff-11e8-b07c-c365701888e8] Session with /14.0.52.115 is
complete
INFO  [STREAM-IN-/14.0.52.115:7000] 2018-06-04 14:14:29,190
StreamResultFuture.java:219 - [Stream
#6244fd50-67ff-11e8-b07c-c365701888e8] All sessions completed
INFO  [STREAM-IN-/14.0.52.115:7000] 2018-06-04 14:14:29,190
LocalSyncTask.java:121 - [repair
#afc2ef90-67c0-11e8-b07c-c365701888e8] Sync complete using session
afc2ef90-67c0-11e8-b07c-c365701888e8 between /14.0.52.115 and
/14.0.53.234 on event
INFO  [RepairJobTask:5] 2018-06-04 14:14:29,191 RepairJob.java:143 -
[repair #afc2ef90-67c0-11e8-b07c-c365701888e8] event is fully synced
INFO  [RepairJobTask:5] 2018-06-04 14:14:29,193 RepairSession.java:270
- [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Session completed
successfully



Tablestats:
               SSTable count: 714
               Space used (live): 489416489322
               Space used (total): 489416489322
               Space used by snapshots (total): 0
               Off heap memory used (total): 311166982
               SSTable Compression Ratio: 0.28468835622230676
               Number of partitions (estimate): 22511321
               Memtable cell count: 6635
               Memtable data size: 3236775
               Memtable off heap memory used: 0
               Memtable switch count: 1649
               ...
               Percent repaired: 85.62
               Bloom filter false positives: 36849
               Bloom filter false ratio: 0.00024
               Bloom filter space used: 87092784
               Bloom filter off heap memory used: 87061568
               Index summary off heap memory used: 15474806
               Compression metadata off heap memory used: 208630608
               Compacted partition minimum bytes: 61
               Compacted partition maximum bytes: 107964792
               Compacted partition mean bytes: 35262
               Average live cells per slice (last five minutes):
301.46281597264624
               Maximum live cells per slice (last five minutes): 3311
               Average tombstones per slice (last five minutes): 1.0
               Maximum tombstones per slice (last five minutes): 1
               Dropped Mutations: 0

Histograms:

Percentile  SSTables     Write Latency      Read Latency    Partition
Size        Cell Count
                             (micros)          (micros)
(bytes)
50%             2.00             24.60           3379.39
3311               310
75%             2.00             29.52           8409.01
51012              5722
95%             2.00             35.43          30130.99
105778              9887
98%             2.00             42.51          43388.63
263210             35425
99%             2.00             42.51          62479.63
263210             35425
Min             0.00              9.89            182.79
 61                 0
Max             3.00            263.21         155469.30
107964792          10090808



Thanks for any help!

Martin

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
For additional commands, e-mail: user-help@cassandra.apache.org


Re: Repair slow, "Percent repaired" never updated

Posted by Martin Mačura <m....@gmail.com>.
P.S.: Here's a corresponding log from the second node:

INFO  [AntiEntropyStage:1] 2018-06-04 13:37:16,409 Validator.java:281
- [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Sending completed
merkle tree to /14.0.53.234 for asm_log.event
INFO  [StreamReceiveTask:30] 2018-06-04 14:14:28,989
StreamResultFuture.java:187 - [Stream
#6244fd50-67ff-11e8-b07c-c365701888e8] Session with /14.0.53.234 is
complete
INFO  [StreamReceiveTask:30] 2018-06-04 14:14:28,990
StreamResultFuture.java:219 - [Stream
#6244fd50-67ff-11e8-b07c-c365701888e8] All sessions completed
INFO  [AntiEntropyStage:1] 2018-06-04 14:14:29,000
ActiveRepairService.java:452 - [repair
#af1aefc0-67c0-11e8-b07c-c365701888e8] Not a global repair, will not
do anticompaction


Why is there no anticompaction if it's an incremental repair?

We have two datacenters currently, this concerns the second one that
we recently brought up (with nodetool rebuild). We cannot do a repair
across datacenters, because nodes in the old DC would run out of disk
space.

Regards,

Martin



On Tue, Jun 5, 2018 at 6:06 PM, Martin Mačura <m....@gmail.com> wrote:
> Hi,
> we're on cassandra 3.11.2, and we're having some issues with repairs.
> They take ages to complete, and some time ago the incremental repair
> stopped working - that is, SSTables are not being marked as repaired,
> even though the repair reports success.
>
> Running a full or incremental repair does not make any difference.
>
> Here's a log of a typical repair (omitted a lot of 'Maximum memory
> usage' messages):
>
> INFO  [Repair-Task-12] 2018-06-04 06:29:50,396 RepairRunnable.java:139
> - Starting repair command #11 (af1aefc0-67c0-11e8-b07c-c365701888e8),
> repairing keyspace prod with repair options (parallelism: parallel,
> primary range: false, incremental: true, job threads: 1,
> ColumnFamilies: [event], dataCenters: [DC1], hosts: [], # of ranges:
> 1280, pull repair: false)
> INFO  [Repair-Task-12] 2018-06-04 06:29:51,497 RepairSession.java:228
> - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] new session: will
> sync /14.0.53.234, /14.0.52.115 on range [...] for asm_log.[event]
> INFO  [Repair#11:1] 2018-06-04 06:29:51,776 RepairJob.java:169 -
> [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Requesting merkle trees
> for event (to [/14.0.52.115, /14.0.53.234])
> INFO  [ValidationExecutor:10] 2018-06-04 06:31:13,859
> NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB),
> cannot allocate chunk of 1.000MiB
> WARN  [PERIODIC-COMMIT-LOG-SYNCER] 2018-06-04 06:32:01,385
> NoSpamLogger.java:94 - Out of 14 commit log syncs over the past
> 134.02s with average duration of 34.90ms, 2 have exceeded the
> configured commit interval by an average of 60.66ms
> ...
> INFO  [ValidationExecutor:10] 2018-06-04 13:31:19,011
> NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB),
> cannot allocate chunk of 1.000MiB
> INFO  [AntiEntropyStage:1] 2018-06-04 13:37:17,357
> RepairSession.java:180 - [repair
> #afc2ef90-67c0-11e8-b07c-c365701888e8] Received merkle tree for event
> from /14.0.52.115
> INFO  [ValidationExecutor:10] 2018-06-04 13:46:19,281
> NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB),
> cannot allocate chunk of 1.000MiB
> INFO  [IndexSummaryManager:1] 2018-06-04 13:57:18,772
> IndexSummaryRedistribution.java:76 - Redistributing index summaries
> INFO  [AntiEntropyStage:1] 2018-06-04 13:58:21,971
> RepairSession.java:180 - [repair
> #afc2ef90-67c0-11e8-b07c-c365701888e8] Received merkle tree for event
> from /14.0.53.234
> INFO  [RepairJobTask:4] 2018-06-04 13:58:39,780 SyncTask.java:73 -
> [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Endpoints /14.0.52.115
> and /14.0.53.234 have 15406 range(s) out of sync for event
> INFO  [RepairJobTask:4] 2018-06-04 13:58:39,781 LocalSyncTask.java:71
> - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Performing streaming
> repair of 15406 ranges with /14.0.52.115
> INFO  [RepairJobTask:4] 2018-06-04 13:59:49,075
> StreamResultFuture.java:90 - [Stream
> #6244fd50-67ff-11e8-b07c-c365701888e8] Executing streaming plan for
> Repair
> INFO  [StreamConnectionEstablisher:3] 2018-06-04 13:59:49,076
> StreamSession.java:266 - [Stream
> #6244fd50-67ff-11e8-b07c-c365701888e8] Starting streaming to
> /14.0.52.115
> INFO  [StreamConnectionEstablisher:3] 2018-06-04 13:59:49,089
> StreamCoordinator.java:264 - [Stream
> #6244fd50-67ff-11e8-b07c-c365701888e8, ID#0] Beginning stream session
> with /14.0.52.115
> INFO  [STREAM-IN-/14.0.52.115:7000] 2018-06-04 14:01:14,423
> StreamResultFuture.java:173 - [Stream
> #6244fd50-67ff-11e8-b07c-c365701888e8 ID#0] Prepare completed.
> Receiving 321 files(6.238GiB), sending 318 files(6.209GiB)
> WARN  [Service Thread] 2018-06-04 14:12:15,578 GCInspector.java:282 -
> ConcurrentMarkSweep GC in 4095ms.  CMS Old Gen: 4086661264 ->
> 1107272664; Par Eden Space: 503316480 -> 0; Par Survivor Space:
> 21541464 -> 0
> ...
> WARN  [GossipTasks:1] 2018-06-04 14:12:15,677 FailureDetector.java:288
> - Not marking nodes down due to local pause of 5123793157 > 5000000000
> INFO  [ScheduledTasks:1] 2018-06-04 14:12:20,611 NoSpamLogger.java:91
> - Some operations were slow, details available at debug level
> (debug.log)
> INFO  [STREAM-IN-/14.0.52.115:7000] 2018-06-04 14:14:29,188
> StreamResultFuture.java:187 - [Stream
> #6244fd50-67ff-11e8-b07c-c365701888e8] Session with /14.0.52.115 is
> complete
> INFO  [STREAM-IN-/14.0.52.115:7000] 2018-06-04 14:14:29,190
> StreamResultFuture.java:219 - [Stream
> #6244fd50-67ff-11e8-b07c-c365701888e8] All sessions completed
> INFO  [STREAM-IN-/14.0.52.115:7000] 2018-06-04 14:14:29,190
> LocalSyncTask.java:121 - [repair
> #afc2ef90-67c0-11e8-b07c-c365701888e8] Sync complete using session
> afc2ef90-67c0-11e8-b07c-c365701888e8 between /14.0.52.115 and
> /14.0.53.234 on event
> INFO  [RepairJobTask:5] 2018-06-04 14:14:29,191 RepairJob.java:143 -
> [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] event is fully synced
> INFO  [RepairJobTask:5] 2018-06-04 14:14:29,193 RepairSession.java:270
> - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Session completed
> successfully
>
>
>
> Tablestats:
>                SSTable count: 714
>                Space used (live): 489416489322
>                Space used (total): 489416489322
>                Space used by snapshots (total): 0
>                Off heap memory used (total): 311166982
>                SSTable Compression Ratio: 0.28468835622230676
>                Number of partitions (estimate): 22511321
>                Memtable cell count: 6635
>                Memtable data size: 3236775
>                Memtable off heap memory used: 0
>                Memtable switch count: 1649
>                ...
>                Percent repaired: 85.62
>                Bloom filter false positives: 36849
>                Bloom filter false ratio: 0.00024
>                Bloom filter space used: 87092784
>                Bloom filter off heap memory used: 87061568
>                Index summary off heap memory used: 15474806
>                Compression metadata off heap memory used: 208630608
>                Compacted partition minimum bytes: 61
>                Compacted partition maximum bytes: 107964792
>                Compacted partition mean bytes: 35262
>                Average live cells per slice (last five minutes):
> 301.46281597264624
>                Maximum live cells per slice (last five minutes): 3311
>                Average tombstones per slice (last five minutes): 1.0
>                Maximum tombstones per slice (last five minutes): 1
>                Dropped Mutations: 0
>
> Histograms:
>
> Percentile  SSTables     Write Latency      Read Latency    Partition
> Size        Cell Count
>                              (micros)          (micros)
> (bytes)
> 50%             2.00             24.60           3379.39
> 3311               310
> 75%             2.00             29.52           8409.01
> 51012              5722
> 95%             2.00             35.43          30130.99
> 105778              9887
> 98%             2.00             42.51          43388.63
> 263210             35425
> 99%             2.00             42.51          62479.63
> 263210             35425
> Min             0.00              9.89            182.79
>  61                 0
> Max             3.00            263.21         155469.30
> 107964792          10090808
>
>
>
> Thanks for any help!
>
> Martin

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
For additional commands, e-mail: user-help@cassandra.apache.org