Posted to user@cassandra.apache.org by Laxmikant Upadhyay <la...@gmail.com> on 2018/09/04 16:31:09 UTC

High IO and poor read performance on 3.11.2 cassandra cluster

We have a 3-node Cassandra cluster (3.11.2) in a single DC.

We have written 450 million records to a table using LCS. The write
latency is fine. After the write we perform read and update operations.

When we run read+update operations on 1 million newly inserted records (on
top of the 450 million) the read latency and I/O usage are under control.
However, when we perform read+update on 1 million old records that are part
of the 450 million, we observe high read latency (performance drops by about
4x compared to the first case). We have not observed major GC pauses.

*system information:*
*cpu cores:* 24
*disk type:* SSD; we are using RAID with the deadline scheduler
*disk space:*
df -h :
Filesystem                      Size  Used Avail Use% Mounted on
/dev/sdb1                       1.9T  393G  1.5T  22% /var/lib/cassandra
*memory:*
free -g
              total        used        free      shared  buff/cache   available
Mem:             62          30           0           0          32          31
Swap:             8           0           8

==========================================

*schema*

desc table ks.xyz;

CREATE TABLE ks.xyz (
    key text,
    column1 text,
    value text,
    PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE
    AND CLUSTERING ORDER BY (column1 ASC)
    AND bloom_filter_fp_chance = 0.1
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';
==============================================================================
Below are some system stats snippets captured while the read operations were running:

*iotop -o* : Observation : the disk read goes up to 5.5 G/s

Total DISK READ :       *3.86 G/s* | Total DISK WRITE :    1252.88 K/s
Actual DISK READ:       *3.92 G/s* | Actual DISK WRITE:       0.00 B/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
10715 be/4 cassandr  375.89 M/s   99.79 K/s  0.00 % 29.15 % java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
10714 be/4 cassandr  358.56 M/s  107.18 K/s  0.00 % 27.06 % java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
10712 be/4 cassandr  351.86 M/s  147.83 K/s  0.00 % 25.02 % java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
10718 be/4 cassandr  359.82 M/s  110.87 K/s  0.00 % 24.49 % java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
10711 be/4 cassandr  333.03 M/s  125.66 K/s  0.00 % 23.37 % java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
10716 be/4 cassandr  330.80 M/s  103.48 K/s  0.00 % 23.02 % java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
10717 be/4 cassandr  319.49 M/s  118.27 K/s  0.00 % 22.11 % java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
10713 be/4 cassandr  300.62 M/s  118.27 K/s  0.00 % 21.65 % java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
10719 be/4 cassandr  294.98 M/s   81.31 K/s  0.00 % 21.60 % java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
10720 be/4 cassandr  289.00 M/s   73.92 K/s  0.00 % 21.45 % java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
10742 be/4 cassandr  240.98 M/s   81.31 K/s  0.00 % 17.68 % java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
10743 be/4 cassandr  224.43 M/s   36.96 K/s  0.00 % 17.57 % java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
10744 be/4 cassandr  113.29 M/s   14.78 K/s  0.00 % 10.22 % java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
10745 be/4 cassandr   61.63 M/s   33.26 K/s  0.00 %  4.20 % java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/

==================================================================================================
*iostat -x 5*

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.35    0.00    7.09   *11.26*    0.00   79.30

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb             205.60     0.00 5304.60    3.00 3707651.20  2450.40  1398.03    11.37    2.14    2.14    0.67   0.18  93.60


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.99    0.00    7.49   *10.95*    0.00   78.56

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb             228.00     0.00 5639.40    4.80 3947922.40  1348.00  1399.41    12.65    2.24    2.25    0.46   0.17  94.00
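As a quick sanity check (my own arithmetic, not part of the iostat output), these numbers imply an average read size of roughly 0.7 MB per read request, far larger than the table's 64 KB compressed chunk size, which would be consistent with an oversized readahead:

```python
# Derive average bytes served per read request from the first iostat sample above.
rkb_per_s = 3707651.20   # rkB/s for sdb
reads_per_s = 5304.60    # r/s for sdb

avg_read_kb = rkb_per_s / reads_per_s
print(round(avg_read_kb))  # ~699 KB per read request

# Cross-check: avgrq-sz is reported in 512-byte sectors.
avgrq_sz_sectors = 1398.03
print(round(avgrq_sz_sectors * 512 / 1024))  # ~699 KB again
```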

==================================================================================================
*nodetool info*
ID                     : ddab003d-41d1-413e-ace2-87ccd34229d5
Gossip active          : true
Thrift active          : false
Native Transport active: true
Load                   : 391.58 GiB
Generation No          : 1536065356
Uptime (seconds)       : 4714
Heap Memory (MB)       : 5172.52 / 13312.00
Off Heap Memory (MB)   : 2288.28
Data Center            : DC1
Rack                   : RAC1
Exceptions             : 0
Key Cache              : entries 1604008, size 167.04 MiB, capacity 256 MiB, 3615 hits, 1801574 requests, 0.002 recent hit rate, 14400 save period in seconds
Row Cache              : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache          : entries 0, size 0 bytes, capacity 50 MiB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Chunk Cache            : entries 7680, size 480 MiB, capacity 480 MiB, 1399440 misses, 2338352 requests, 0.402 recent hit rate, 2760.976 microseconds miss latency
Percent Repaired       : 1.6357850897532832E-8%
Token                  : (invoke with -T/--tokens to see all 256 tokens)
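For what it's worth, the reported hit rates can be reproduced from the raw counters above (a minimal check I added, not part of the nodetool output); both the key cache and the chunk cache are mostly missing, so cold reads are going to disk:

```python
# Key cache hit rate: hits / requests
key_hits, key_requests = 3615, 1801574
print(round(key_hits / key_requests, 3))  # 0.002

# Chunk cache hit rate: (requests - misses) / requests
chunk_misses, chunk_requests = 1399440, 2338352
print(round((chunk_requests - chunk_misses) / chunk_requests, 3))  # 0.402
```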

=============================================================================

*nodetool tablehistograms ks xyz* :  Observation : not a wide row
ks/xyz histograms
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)
50%             1.00             17.08           5839.59              1109                17
75%             1.00             20.50           7007.51              1331                24
95%             2.00             29.52          12108.97              1331                24
98%             2.00             35.43          14530.76              1331                24
99%             2.00             42.51          17436.92              1331                24
Min             1.00              5.72            105.78               150                 2
Max             2.00            315.85         464228.84              1916                35

*nodetool tablehistograms ks xyz* (second run)
ks/xyz histograms
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)
50%             1.00             17.08           5839.59              1109                17
75%             1.00             20.50           7007.51              1331                24
95%             2.00             29.52          12108.97              1331                24
98%             2.00             35.43          14530.76              1331                24
99%             2.00             42.51          17436.92              1331                24
Min             1.00              4.77            105.78               150                 2
Max             2.00            315.85         464228.84              1916                35



======================================================================

*nodetool tpstats*

Pool Name                         Active   Pending      Completed   Blocked  All time blocked
ReadStage                              3         1        1603986         0                 0
MiscStage                              0         0              0         0                 0
CompactionExecutor                     0         0           6670         0                 0
MutationStage                          0         0        5044371         0                 0
MemtableReclaimMemory                  0         0             57         0                 0
PendingRangeCalculator                 0         0              3         0                 0
GossipStage                            0         0          14299         0                 0
SecondaryIndexManagement               0         0              0         0                 0
HintsDispatcher                        0         0              0         0                 0
RequestResponseStage                   0         1        3867575         0                 0
Native-Transport-Requests              3         0        2319802         0                 0
ReadRepairStage                        0         0          13281         0                 0
CounterMutationStage                   0         0              0         0                 0
MigrationStage                         0         0              2         0                 0
MemtablePostFlush                      0         0             65         0                 0
PerDiskMemtableFlushWriter_0           0         0             57         0                 0
ValidationExecutor                     0         0              0         0                 0
Sampler                                0         0              0         0                 0
MemtableFlushWriter                    0         0             57         0                 0
InternalResponseStage                  0         0              0         0                 0
ViewMutationStage                      0         0              0         0                 0
AntiEntropyStage                       0         0              0         0                 0
CacheCleanupExecutor                   0         0              0         0                 0

Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
HINT                         0
MUTATION                     0
COUNTER_MUTATION             0
BATCH_STORE                  0
BATCH_REMOVE                 0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                  0

=====================================================================

*nodetool compactionstats*
pending tasks: 0

==============================================================================
*debug.log snippet*


WARN  [PERIODIC-COMMIT-LOG-SYNCER] 2018-09-04 19:39:51,057 NoSpamLogger.java:94 - Out of 29 commit log syncs over the past 263.94s with average duration of 15.17ms, 2 have exceeded the configured commit interval by an average of 20.32ms
DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-04 19:39:53,015 AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating a fresh one
DEBUG [SlabPoolCleaner] 2018-09-04 19:39:59,659 ColumnFamilyStore.java:1308 - Flushing largest CFS(Keyspace='system', ColumnFamily='batches') to free up room. Used total: 0.33/0.00, live: 0.33/0.00, flushing: 0.00/0.00, this: 0.13/0.00
DEBUG [SlabPoolCleaner] 2018-09-04 19:39:59,659 ColumnFamilyStore.java:918 - Enqueuing flush of batches: 128.567MiB (13%) on-heap, 0.000KiB (0%) off-heap
DEBUG [PerDiskMemtableFlushWriter_0:12] 2018-09-04 19:39:59,685 Memtable.java:456 - Writing Memtable-batches@1450035834(78.630MiB serialized bytes, 125418 ops, 13%/0% of on/off-heap limit), flushed range = (null, null]
DEBUG [PerDiskMemtableFlushWriter_0:12] 2018-09-04 19:39:59,695 Memtable.java:485 - Completed flushing /var/lib/cassandra/data/system/batches-919a4bc57a333573b03e13fc3f68b465/mc-16-big-Data.db (0.000KiB) for commitlog position CommitLogPosition(segmentId=1536065319618, position=7958044)
DEBUG [MemtableFlushWriter:12] 2018-09-04 19:39:59,695 ColumnFamilyStore.java:1216 - Flushed to [] (0 sstables, 0.000KiB), biggest 0.000KiB, smallest 8589934592.000GiB
DEBUG [ScheduledTasks:1] 2018-09-04 19:40:15,710 MonitoringTask.java:173 - 2 operations were slow in the last 4999 msecs:
<SELECT * FROM ks1.abc WHERE key = 12345 LIMIT 5000>, time 575 msec - slow timeout 500 msec/cross-node
<SELECT * FROM ks.xyz WHERE key = key1 LIMIT 5000>, time 645 msec - slow timeout 500 msec/cross-node
DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-04 19:40:20,475 AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating a fresh one
DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-04 19:40:46,675 AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating a fresh one
DEBUG [SlabPoolCleaner] 2018-09-04 19:41:04,976 ColumnFamilyStore.java:1308 - Flushing largest CFS(Keyspace='ks', ColumnFamily='xyz') to free up room. Used total: 0.33/0.00, live: 0.33/0.00, flushing: 0.00/0.00, this: 0.12/0.00
DEBUG [SlabPoolCleaner] 2018-09-04 19:41:04,977 ColumnFamilyStore.java:918 - Enqueuing flush of xyz: 121.374MiB (12%) on-heap, 0.000KiB (0%) off-heap

*Observation:* frequent "operations were slow in the last ..." messages (for
select queries) and "WARN ... commit log syncs over the past ..." warnings.
===================================================

*nodetool tablestats -H ks.xyz*
Total number of tables: 89
----------------
Keyspace : ks
        Read Count: 1439722
        Read Latency: 1.8982509581710914 ms
        Write Count: 4222811
        Write Latency: 0.016324778684151386 ms
        Pending Flushes: 0
                Table: xyz
                SSTable count: 1036
                SSTables in each level: [1, 10, 116/100, 909, 0, 0, 0, 0, 0]
                Space used (live): 187.09 GiB
                Space used (total): 187.09 GiB
                Space used by snapshots (total): 0 bytes
                Off heap memory used (total): 783.93 MiB
                SSTable Compression Ratio: 0.3238726404414842
                Number of partitions (estimate): 447095605
                Memtable cell count: 306194
                Memtable data size: 20.59 MiB
                Memtable off heap memory used: 0 bytes
                Memtable switch count: 7
                Local read count: 1440322
                Local read latency: 6.785 ms
                Local write count: 1408204
                Local write latency: 0.021 ms
                Pending flushes: 0
                Percent repaired: 0.0
                Bloom filter false positives: 19
                Bloom filter false ratio: 0.00003
                Bloom filter space used: 418.2 MiB
                Bloom filter off heap memory used: 418.19 MiB
                Index summary off heap memory used: 307.75 MiB
                Compression metadata off heap memory used: 57.99 MiB
                Compacted partition minimum bytes: 150
                Compacted partition maximum bytes: 1916
                Compacted partition mean bytes: 1003
                Average live cells per slice (last five minutes): 20.0
                Maximum live cells per slice (last five minutes): 20
                Average tombstones per slice (last five minutes): 1.0
                Maximum tombstones per slice (last five minutes): 1
                Dropped Mutations: 0 bytes
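One more back-of-the-envelope check (mine, not from the tool): the SSTable compression ratio above implies the 187 GiB on disk corresponds to roughly 578 GiB of uncompressed data on this node, so the key cache and chunk cache cover only a tiny fraction of the working set:

```python
# SSTable Compression Ratio = compressed size / uncompressed size
space_live_gib = 187.09
compression_ratio = 0.3238726404414842

uncompressed_gib = space_live_gib / compression_ratio
print(round(uncompressed_gib, 1))  # ~577.7 GiB of logical data
```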

-- 

regards,
Laxmikant Upadhyay

Re: High IO and poor read performance on 3.11.2 cassandra cluster

Posted by "wxn002@zjqunshuo.com" <wx...@zjqunshuo.com>.
How large are your rows? You may be hitting the wide-row read problem.

-Simon

[quoted original message trimmed]


Re: High IO and poor read performance on 3.11.2 cassandra cluster

Posted by Elliott Sims <el...@backblaze.com>.
A few reasons I can think of offhand why your test setup might not see
problems from large readahead:
- Your sstables are <4MB, or your reads are typically <4MB from the end of
  the file
- Your queries tend to use the 4MB of data anyway
- Your dataset is small enough that most of it fits in the VM cache, and it
  rarely goes to disk
- Load is low enough that the read I/O amplification doesn't hurt performance
Less likely but still possible is that there's a subtle difference in the
way that 2.1 does reads vs 3.x that's affecting it.  The less subtle
explanation is that 3.x has smaller rows and a smaller readahead is
therefore probably optimal, but that would only decrease your performance
benefit and not cause a regression from 2.1->3.x.
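For intuition, the amplification these points describe can be sized with the
numbers reported earlier in this thread (64 KB compressed chunks, ~1 KB mean
partitions, 4 MB readahead on the slow setup). This is a rough sketch of the
arithmetic, not a measurement of any specific system:

```shell
# Back-of-the-envelope read amplification from kernel readahead, using the
# figures quoted in this thread (chunk_length_in_kb, tablestats mean bytes).
readahead_kb=4096      # 4 MB readahead observed on the slow 3.11.2 setup
chunk_kb=64            # chunk_length_in_kb = 64 from the table schema
partition_bytes=1003   # "Compacted partition mean bytes" from tablestats

# A point read must fetch one whole 64 KB compressed chunk to recover a
# ~1 KB partition; a 4 MB readahead can drag in many chunks' worth more.
echo "useful bytes per read : ${partition_bytes} B"
echo "chunk read per lookup : ${chunk_kb} KB"
echo "worst-case readahead  : ${readahead_kb} KB ($((readahead_kb / chunk_kb))x the chunk)"
```

Under this model a single uncached point read can cost two orders of
magnitude more disk traffic than the data it returns, which matches the
multi-GB/s read rates seen in the iotop output below.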


On Mon, Sep 10, 2018 at 1:27 AM, Laxmikant Upadhyay <laxmikant.hcl@gmail.com
> wrote:

> Thank you so much Alexander !
>
> Your doubt was right. It was due to the very high readahead value only
> (4 MB).
>
> Although we had set the readahead value to 8 KB in our /etc/rc.local,
> somehow this was not working.
> We are keeping the value at 64 KB, as this gives better performance
> than 8 KB. Now we are able to meet our SLA.
>
> One interesting observation is that we also have a setup on Cassandra
> 2.1.16, and on that system the readahead value is 4 MB, but we are not
> observing any performance dip there. I am not sure why.
>
>
> On Wed, Sep 5, 2018 at 11:31 AM Alexander Dejanovski <
> alex@thelastpickle.com> wrote:
>
>> Don't forget to run "nodetool upgradesstables -a" after you ran the ALTER
>> statement so that all SSTables get re-written with the new compression
>> settings.
>>
>> Since you have a lot of tables in your cluster, be aware that lowering
>> the chunk length will grow the offheap memory usage of Cassandra.
>> You can get more information here :
>> http://thelastpickle.com/blog/2018/08/08/compression_performance.html
>>
>> You should also check your readahead settings as it may be set too high :
>> sudo blockdev --report
>> The default is usually 256 but Cassandra would rather favor low readahead
>> values to get more IOPS instead of more throughput (and readahead is
>> usually not that useful for Cassandra). A conservative setting is 64 (you
>> can go down to 8 and see how Cassandra performs then).
>> Do note that changing the readahead setting requires restarting
>> Cassandra, as the value is only read once during startup.
>>
>> Cheers,
>>
>> On Wed, Sep 5, 2018 at 7:27 AM CPC <ac...@gmail.com> wrote:
>>
>>> Could you decrease chunk_length_in_kb to 16 or 8 and repeat the test.
>>>
>>> On Wed, Sep 5, 2018, 5:51 AM wxn002@zjqunshuo.com <wx...@zjqunshuo.com>
>>> wrote:
>>>
>>>> How large is your row? You may be hitting the wide-row read problem.
>>>>
>>>> -Simon
>>>>
>>>> *From:* Laxmikant Upadhyay <la...@gmail.com>
>>>> *Date:* 2018-09-05 01:01
>>>> *To:* user <us...@cassandra.apache.org>
>>>> *Subject:* High IO and poor read performance on 3.11.2 cassandra
>>>> cluster
>>>>
>>>> We have 3 node cassandra cluster (3.11.2) in single dc.
>>>>
>>>> We have written 450 million records on the table with LCS. The write
>>>> latency is fine.  After write we perform read and update operations.
>>>>
>>>> When we run read+update operations on newly inserted 1 million records
>>>> (on top of 450 m records) then the read latency and io usage is under
>>>> control. However when we perform read+update on old 1 million records which
>>>> are part of 450 million records we observe high read latency (The
>>>> performance goes down by 4 times in comparison 1st case ).  We have not
>>>> observed major gc pauses.
>>>>
>>>> *system information:*
>>>> *cpu core :*  24
>>>> *disk type : *SSD. We are using RAID with the deadline scheduler
>>>> *disk space:*
>>>> df -h :
>>>> Filesystem                      Size  Used Avail Use% Mounted on
>>>> /dev/sdb1                        1.9T  393G  1.5T  22%
>>>> /var/lib/cassandra
>>>> *memory:*
>>>> free -g
>>>>               total        used        free      shared  buff/cache   available
>>>> Mem:             62          30           0           0          32          31
>>>> Swap:             8           0           8
>>>>
>>>> ==========================================
>>>>
>>>> *schema*
>>>>
>>>> desc table ks.xyz;
>>>>
>>>> CREATE TABLE ks.xyz (
>>>>     key text,
>>>>     column1 text,
>>>>     value text,
>>>>     PRIMARY KEY (key, column1)
>>>> ) WITH COMPACT STORAGE
>>>>     AND CLUSTERING ORDER BY (column1 ASC)
>>>>     AND bloom_filter_fp_chance = 0.1
>>>>     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>>>>     AND comment = ''
>>>>     AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
>>>>     AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
>>>>     AND crc_check_chance = 1.0
>>>>     AND dclocal_read_repair_chance = 0.0
>>>>     AND default_time_to_live = 0
>>>>     AND gc_grace_seconds = 864000
>>>>     AND max_index_interval = 2048
>>>>     AND memtable_flush_period_in_ms = 0
>>>>     AND min_index_interval = 128
>>>>     AND read_repair_chance = 0.0
>>>>     AND speculative_retry = '99PERCENTILE';
>>>> ============================================================
>>>> ==================
>>>> Below is a system stats snippet from when read operations were running:
>>>>
>>>> *iotop -o * : Observation : the disk read goes up to 5.5 G/s
>>>>
>>>> Total DISK READ :       *3.86 G/s* | Total DISK WRITE :    1252.88 K/s
>>>> Actual DISK READ:      * 3.92 G/s* | Actual DISK WRITE:       0.00 B/s
>>>>   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
>>>> 10715 be/4 cassandr  375.89 M/s   99.79 K/s  0.00 % 29.15 % java
>>>> -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/
>>>> usr/share/cassandra/lib/
>>>> 10714 be/4 cassandr  358.56 M/s  107.18 K/s  0.00 % 27.06 % java
>>>> -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/
>>>> usr/share/cassandra/lib/
>>>> 10712 be/4 cassandr  351.86 M/s  147.83 K/s  0.00 % 25.02 % java
>>>> -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/
>>>> usr/share/cassandra/lib/
>>>> 10718 be/4 cassandr  359.82 M/s  110.87 K/s  0.00 % 24.49 % java
>>>> -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/
>>>> usr/share/cassandra/lib/
>>>> 10711 be/4 cassandr  333.03 M/s  125.66 K/s  0.00 % 23.37 % java
>>>> -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/
>>>> usr/share/cassandra/lib/
>>>> 10716 be/4 cassandr  330.80 M/s  103.48 K/s  0.00 % 23.02 % java
>>>> -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/
>>>> usr/share/cassandra/lib/
>>>> 10717 be/4 cassandr  319.49 M/s  118.27 K/s  0.00 % 22.11 % java
>>>> -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/
>>>> usr/share/cassandra/lib/
>>>> 10713 be/4 cassandr  300.62 M/s  118.27 K/s  0.00 % 21.65 % java
>>>> -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/
>>>> usr/share/cassandra/lib/
>>>> 10719 be/4 cassandr  294.98 M/s   81.31 K/s  0.00 % 21.60 % java
>>>> -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/
>>>> usr/share/cassandra/lib/
>>>> 10720 be/4 cassandr  289.00 M/s   73.92 K/s  0.00 % 21.45 % java
>>>> -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/
>>>> usr/share/cassandra/lib/
>>>> 10742 be/4 cassandr  240.98 M/s   81.31 K/s  0.00 % 17.68 % java
>>>> -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/
>>>> usr/share/cassandra/lib/
>>>> 10743 be/4 cassandr  224.43 M/s   36.96 K/s  0.00 % 17.57 % java
>>>> -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/
>>>> usr/share/cassandra/lib/
>>>> 10744 be/4 cassandr  113.29 M/s   14.78 K/s  0.00 % 10.22 % java
>>>> -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/
>>>> usr/share/cassandra/lib/
>>>> 10745 be/4 cassandr   61.63 M/s   33.26 K/s  0.00 %  4.20 % java
>>>> -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/
>>>> usr/share/cassandra/lib/
>>>>
>>>> ============================================================
>>>> ======================================
>>>> *iostat -x 5*
>>>>
>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>            2.35    0.00    7.09  *11.26*   0.00   79.30
>>>>
>>>> Device:         rrqm/s   wrqm/s     r/s     w/s      rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>> sdb             205.60     0.00 5304.60    3.00 3707651.20  2450.40  1398.03    11.37    2.14    2.14    0.67   0.18  93.60
>>>>
>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>            2.99    0.00    7.49  *10.95*   0.00   78.56
>>>>
>>>> Device:         rrqm/s   wrqm/s     r/s     w/s      rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>> sdb             228.00     0.00 5639.40    4.80 3947922.40  1348.00  1399.41    12.65    2.24    2.25    0.46   0.17  94.00
>>>>
>>>> ============================================================
>>>> ======================================
>>>> *nodetool info*
>>>> ID                     : ddab003d-41d1-413e-ace2-87ccd34229d5
>>>> Gossip active          : true
>>>> Thrift active          : false
>>>> Native Transport active: true
>>>> Load                   : 391.58 GiB
>>>> Generation No          : 1536065356
>>>> Uptime (seconds)       : 4714
>>>> Heap Memory (MB)       : 5172.52 / 13312.00
>>>> Off Heap Memory (MB)   : 2288.28
>>>> Data Center            : DC1
>>>> Rack                   : RAC1
>>>> Exceptions             : 0
>>>> Key Cache              : entries 1604008, size 167.04 MiB, capacity 256
>>>> MiB, 3615 hits, 1801574 requests, 0.002 recent hit rate, 14400 save period
>>>> in seconds
>>>> Row Cache              : entries 0, size 0 bytes, capacity 0 bytes, 0
>>>> hits, 0 requests, NaN recent hit rate, 0 save period in seconds
>>>> Counter Cache          : entries 0, size 0 bytes, capacity 50 MiB, 0
>>>> hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
>>>> Chunk Cache            : entries 7680, size 480 MiB, capacity 480 MiB,
>>>> 1399440 misses, 2338352 requests, 0.402 recent hit rate, 2760.976
>>>> microseconds miss latency
>>>> Percent Repaired       : 1.6357850897532832E-8%
>>>> Token                  : (invoke with -T/--tokens to see all 256 tokens)
>>>>
>>>> ============================================================
>>>> =================
>>>>
>>>> *nodetool tablehistograms ks xyz  :  *Observation : not a wide row
>>>> ks/xyz histograms
>>>> Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
>>>>                               (micros)          (micros)           (bytes)
>>>> 50%             1.00             17.08           5839.59              1109                17
>>>> 75%             1.00             20.50          *7007.51*             1331                24
>>>> 95%             2.00             29.52         *12108.97*             1331                24
>>>> 98%             2.00             35.43          14530.76              1331                24
>>>> 99%             2.00             42.51          17436.92              1331                24
>>>> Min             1.00              5.72            105.78               150                 2
>>>> Max             2.00            315.85         464228.84              1916                35
>>>>
>>>> *nodetool tablehistograms ks xyz*
>>>> ks/xyz histograms
>>>> Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
>>>>                               (micros)          (micros)           (bytes)
>>>> 50%             1.00             17.08           5839.59              1109                17
>>>> 75%             1.00             20.50           7007.51              1331                24
>>>> 95%             2.00             29.52          12108.97              1331                24
>>>> 98%             2.00             35.43          14530.76              1331                24
>>>> 99%             2.00             42.51          17436.92              1331                24
>>>> Min             1.00              4.77            105.78               150                 2
>>>> Max             2.00            315.85         464228.84              1916                35
>>>>
>>>>
>>>>
>>>> ======================================================================
>>>>
>>>> *nodetool tpstats*
>>>>
>>>> Pool Name                         Active   Pending      Completed   Blocked  All time blocked
>>>> ReadStage                              3         1        1603986         0                 0
>>>> MiscStage                              0         0              0         0                 0
>>>> CompactionExecutor                     0         0           6670         0                 0
>>>> MutationStage                          0         0        5044371         0                 0
>>>> MemtableReclaimMemory                  0         0             57         0                 0
>>>> PendingRangeCalculator                 0         0              3         0                 0
>>>> GossipStage                            0         0          14299         0                 0
>>>> SecondaryIndexManagement               0         0              0         0                 0
>>>> HintsDispatcher                        0         0              0         0                 0
>>>> RequestResponseStage                   0         1        3867575         0                 0
>>>> Native-Transport-Requests              3         0        2319802         0                 0
>>>> ReadRepairStage                        0         0          13281         0                 0
>>>> CounterMutationStage                   0         0              0         0                 0
>>>> MigrationStage                         0         0              2         0                 0
>>>> MemtablePostFlush                      0         0             65         0                 0
>>>> PerDiskMemtableFlushWriter_0           0         0             57         0                 0
>>>> ValidationExecutor                     0         0              0         0                 0
>>>> Sampler                                0         0              0         0                 0
>>>> MemtableFlushWriter                    0         0             57         0                 0
>>>> InternalResponseStage                  0         0              0         0                 0
>>>> ViewMutationStage                      0         0              0         0                 0
>>>> AntiEntropyStage                       0         0              0         0                 0
>>>> CacheCleanupExecutor                   0         0              0         0                 0
>>>>
>>>> Message type           Dropped
>>>> READ                         0
>>>> RANGE_SLICE                  0
>>>> _TRACE                       0
>>>> HINT                         0
>>>> MUTATION                     0
>>>> COUNTER_MUTATION             0
>>>> BATCH_STORE                  0
>>>> BATCH_REMOVE                 0
>>>> REQUEST_RESPONSE             0
>>>> PAGED_RANGE                  0
>>>> READ_REPAIR                  0
>>>>
>>>> =====================================================================
>>>>
>>>> *nodetool compactionstats*
>>>> pending tasks: 0
>>>>
>>>> ============================================================
>>>> ==================
>>>> *debug.log snippet*
>>>>
>>>>
>>>> WARN  [PERIODIC-COMMIT-LOG-SYNCER] 2018-09-04 19:39:51,057
>>>> NoSpamLogger.java:94 - Out of 29 commit log syncs over the past 263.94s
>>>> with average duration of 15.17ms, 2 have exceeded the configured commit
>>>> interval by an average of 20.32ms
>>>> DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-04 19:39:53,015
>>>> AbstractCommitLogSegmentManager.java:109 - No segments in reserve;
>>>> creating a fresh one
>>>> DEBUG [SlabPoolCleaner] 2018-09-04 19:39:59,659
>>>> ColumnFamilyStore.java:1308 - Flushing largest CFS(Keyspace='system',
>>>> ColumnFamily='batches') to free up room. Used total: 0.33/0.00, live:
>>>> 0.33/0.00, flushing: 0.00/0.00, this: 0.13/0.00
>>>> DEBUG [SlabPoolCleaner] 2018-09-04 19:39:59,659
>>>> ColumnFamilyStore.java:918 - Enqueuing flush of batches: 128.567MiB (13%)
>>>> on-heap, 0.000KiB (0%) off-heap
>>>> DEBUG [PerDiskMemtableFlushWriter_0:12] 2018-09-04 19:39:59,685
>>>> Memtable.java:456 - Writing Memtable-batches@1450035834(78.630MiB
>>>> serialized bytes, 125418 ops, 13%/0% of on/off-heap limit), flushed range =
>>>> (null, null]
>>>> DEBUG [PerDiskMemtableFlushWriter_0:12] 2018-09-04 19:39:59,695
>>>> Memtable.java:485 - Completed flushing /var/lib/cassandra/data/
>>>> system/batches-919a4bc57a333573b03e13fc3f68b465/mc-16-big-Data.db
>>>> (0.000KiB) for commitlog position CommitLogPosition(segmentId=1536065319618,
>>>> position=7958044)
>>>> DEBUG [MemtableFlushWriter:12] 2018-09-04 19:39:59,695
>>>> ColumnFamilyStore.java:1216 - Flushed to [] (0 sstables, 0.000KiB), biggest
>>>> 0.000KiB, smallest 8589934592.000GiB
>>>> DEBUG [ScheduledTasks:1] 2018-09-04 19:40:15,710
>>>> MonitoringTask.java:173 - 2 operations were slow in the last 4999 msecs:
>>>> <SELECT * FROM ks1.abc WHERE key = 12345 LIMIT 5000>, time 575 msec -
>>>> slow timeout 500 msec/cross-node
>>>> <SELECT * FROM ks.xyz WHERE key = key1 LIMIT 5000>, time 645 msec -
>>>> slow timeout 500 msec/cross-node
>>>> DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-04 19:40:20,475
>>>> AbstractCommitLogSegmentManager.java:109 - No segments in reserve;
>>>> creating a fresh one
>>>> DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-04 19:40:46,675
>>>> AbstractCommitLogSegmentManager.java:109 - No segments in reserve;
>>>> creating a fresh one
>>>> DEBUG [SlabPoolCleaner] 2018-09-04 19:41:04,976
>>>> ColumnFamilyStore.java:1308 - Flushing largest CFS(Keyspace='ks',
>>>> ColumnFamily='xyz') to free up room. Used total: 0.33/0.00, live:
>>>> 0.33/0.00, flushing: 0.00/0.00, this: 0.12/0.00
>>>> DEBUG [SlabPoolCleaner] 2018-09-04 19:41:04,977
>>>> ColumnFamilyStore.java:918 - Enqueuing flush of xyz: 121.374MiB (12%)
>>>> on-heap, 0.000KiB (0%) off-heap
>>>>
>>>> *Observation :*  frequent "operations were slow in the last ..."
>>>> messages (for select queries) and "commit log syncs over the past ..."
>>>> warnings
>>>> ===================================================
>>>>
>>>> *nodetool tablestats -H  ks.xyz;*
>>>> Total number of tables: 89
>>>> ----------------
>>>> Keyspace : ks
>>>>         Read Count: 1439722
>>>>         Read Latency: 1.8982509581710914 ms
>>>>         Write Count: 4222811
>>>>         Write Latency: 0.016324778684151386 ms
>>>>         Pending Flushes: 0
>>>>                 Table: xyz
>>>>                 SSTable count: 1036
>>>>                 SSTables in each level: [1, 10, 116/100, 909, 0, 0, 0, 0, 0]
>>>>                 Space used (live): 187.09 GiB
>>>>                 Space used (total): 187.09 GiB
>>>>                 Space used by snapshots (total): 0 bytes
>>>>                 Off heap memory used (total): 783.93 MiB
>>>>                 SSTable Compression Ratio: 0.3238726404414842
>>>>                 Number of partitions (estimate): 447095605
>>>>                 Memtable cell count: 306194
>>>>                 Memtable data size: 20.59 MiB
>>>>                 Memtable off heap memory used: 0 bytes
>>>>                 Memtable switch count: 7
>>>>                 Local read count: 1440322
>>>>                 Local read latency: 6.785 ms
>>>>                 Local write count: 1408204
>>>>                 Local write latency: 0.021 ms
>>>>                 Pending flushes: 0
>>>>                 Percent repaired: 0.0
>>>>                 Bloom filter false positives: 19
>>>>                 Bloom filter false ratio: 0.00003
>>>>                 Bloom filter space used: 418.2 MiB
>>>>                 Bloom filter off heap memory used: 418.19 MiB
>>>>                 Index summary off heap memory used: 307.75 MiB
>>>>                 Compression metadata off heap memory used: 57.99 MiB
>>>>                 Compacted partition minimum bytes: 150
>>>>                 Compacted partition maximum bytes: 1916
>>>>                 Compacted partition mean bytes: 1003
>>>>                 Average live cells per slice (last five minutes): 20.0
>>>>                 Maximum live cells per slice (last five minutes): 20
>>>>                 Average tombstones per slice (last five minutes): 1.0
>>>>                 Maximum tombstones per slice (last five minutes): 1
>>>>                 Dropped Mutations: 0 bytes
>>>>
>>>> --
>>>>
>>>> regards,
>>>> Laxmikant Upadhyay
>>>>
>>>> --
>> -----------------
>> Alexander Dejanovski
>> France
>> @alexanderdeja
>>
>> Consultant
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>
>
> --
>
> regards,
> Laxmikant Upadhyay
>
>

Re: High IO and poor read performance on 3.11.2 cassandra cluster

Posted by Laxmikant Upadhyay <la...@gmail.com>.
Thank you so much Alexander !

Your doubt was right. It was due to the very high readahead value only
(4 MB).

Although we had set the readahead value to 8 KB in our /etc/rc.local,
somehow this was not working.
We are keeping the value at 64 KB, as this gives better performance
than 8 KB. Now we are able to meet our SLA.

One interesting observation is that we also have a setup on Cassandra
2.1.16, and on that system the readahead value is 4 MB, but we are not
observing any performance dip there. I am not sure why.
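For reference, a minimal sketch of inspecting, lowering, and persisting the
readahead value. The device name (/dev/sdb) and the 64 KB target come from
this thread; the udev rule filename is only an example, so adapt both to
your system:

```shell
# Hedged sketch: inspect, set, and persist readahead for the data disk.
# /dev/sdb and 64 KB are the values discussed in this thread.

# blockdev counts in 512-byte sectors, so 128 sectors = 64 KB.
sudo blockdev --report /dev/sdb        # show current RA value (in sectors)
sudo blockdev --setra 128 /dev/sdb     # 64 KB readahead; lost on reboot

# A udev rule re-applies the value whenever the device appears, which is
# more reliable than /etc/rc.local (rule filename is just an example):
echo 'ACTION=="add|change", KERNEL=="sdb", ATTR{queue/read_ahead_kb}="64"' \
  | sudo tee /etc/udev/rules.d/60-readahead.rules
sudo udevadm control --reload-rules
```

As noted elsewhere in this thread, Cassandra should be restarted after the
readahead value is changed.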


On Wed, Sep 5, 2018 at 11:31 AM Alexander Dejanovski <al...@thelastpickle.com>
wrote:

> Don't forget to run "nodetool upgradesstables -a" after you ran the ALTER
> statement so that all SSTables get re-written with the new compression
> settings.
>
> Since you have a lot of tables in your cluster, be aware that lowering the
> chunk length will grow the offheap memory usage of Cassandra.
> You can get more informations here :
> http://thelastpickle.com/blog/2018/08/08/compression_performance.html
>
> You should also check your readahead settings as it may be set too high :
> sudo blockdev --report
> The default is usually 256 but Cassandra would rather favor low readahead
> values to get more IOPS instead of more throughput (and readahead is
> usually not that useful for Cassandra). A conservative setting is 64 (you
> can go down to 8 and see how Cassandra performs then).
> Do note that changing the readahead settings requires to restart Cassandra
> as it is only read once by the JVM during startup.
>
> Cheers,
>
> On Wed, Sep 5, 2018 at 7:27 AM CPC <ac...@gmail.com> wrote:
>
>> Could you decrease chunk_length_in_kb to 16 or 8 and repeat the test.
>>
>> On Wed, Sep 5, 2018, 5:51 AM wxn002@zjqunshuo.com <wx...@zjqunshuo.com>
>> wrote:
>>
>>> How large is your row? You may meet reading wide row problem.
>>>
>>> -Simon
>>>
>>> *From:* Laxmikant Upadhyay <la...@gmail.com>
>>> *Date:* 2018-09-05 01:01
>>> *To:* user <us...@cassandra.apache.org>
>>> *Subject:* High IO and poor read performance on 3.11.2 cassandra cluster
>>>
>>> We have 3 node cassandra cluster (3.11.2) in single dc.
>>>
>>> We have written 450 million records on the table with LCS. The write
>>> latency is fine.  After write we perform read and update operations.
>>>
>>> When we run read+update operations on newly inserted 1 million records
>>> (on top of 450 m records) then the read latency and io usage is under
>>> control. However when we perform read+update on old 1 million records which
>>> are part of 450 million records we observe high read latency (The
>>> performance goes down by 4 times in comparison 1st case ).  We have not
>>> observed major gc pauses.
>>>
>>> *system information:*
>>> *cpu core :*  24
>>> *disc type : *ssd . we are using raid with deadline schedular
>>> *disk space:*
>>> df -h :
>>> Filesystem                      Size  Used Avail Use% Mounted on
>>> /dev/sdb1                        1.9T  393G  1.5T  22% /var/lib/cassandra
>>> *memory:*
>>> free -g
>>>               total        used        free      shared  buff/cache
>>>  available
>>> Mem:             62          30           0           0          32
>>>     31
>>> Swap:             8           0           8
>>>
>>> ==========================================
>>>
>>> *schema*
>>>
>>> desc table ks.xyz;
>>>
>>> CREATE TABLE ks.xyz (
>>>     key text,
>>>     column1 text,
>>>     value text,
>>>     PRIMARY KEY (key, column1)
>>> ) WITH COMPACT STORAGE
>>>     AND CLUSTERING ORDER BY (column1 ASC)
>>>     AND bloom_filter_fp_chance = 0.1
>>>     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>>>     AND comment = ''
>>>     AND compaction = {'class':
>>> 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
>>>     AND compression = {'chunk_length_in_kb': '64', 'class':
>>> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>>>     AND crc_check_chance = 1.0
>>>     AND dclocal_read_repair_chance = 0.0
>>>     AND default_time_to_live = 0
>>>     AND gc_grace_seconds = 864000
>>>     AND max_index_interval = 2048
>>>     AND memtable_flush_period_in_ms = 0
>>>     AND min_index_interval = 128
>>>     AND read_repair_chance = 0.0
>>>     AND speculative_retry = '99PERCENTILE';
>>>
>>> ==============================================================================
>>> Below is some system stats snippet when read operations was running:
>>>
>>> *iotop -o * : Observation : the disk read goes up to 5.5 G/s
>>>
>>> Total DISK READ :       *3.86 G/s* | Total DISK WRITE :    1252.88 K/s
>>> Actual DISK READ:      * 3.92 G/s* | Actual DISK WRITE:       0.00 B/s
>>>   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
>>> 10715 be/4 cassandr  375.89 M/s   99.79 K/s  0.00 % 29.15 % java
>>> -Dorg.xerial.snappy.tempdir=/var/tmp
>>> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>>> 10714 be/4 cassandr  358.56 M/s  107.18 K/s  0.00 % 27.06 % java
>>> -Dorg.xerial.snappy.tempdir=/var/tmp
>>> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>>> 10712 be/4 cassandr  351.86 M/s  147.83 K/s  0.00 % 25.02 % java
>>> -Dorg.xerial.snappy.tempdir=/var/tmp
>>> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>>> 10718 be/4 cassandr  359.82 M/s  110.87 K/s  0.00 % 24.49 % java
>>> -Dorg.xerial.snappy.tempdir=/var/tmp
>>> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>>> 10711 be/4 cassandr  333.03 M/s  125.66 K/s  0.00 % 23.37 % java
>>> -Dorg.xerial.snappy.tempdir=/var/tmp
>>> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>>> 10716 be/4 cassandr  330.80 M/s  103.48 K/s  0.00 % 23.02 % java
>>> -Dorg.xerial.snappy.tempdir=/var/tmp
>>> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>>> 10717 be/4 cassandr  319.49 M/s  118.27 K/s  0.00 % 22.11 % java
>>> -Dorg.xerial.snappy.tempdir=/var/tmp
>>> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>>> 10713 be/4 cassandr  300.62 M/s  118.27 K/s  0.00 % 21.65 % java
>>> -Dorg.xerial.snappy.tempdir=/var/tmp
>>> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>>> 10719 be/4 cassandr  294.98 M/s   81.31 K/s  0.00 % 21.60 % java
>>> -Dorg.xerial.snappy.tempdir=/var/tmp
>>> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>>> 10720 be/4 cassandr  289.00 M/s   73.92 K/s  0.00 % 21.45 % java
>>> -Dorg.xerial.snappy.tempdir=/var/tmp
>>> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>>> 10742 be/4 cassandr  240.98 M/s   81.31 K/s  0.00 % 17.68 % java
>>> -Dorg.xerial.snappy.tempdir=/var/tmp
>>> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>>> 10743 be/4 cassandr  224.43 M/s   36.96 K/s  0.00 % 17.57 % java
>>> -Dorg.xerial.snappy.tempdir=/var/tmp
>>> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>>> 10744 be/4 cassandr  113.29 M/s   14.78 K/s  0.00 % 10.22 % java
>>> -Dorg.xerial.snappy.tempdir=/var/tmp
>>> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>>> 10745 be/4 cassandr   61.63 M/s   33.26 K/s  0.00 %  4.20 % java
>>> -Dorg.xerial.snappy.tempdir=/var/tmp
>>> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>>>
>>>
>>> ==================================================================================================
>>> *iostats -x 5*
>>>
>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>                      2.35    0.00    7.09       *11.26 *   0.00   79.30
>>>
>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>> sdb             205.60     0.00 5304.60    3.00 3707651.20  2450.40
>>> 1398.03    11.37    2.14    2.14    0.67   0.18  93.60
>>>
>>>
>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>                        2.99    0.00    7.49       *10.95*    0.00
>>>  78.56
>>>
>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>> sdb             228.00     0.00 5639.40    4.80 3947922.40  1348.00
>>> 1399.41    12.65    2.24    2.25    0.46   0.17  94.00
>>>
>>>
>>> ==================================================================================================
>>> *nodetool info*
>>> ID                     : ddab003d-41d1-413e-ace2-87ccd34229d5
>>> Gossip active          : true
>>> Thrift active          : false
>>> Native Transport active: true
>>> Load                   : 391.58 GiB
>>> Generation No          : 1536065356
>>> Uptime (seconds)       : 4714
>>> Heap Memory (MB)       : 5172.52 / 13312.00
>>> Off Heap Memory (MB)   : 2288.28
>>> Data Center            : DC1
>>> Rack                   : RAC1
>>> Exceptions             : 0
>>> Key Cache              : entries 1604008, size 167.04 MiB, capacity 256
>>> MiB, 3615 hits, 1801574 requests, 0.002 recent hit rate, 14400 save period
>>> in seconds
>>> Row Cache              : entries 0, size 0 bytes, capacity 0 bytes, 0
>>> hits, 0 requests, NaN recent hit rate, 0 save period in seconds
>>> Counter Cache          : entries 0, size 0 bytes, capacity 50 MiB, 0
>>> hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
>>> Chunk Cache            : entries 7680, size 480 MiB, capacity 480 MiB,
>>> 1399440 misses, 2338352 requests, 0.402 recent hit rate, 2760.976
>>> microseconds miss latency
>>> Percent Repaired       : 1.6357850897532832E-8%
>>> Token                  : (invoke with -T/--tokens to see all 256 tokens)
>>>
>>>
>>> =============================================================================
>>>
>>> *nodetool tablehistograms ks xyz  :  *Observation : not a wide row
>>> ks/xyz histograms
>>> Percentile  SSTables     Write Latency      Read Latency    Partition
>>> Size        Cell Count
>>>                               (micros)          (micros)
>>>  (bytes)
>>> 50%             1.00             17.08           5839.59
>>> 1109                17
>>> 75%             1.00             20.50           *7007.51 *
>>> 1331                24
>>> 95%             2.00             29.52          *12108.97 *
>>> 1331                24
>>> 98%             2.00             35.43          14530.76
>>> 1331                24
>>> 99%             2.00             42.51          17436.92
>>> 1331                24
>>> Min             1.00              5.72            105.78
>>>  150                 2
>>> Max             2.00            315.85         464228.84
>>> 1916                35
>>>
>>> *nodetool tablehistograms ks xyz*
>>> ks/xyz histograms
>>> Percentile  SSTables    Write Latency    Read Latency    Partition Size    Cell Count
>>>                              (micros)        (micros)           (bytes)
>>> 50%             1.00            17.08         5839.59              1109            17
>>> 75%             1.00            20.50         7007.51              1331            24
>>> 95%             2.00            29.52        12108.97              1331            24
>>> 98%             2.00            35.43        14530.76              1331            24
>>> 99%             2.00            42.51        17436.92              1331            24
>>> Min             1.00             4.77          105.78               150             2
>>> Max             2.00           315.85       464228.84              1916            35
>>>
>>>
>>>
>>> ======================================================================
>>>
>>> *nodetool tpstats*
>>>
>>> Pool Name                     Active  Pending  Completed  Blocked  All time blocked
>>> ReadStage                          3        1    1603986        0                 0
>>> MiscStage                          0        0          0        0                 0
>>> CompactionExecutor                 0        0       6670        0                 0
>>> MutationStage                      0        0    5044371        0                 0
>>> MemtableReclaimMemory              0        0         57        0                 0
>>> PendingRangeCalculator             0        0          3        0                 0
>>> GossipStage                        0        0      14299        0                 0
>>> SecondaryIndexManagement           0        0          0        0                 0
>>> HintsDispatcher                    0        0          0        0                 0
>>> RequestResponseStage               0        1    3867575        0                 0
>>> Native-Transport-Requests          3        0    2319802        0                 0
>>> ReadRepairStage                    0        0      13281        0                 0
>>> CounterMutationStage               0        0          0        0                 0
>>> MigrationStage                     0        0          2        0                 0
>>> MemtablePostFlush                  0        0         65        0                 0
>>> PerDiskMemtableFlushWriter_0       0        0         57        0                 0
>>> ValidationExecutor                 0        0          0        0                 0
>>> Sampler                            0        0          0        0                 0
>>> MemtableFlushWriter                0        0         57        0                 0
>>> InternalResponseStage              0        0          0        0                 0
>>> ViewMutationStage                  0        0          0        0                 0
>>> AntiEntropyStage                   0        0          0        0                 0
>>> CacheCleanupExecutor               0        0          0        0                 0
>>>
>>> Message type           Dropped
>>> READ                         0
>>> RANGE_SLICE                  0
>>> _TRACE                       0
>>> HINT                         0
>>> MUTATION                     0
>>> COUNTER_MUTATION             0
>>> BATCH_STORE                  0
>>> BATCH_REMOVE                 0
>>> REQUEST_RESPONSE             0
>>> PAGED_RANGE                  0
>>> READ_REPAIR                  0
>>>
>>> =====================================================================
>>>
>>> *nodetool compactionstats*
>>> pending tasks: 0
>>>
>>>
>>> ==============================================================================
>>> *debug.log snippet*
>>>
>>>
>>> WARN  [PERIODIC-COMMIT-LOG-SYNCER] 2018-09-04 19:39:51,057
>>> NoSpamLogger.java:94 - Out of 29 commit log syncs over the past 263.94s
>>> with average duration of 15.17ms, 2 have exceeded the configured commit
>>> interval by an average of 20.32ms
>>> DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-04 19:39:53,015
>>> AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating
>>> a fresh one
>>> DEBUG [SlabPoolCleaner] 2018-09-04 19:39:59,659
>>> ColumnFamilyStore.java:1308 - Flushing largest CFS(Keyspace='system',
>>> ColumnFamily='batches') to free up room. Used total: 0.33/0.00, live:
>>> 0.33/0.00, flushing: 0.00/0.00, this: 0.13/0.00
>>> DEBUG [SlabPoolCleaner] 2018-09-04 19:39:59,659
>>> ColumnFamilyStore.java:918 - Enqueuing flush of batches: 128.567MiB (13%)
>>> on-heap, 0.000KiB (0%) off-heap
>>> DEBUG [PerDiskMemtableFlushWriter_0:12] 2018-09-04 19:39:59,685
>>> Memtable.java:456 - Writing Memtable-batches@1450035834(78.630MiB
>>> serialized bytes, 125418 ops, 13%/0% of on/off-heap limit), flushed range =
>>> (null, null]
>>> DEBUG [PerDiskMemtableFlushWriter_0:12] 2018-09-04 19:39:59,695
>>> Memtable.java:485 - Completed flushing
>>> /var/lib/cassandra/data/system/batches-919a4bc57a333573b03e13fc3f68b465/mc-16-big-Data.db
>>> (0.000KiB) for commitlog position
>>> CommitLogPosition(segmentId=1536065319618, position=7958044)
>>> DEBUG [MemtableFlushWriter:12] 2018-09-04 19:39:59,695
>>> ColumnFamilyStore.java:1216 - Flushed to [] (0 sstables, 0.000KiB), biggest
>>> 0.000KiB, smallest 8589934592.000GiB
>>> DEBUG [ScheduledTasks:1] 2018-09-04 19:40:15,710 MonitoringTask.java:173
>>> - 2 operations were slow in the last 4999 msecs:
>>> <SELECT * FROM ks1.abc WHERE key = 12345 LIMIT 5000>, time 575 msec -
>>> slow timeout 500 msec/cross-node
>>> <SELECT * FROM ks.xyz WHERE key = key1 LIMIT 5000>, time 645 msec -
>>> slow timeout 500 msec/cross-node
>>> DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-04 19:40:20,475
>>> AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating
>>> a fresh one
>>> DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-04 19:40:46,675
>>> AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating
>>> a fresh one
>>> DEBUG [SlabPoolCleaner] 2018-09-04 19:41:04,976
>>> ColumnFamilyStore.java:1308 - Flushing largest CFS(Keyspace='ks',
>>> ColumnFamily='xyz') to free up room. Used total: 0.33/0.00, live:
>>> 0.33/0.00, flushing: 0.00/0.00, this: 0.12/0.00
>>> DEBUG [SlabPoolCleaner] 2018-09-04 19:41:04,977
>>> ColumnFamilyStore.java:918 - Enqueuing flush of xyz: 121.374MiB (12%)
>>> on-heap, 0.000KiB (0%) off-heap
>>>
>>> *Observation:* frequent "operations were slow in the last ..." messages (for
>>> select queries) and "commit log syncs over the past ..." warnings
>>> ===================================================
>>>
>>> *nodetool tablestats -H ks.xyz*
>>> Total number of tables: 89
>>> ----------------
>>> Keyspace : ks
>>>         Read Count: 1439722
>>>         Read Latency: 1.8982509581710914 ms
>>>         Write Count: 4222811
>>>         Write Latency: 0.016324778684151386 ms
>>>         Pending Flushes: 0
>>>                 Table: xyz
>>>                 SSTable count: 1036
>>>                 SSTables in each level: [1, 10, 116/100, 909, 0, 0, 0, 0, 0]
>>>                 Space used (live): 187.09 GiB
>>>                 Space used (total): 187.09 GiB
>>>                 Space used by snapshots (total): 0 bytes
>>>                 Off heap memory used (total): 783.93 MiB
>>>                 SSTable Compression Ratio: 0.3238726404414842
>>>                 Number of partitions (estimate): 447095605
>>>                 Memtable cell count: 306194
>>>                 Memtable data size: 20.59 MiB
>>>                 Memtable off heap memory used: 0 bytes
>>>                 Memtable switch count: 7
>>>                 Local read count: 1440322
>>>                 Local read latency: 6.785 ms
>>>                 Local write count: 1408204
>>>                 Local write latency: 0.021 ms
>>>                 Pending flushes: 0
>>>                 Percent repaired: 0.0
>>>                 Bloom filter false positives: 19
>>>                 Bloom filter false ratio: 0.00003
>>>                 Bloom filter space used: 418.2 MiB
>>>                 Bloom filter off heap memory used: 418.19 MiB
>>>                 Index summary off heap memory used: 307.75 MiB
>>>                 Compression metadata off heap memory used: 57.99 MiB
>>>                 Compacted partition minimum bytes: 150
>>>                 Compacted partition maximum bytes: 1916
>>>                 Compacted partition mean bytes: 1003
>>>                 Average live cells per slice (last five minutes): 20.0
>>>                 Maximum live cells per slice (last five minutes): 20
>>>                 Average tombstones per slice (last five minutes): 1.0
>>>                 Maximum tombstones per slice (last five minutes): 1
>>>                 Dropped Mutations: 0 bytes
>>>
>>> --
>>>
>>> regards,
>>> Laxmikant Upadhyay
>>>
>>> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>


-- 

regards,
Laxmikant Upadhyay

Re: High IO and poor read performance on 3.11.2 cassandra cluster

Posted by Alexander Dejanovski <al...@thelastpickle.com>.
Don't forget to run "nodetool upgradesstables -a" after running the ALTER
statement, so that all existing SSTables get rewritten with the new
compression settings.
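As a sketch, the two steps could look like the following (the keyspace and
table names match the schema shared in this thread; a chunk length of 16 KiB
is just one of the values suggested here, not a prescription):

```shell
# 1. Lower the compression chunk size on the table (run from cqlsh)
cqlsh -e "ALTER TABLE ks.xyz WITH compression = \
    {'class': 'LZ4Compressor', 'chunk_length_in_kb': 16};"

# 2. Force a rewrite of ALL SSTables so the new chunk size takes effect;
#    without -a, SSTables already on the current format version are skipped
nodetool upgradesstables -a ks xyz
```

Rewriting ~187 GiB of SSTables per node will take time and temporary disk
space, so schedule it accordingly.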

Since you have a lot of tables in your cluster, be aware that lowering the
chunk length will increase Cassandra's off-heap memory usage (the compression
offsets metadata grows as chunks shrink). You can find more information here:
http://thelastpickle.com/blog/2018/08/08/compression_performance.html

You should also check your readahead setting, as it may be too high:
sudo blockdev --report
The default is usually 256 sectors (128 KiB), but Cassandra favors low
readahead values to get more IOPS instead of more throughput (and readahead
is usually not that useful for Cassandra). A conservative setting is 64; you
can go as low as 8 and see how Cassandra performs.
Note that changing the readahead setting requires restarting Cassandra, as
the value is only read once by the JVM during startup.
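For illustration, assuming the data disk is /dev/sdb (the device shown in the
df output earlier in this thread), the change might look like:

```shell
# Show current readahead (the RA column, in 512-byte sectors) for all devices
sudo blockdev --report

# Set readahead on the data disk to 64 sectors (64 * 512 B = 32 KiB)
sudo blockdev --setra 64 /dev/sdb

# --setra does not survive a reboot; persist it with a udev rule or an init
# script for your distribution, then restart Cassandra to pick up the change
```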

Cheers,

On Wed, Sep 5, 2018 at 7:27 AM CPC <ac...@gmail.com> wrote:

> Could you decrease chunk_length_in_kb to 16 or 8 and repeat the test.
>
> On Wed, Sep 5, 2018, 5:51 AM wxn002@zjqunshuo.com <wx...@zjqunshuo.com>
> wrote:
>
>> How large is your row? You may meet reading wide row problem.
>>
>> -Simon
>>
>> *From:* Laxmikant Upadhyay <la...@gmail.com>
>> *Date:* 2018-09-05 01:01
>> *To:* user <us...@cassandra.apache.org>
>> *Subject:* High IO and poor read performance on 3.11.2 cassandra cluster
>>
>> [original message trimmed; quoted in full earlier in this thread]
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

Re: High IO and poor read performance on 3.11.2 cassandra cluster

Posted by CPC <ac...@gmail.com>.
Could you decrease chunk_length_in_kb to 16 or 8 and repeat the test.

On Wed, Sep 5, 2018, 5:51 AM wxn002@zjqunshuo.com <wx...@zjqunshuo.com>
wrote:

> How large is your row? You may meet reading wide row problem.
>
> -Simon
>
> *From:* Laxmikant Upadhyay <la...@gmail.com>
> *Date:* 2018-09-05 01:01
> *To:* user <us...@cassandra.apache.org>
> *Subject:* High IO and poor read performance on 3.11.2 cassandra cluster
> We have 3 node cassandra cluster (3.11.2) in single dc.
>
> We have written 450 million records on the table with LCS. The write
> latency is fine.  After write we perform read and update operations.
>
> When we run read+update operations on newly inserted 1 million records (on
> top of 450 m records) then the read latency and io usage is under control.
> However when we perform read+update on old 1 million records which are part
> of 450 million records we observe high read latency (The performance goes
> down by 4 times in comparison 1st case ).  We have not observed major gc
> pauses.
>
> *system information:*
> *cpu core :*  24
> *disk type:* SSD. We are using RAID with the deadline scheduler
> *disk space:*
> df -h :
> Filesystem                      Size  Used Avail Use% Mounted on
> /dev/sdb1                        1.9T  393G  1.5T  22% /var/lib/cassandra
> *memory:*
> free -g
>               total        used        free      shared  buff/cache
>  available
> Mem:             62          30           0           0          32
>   31
> Swap:             8           0           8
>
> ==========================================
>
> *schema*
>
> desc table ks.xyz;
>
> CREATE TABLE ks.xyz (
>     key text,
>     column1 text,
>     value text,
>     PRIMARY KEY (key, column1)
> ) WITH COMPACT STORAGE
>     AND CLUSTERING ORDER BY (column1 ASC)
>     AND bloom_filter_fp_chance = 0.1
>     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>     AND comment = ''
>     AND compaction = {'class':
> 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
>     AND compression = {'chunk_length_in_kb': '64', 'class':
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND crc_check_chance = 1.0
>     AND dclocal_read_repair_chance = 0.0
>     AND default_time_to_live = 0
>     AND gc_grace_seconds = 864000
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.0
>     AND speculative_retry = '99PERCENTILE';
>
> ==============================================================================
> Below is a system stats snippet captured while read operations were running:
>
> *iotop -o * : Observation : the disk read goes up to 5.5 G/s
>
> Total DISK READ :       *3.86 G/s* | Total DISK WRITE :    1252.88 K/s
> Actual DISK READ:      * 3.92 G/s* | Actual DISK WRITE:       0.00 B/s
>   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
> 10715 be/4 cassandr  375.89 M/s   99.79 K/s  0.00 % 29.15 % java
> -Dorg.xerial.snappy.tempdir=/var/tmp
> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
> 10714 be/4 cassandr  358.56 M/s  107.18 K/s  0.00 % 27.06 % java
> -Dorg.xerial.snappy.tempdir=/var/tmp
> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
> 10712 be/4 cassandr  351.86 M/s  147.83 K/s  0.00 % 25.02 % java
> -Dorg.xerial.snappy.tempdir=/var/tmp
> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
> 10718 be/4 cassandr  359.82 M/s  110.87 K/s  0.00 % 24.49 % java
> -Dorg.xerial.snappy.tempdir=/var/tmp
> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
> 10711 be/4 cassandr  333.03 M/s  125.66 K/s  0.00 % 23.37 % java
> -Dorg.xerial.snappy.tempdir=/var/tmp
> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
> 10716 be/4 cassandr  330.80 M/s  103.48 K/s  0.00 % 23.02 % java
> -Dorg.xerial.snappy.tempdir=/var/tmp
> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
> 10717 be/4 cassandr  319.49 M/s  118.27 K/s  0.00 % 22.11 % java
> -Dorg.xerial.snappy.tempdir=/var/tmp
> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
> 10713 be/4 cassandr  300.62 M/s  118.27 K/s  0.00 % 21.65 % java
> -Dorg.xerial.snappy.tempdir=/var/tmp
> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
> 10719 be/4 cassandr  294.98 M/s   81.31 K/s  0.00 % 21.60 % java
> -Dorg.xerial.snappy.tempdir=/var/tmp
> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
> 10720 be/4 cassandr  289.00 M/s   73.92 K/s  0.00 % 21.45 % java
> -Dorg.xerial.snappy.tempdir=/var/tmp
> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
> 10742 be/4 cassandr  240.98 M/s   81.31 K/s  0.00 % 17.68 % java
> -Dorg.xerial.snappy.tempdir=/var/tmp
> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
> 10743 be/4 cassandr  224.43 M/s   36.96 K/s  0.00 % 17.57 % java
> -Dorg.xerial.snappy.tempdir=/var/tmp
> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
> 10744 be/4 cassandr  113.29 M/s   14.78 K/s  0.00 % 10.22 % java
> -Dorg.xerial.snappy.tempdir=/var/tmp
> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
> 10745 be/4 cassandr   61.63 M/s   33.26 K/s  0.00 %  4.20 % java
> -Dorg.xerial.snappy.tempdir=/var/tmp
> -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>
>
> ==================================================================================================
> *iostat -x 5*
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            2.35    0.00    7.09  *11.26*    0.00   79.30
>
> Device:  rrqm/s  wrqm/s      r/s   w/s       rkB/s    wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
> sdb      205.60    0.00  5304.60  3.00  3707651.20  2450.40   1398.03     11.37   2.14     2.14     0.67   0.18  93.60
>
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            2.99    0.00    7.49  *10.95*    0.00   78.56
>
> Device:  rrqm/s  wrqm/s      r/s   w/s       rkB/s    wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
> sdb      228.00    0.00  5639.40  4.80  3947922.40  1348.00   1399.41     12.65   2.24     2.25     0.46   0.17  94.00
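A quick back-of-envelope on the first iostat sample above (a sketch; the values are copied from the sdb row). Dividing rkB/s by r/s gives the average amount of data pulled from disk per read request, roughly 700 kB, which is far larger than the table's 64 kB compression chunk. One possible interpretation is that OS readahead is inflating each read, though that would need to be verified on the host.

```python
# Rough arithmetic from the first iostat sample for sdb.
rkb_per_s = 3_707_651.20   # rkB/s reported for sdb
reads_per_s = 5_304.60     # r/s reported for sdb

kb_per_read = rkb_per_s / reads_per_s
print(f"~{kb_per_read:.0f} kB read from disk per read request")

# Cross-check: avgrq-sz is in 512-byte sectors (it covers reads and
# writes, but at 3 writes/s the writes are negligible here).
kb_from_avgrq = 1398.03 * 512 / 1024
print(f"~{kb_from_avgrq:.0f} kB per request from avgrq-sz")
```

Both routes land on about 699 kB per request, so the two iostat columns are at least internally consistent.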
>
>
> ==================================================================================================
> *nodetool info*
> ID                     : ddab003d-41d1-413e-ace2-87ccd34229d5
> Gossip active          : true
> Thrift active          : false
> Native Transport active: true
> Load                   : 391.58 GiB
> Generation No          : 1536065356
> Uptime (seconds)       : 4714
> Heap Memory (MB)       : 5172.52 / 13312.00
> Off Heap Memory (MB)   : 2288.28
> Data Center            : DC1
> Rack                   : RAC1
> Exceptions             : 0
> Key Cache              : entries 1604008, size 167.04 MiB, capacity 256
> MiB, 3615 hits, 1801574 requests, 0.002 recent hit rate, 14400 save period
> in seconds
> Row Cache              : entries 0, size 0 bytes, capacity 0 bytes, 0
> hits, 0 requests, NaN recent hit rate, 0 save period in seconds
> Counter Cache          : entries 0, size 0 bytes, capacity 50 MiB, 0 hits,
> 0 requests, NaN recent hit rate, 7200 save period in seconds
> Chunk Cache            : entries 7680, size 480 MiB, capacity 480 MiB,
> 1399440 misses, 2338352 requests, 0.402 recent hit rate, 2760.976
> microseconds miss latency
> Percent Repaired       : 1.6357850897532832E-8%
> Token                  : (invoke with -T/--tokens to see all 256 tokens)
>
>
> =============================================================================
>
> *nodetool tablehistograms ks xyz* : Observation : not a wide row
> ks/xyz histograms
> Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
>                            (micros)      (micros)         (bytes)
> 50%             1.00          17.08        5839.59            1109          17
> 75%             1.00          20.50       *7007.51*           1331          24
> 95%             2.00          29.52      *12108.97*           1331          24
> 98%             2.00          35.43       14530.76            1331          24
> 99%             2.00          42.51       17436.92            1331          24
> Min             1.00           5.72         105.78             150           2
> Max             2.00         315.85      464228.84           1916           35
>
> *nodetool tablehistograms ks xyz* (second run)
> ks/xyz histograms
> Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
>                            (micros)      (micros)         (bytes)
> 50%             1.00          17.08        5839.59            1109          17
> 75%             1.00          20.50        7007.51            1331          24
> 95%             2.00          29.52       12108.97            1331          24
> 98%             2.00          35.43       14530.76            1331          24
> 99%             2.00          42.51       17436.92            1331          24
> Min             1.00           4.77         105.78             150           2
> Max             2.00         315.85      464228.84           1916           35
>
>
>
> ======================================================================
>
> *nodetool tpstats*
>
> Pool Name                     Active  Pending  Completed  Blocked  All time blocked
> ReadStage                          3        1    1603986        0                 0
> MiscStage                          0        0          0        0                 0
> CompactionExecutor                 0        0       6670        0                 0
> MutationStage                      0        0    5044371        0                 0
> MemtableReclaimMemory              0        0         57        0                 0
> PendingRangeCalculator             0        0          3        0                 0
> GossipStage                        0        0      14299        0                 0
> SecondaryIndexManagement           0        0          0        0                 0
> HintsDispatcher                    0        0          0        0                 0
> RequestResponseStage               0        1    3867575        0                 0
> Native-Transport-Requests          3        0    2319802        0                 0
> ReadRepairStage                    0        0      13281        0                 0
> CounterMutationStage               0        0          0        0                 0
> MigrationStage                     0        0          2        0                 0
> MemtablePostFlush                  0        0         65        0                 0
> PerDiskMemtableFlushWriter_0       0        0         57        0                 0
> ValidationExecutor                 0        0          0        0                 0
> Sampler                            0        0          0        0                 0
> MemtableFlushWriter                0        0         57        0                 0
> InternalResponseStage              0        0          0        0                 0
> ViewMutationStage                  0        0          0        0                 0
> AntiEntropyStage                   0        0          0        0                 0
> CacheCleanupExecutor               0        0          0        0                 0
>
> Message type           Dropped
> READ                         0
> RANGE_SLICE                  0
> _TRACE                       0
> HINT                         0
> MUTATION                     0
> COUNTER_MUTATION             0
> BATCH_STORE                  0
> BATCH_REMOVE                 0
> REQUEST_RESPONSE             0
> PAGED_RANGE                  0
> READ_REPAIR                  0
>
> =====================================================================
>
> *nodetool compactionstats*
> pending tasks: 0
>
>
> ==============================================================================
> *debug.log Snippet*
>
>
> WARN  [PERIODIC-COMMIT-LOG-SYNCER] 2018-09-04 19:39:51,057
> NoSpamLogger.java:94 - Out of 29 commit log syncs over the past 263.94s
> with average duration of 15.17ms, 2 have exceeded the configured commit
> interval by an average of 20.32ms
> DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-04 19:39:53,015
> AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating
> a fresh one
> DEBUG [SlabPoolCleaner] 2018-09-04 19:39:59,659
> ColumnFamilyStore.java:1308 - Flushing largest CFS(Keyspace='system',
> ColumnFamily='batches') to free up room. Used total: 0.33/0.00, live:
> 0.33/0.00, flushing: 0.00/0.00, this: 0.13/0.00
> DEBUG [SlabPoolCleaner] 2018-09-04 19:39:59,659 ColumnFamilyStore.java:918
> - Enqueuing flush of batches: 128.567MiB (13%) on-heap, 0.000KiB (0%)
> off-heap
> DEBUG [PerDiskMemtableFlushWriter_0:12] 2018-09-04 19:39:59,685
> Memtable.java:456 - Writing Memtable-batches@1450035834(78.630MiB
> serialized bytes, 125418 ops, 13%/0% of on/off-heap limit), flushed range =
> (null, null]
> DEBUG [PerDiskMemtableFlushWriter_0:12] 2018-09-04 19:39:59,695
> Memtable.java:485 - Completed flushing
> /var/lib/cassandra/data/system/batches-919a4bc57a333573b03e13fc3f68b465/mc-16-big-Data.db
> (0.000KiB) for commitlog position
> CommitLogPosition(segmentId=1536065319618, position=7958044)
> DEBUG [MemtableFlushWriter:12] 2018-09-04 19:39:59,695
> ColumnFamilyStore.java:1216 - Flushed to [] (0 sstables, 0.000KiB), biggest
> 0.000KiB, smallest 8589934592.000GiB
> DEBUG [ScheduledTasks:1] 2018-09-04 19:40:15,710 MonitoringTask.java:173 -
> 2 operations were slow in the last 4999 msecs:
> <SELECT * FROM ks1.abc WHERE key = 12345 LIMIT 5000>, time 575 msec - slow
> timeout 500 msec/cross-node
> <SELECT * FROM ks.xyz WHERE key = key1 LIMIT 5000>, time 645 msec - slow
> timeout 500 msec/cross-node
> DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-04 19:40:20,475
> AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating
> a fresh one
> DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-04 19:40:46,675
> AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating
> a fresh one
> DEBUG [SlabPoolCleaner] 2018-09-04 19:41:04,976
> ColumnFamilyStore.java:1308 - Flushing largest CFS(Keyspace='ks',
> ColumnFamily='xyz') to free up room. Used total: 0.33/0.00, live:
> 0.33/0.00, flushing: 0.00/0.00, this: 0.12/0.00
> DEBUG [SlabPoolCleaner] 2018-09-04 19:41:04,977 ColumnFamilyStore.java:918
> - Enqueuing flush of xyz: 121.374MiB (12%) on-heap, 0.000KiB (0%) off-heap
>
> *Observation :* we frequently see "operations were slow in the last ..." messages for select
> queries, along with repeated "commit log syncs over the past ..." warnings.
> ===================================================
>
> *nodetool tablestats -H ks.xyz*
> Total number of tables: 89
> ----------------
> Keyspace : ks
>         Read Count: 1439722
>         Read Latency: 1.8982509581710914 ms
>         Write Count: 4222811
>         Write Latency: 0.016324778684151386 ms
>         Pending Flushes: 0
>                 Table: xyz
>                 SSTable count: 1036
>                 SSTables in each level: [1, 10, 116/100, 909, 0, 0, 0, 0, 0]
>                 Space used (live): 187.09 GiB
>                 Space used (total): 187.09 GiB
>                 Space used by snapshots (total): 0 bytes
>                 Off heap memory used (total): 783.93 MiB
>                 SSTable Compression Ratio: 0.3238726404414842
>                 Number of partitions (estimate): 447095605
>                 Memtable cell count: 306194
>                 Memtable data size: 20.59 MiB
>                 Memtable off heap memory used: 0 bytes
>                 Memtable switch count: 7
>                 Local read count: 1440322
>                 Local read latency: 6.785 ms
>                 Local write count: 1408204
>                 Local write latency: 0.021 ms
>                 Pending flushes: 0
>                 Percent repaired: 0.0
>                 Bloom filter false positives: 19
>                 Bloom filter false ratio: 0.00003
>                 Bloom filter space used: 418.2 MiB
>                 Bloom filter off heap memory used: 418.19 MiB
>                 Index summary off heap memory used: 307.75 MiB
>                 Compression metadata off heap memory used: 57.99 MiB
>                 Compacted partition minimum bytes: 150
>                 Compacted partition maximum bytes: 1916
>                 Compacted partition mean bytes: 1003
>                 Average live cells per slice (last five minutes): 20.0
>                 Maximum live cells per slice (last five minutes): 20
>                 Average tombstones per slice (last five minutes): 1.0
>                 Maximum tombstones per slice (last five minutes): 1
>                 Dropped Mutations: 0 bytes
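One more hedged back-of-envelope, this time from the schema and tablestats figures above: with chunk_length_in_kb = 64 and the reported compression ratio, a point read that misses the chunk cache must read and decompress a whole chunk of roughly 21 kB on disk to serve a partition whose mean size is about 1 kB, i.e. around 20x amplification per SSTable touched. This is only arithmetic on the reported numbers, not a measurement.

```python
# Sketch using chunk_length_in_kb from the schema and the tablestats output.
chunk_uncompressed_bytes = 64 * 1024          # chunk_length_in_kb = 64
compression_ratio = 0.3238726404414842        # SSTable Compression Ratio
mean_partition_bytes = 1003                   # Compacted partition mean bytes

# A chunk-cache miss fetches and decompresses at least one whole chunk.
chunk_on_disk = chunk_uncompressed_bytes * compression_ratio
amplification = chunk_on_disk / mean_partition_bytes
print(f"~{chunk_on_disk/1024:.1f} kB read per chunk, "
      f"~{amplification:.0f}x the mean partition size")
```

If this interpretation holds, a smaller chunk_length_in_kb (e.g. 4 or 16) would shrink the per-read disk and decompression work for this small-partition workload; that is a suggestion to test, not a measured result.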
>
> --
>
> regards,
> Laxmikant Upadhyay
>
>