Posted to user@cassandra.apache.org by Rudi Bruchez <ru...@babaluga.com> on 2017/08/27 21:45:56 UTC
timeouts on counter tables
Hello,
On a 3-node cluster (each node: 48 cores, 32 GB RAM, SSD), I'm getting timeouts
on counter table UPDATEs.
One node in particular is slow and generates the timeouts; it is IO bound. iotop
consistently shows about 300 MB/s of reads, while writes fluctuate around 100 KB/s.
The keys seem well distributed.
The application uses a PHP driver, token aware, and sends updates
asynchronously from 11 client machines.
I don't know what the cause could be:
- too many concurrent UPDATEs in async mode?
- a counter type problem? We've given 1 GB to the counter cache.
- disk? SSD with software RAID 1
- a key hotspot?
I've compiled some information below. If someone has suggestions, or other
checks or lines of thought I might pursue, that would be great!
----------------------------------------
Cassandra version 3.11.0
*iostat* shows something like this on the slow node (software RAID 1 on
sda and sdb):
Device:  rrqm/s  wrqm/s      r/s    w/s   rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sda        2,00    0,00  2160,00   0,00  169,20   0,00   160,43   147,10  68,53   68,53    0,00   0,46 100,00
sdb        1,00    0,00  1289,00   0,00   87,35   0,00   138,79   148,00 109,07  109,07    0,00   0,78 100,00
*nodetool status*
UN X.X.X.X 52.15 GiB 256 66,7%
UN X.X.X.X 54.86 GiB 256 69,3%
UN X.X.X.X 49.18 GiB 256 64,0%
*table structure*
CREATE TABLE document_search (
id_document bigint,
search_type ascii,
searchkeyword_id bigint,
nb_click counter,
nb_display counter,
PRIMARY KEY ((id_document, search_type), searchkeyword_id)
) WITH CLUSTERING ORDER BY(searchkeyword_id ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
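For context, each application write against this table is a counter increment sent through the PHP driver, roughly along these lines (a simplified sketch; the exact statement and the key values here are made up):

$prepared = $cassandrasession->prepare(
    'UPDATE document_search ' .
    'SET nb_click = nb_click + 1, nb_display = nb_display + 1 ' .
    'WHERE id_document = ? AND search_type = ? AND searchkeyword_id = ?'
);
// bigint columns are bound as Cassandra\Bigint, ascii as a plain string
$cassandrasession->executeAsync($prepared, array(
    'arguments' => array(new Cassandra\Bigint(42), 'keyword', new Cassandra\Bigint(1001))
));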
*2 examples of nodetool tpstats at 2 different times*
1)
Pool Name                  Active  Pending  Completed  Blocked  All time blocked
Native-Transport-Requests     128     1083    1824166        0                 0
CounterMutationStage           32      338     710480        0                 0
2)
Pool Name                  Active  Pending  Completed  Blocked  All time blocked
ReadStage                      32      758     418822        0                 0
CounterMutationStage            0        0      98310        0                 0
*tablestats*
nodetool tablestats document_search
Total number of tables: 43
----------------
Read Count: 0
Read Latency: NaN ms.
Write Count: 288636
Write Latency: 2.354803579595061 ms.
Pending Flushes: 0
SSTable count: 11
Space used (live): 19683318113
Space used (total): 19683318113
Space used by snapshots (total): 0
Off heap memory used (total): 39258415
SSTable Compression Ratio: 0.3099081738824526
Number of keys (estimate): 4397936
Memtable cell count: 169182
Memtable data size: 20761379
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 0
Local read latency: NaN ms
Local write count: 169182
Local write latency: NaN ms
Pending flushes: 0
Percent repaired: 61.58
Bloom filter false positives: 1
Bloom filter false ratio: 0,00000
Bloom filter space used: 26271840
Bloom filter off heap memory used: 26271752
Index summary off heap memory used: 5496319
Compression metadata off heap memory used: 7490344
Compacted partition minimum bytes: 104
Compacted partition maximum bytes: 4055269
Compacted partition mean bytes: 3206
Average live cells per slice (last five minutes): NaN
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 19804
*nodetool info*
Gossip active : true
Thrift active : true
Native Transport active: true
Load : 53.85 GiB
Generation No : 1503674199
Uptime (seconds) : 194310
Heap Memory (MB) : 4663,19 / 7774,75
Off Heap Memory (MB) : 208,24
Exceptions : 0
Key Cache : entries 11987913, size 1,09 GiB, capacity 2 GiB, 129046135 hits, 144375554 requests, 0,894 recent hit rate, 14400 save period in seconds
Row Cache : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache : entries 7579853, size 1 GiB, capacity 1 GiB, 9479923 hits, 39619041 requests, 0,239 recent hit rate, 7200 save period in seconds
Chunk Cache : entries 97792, size 5,97 GiB, capacity 5,97 GiB, 38965356 misses, 182409581 requests, 0,786 recent hit rate, 56,113 microseconds miss latency
Percent Repaired : 46.78765116584098%
Re: timeouts on counter tables
Posted by Rudi Bruchez <ru...@babaluga.com>.
On 28/08/2017 at 03:30, kurt greaves wrote:
> If every node is a replica it sounds like you've got hardware issues.
> Have you compared iostat to the "normal" nodes? I assume there is
> nothing different in the logs on this one node?
> Also sanity check, you are using DCAwareRoundRobinPolicy?
>
Thanks for the answer. I had to concentrate on other things for a few
days; I'm back on this problem now.
The PHP driver call is:
$cassandrabuilder->withDatacenterAwareRoundRobinLoadBalancingPolicy("mycluster", 0, false)
    ->withTokenAwareRouting(true)
    ->withSchemaMetadata(true);
After that, the calls are made like this:
$result = $cassandra->execute(new Cassandra\SimpleStatement($query));
$cassandrasession->executeAsync($this->queryPrepared, array('arguments' => $values));
Could the async calls put too much pressure on the server? They come from 11
client machines.
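One thing we could try on the client side is to cap the number of in-flight async requests instead of firing them all unthrottled. A rough, untested sketch (the $updates array and the window size of 128 are only illustrative):

$window  = 128;          // illustrative cap on in-flight requests, to be tuned
$futures = array();

foreach ($updates as $values) {
    // send the prepared counter UPDATE asynchronously
    $futures[] = $cassandrasession->executeAsync(
        $this->queryPrepared,
        array('arguments' => $values)
    );

    // once the window is full, wait for the outstanding responses
    if (count($futures) >= $window) {
        foreach ($futures as $future) {
            $future->get(10);   // block up to 10 seconds per response
        }
        $futures = array();
    }
}

// drain whatever is still in flight at the end
foreach ($futures as $future) {
    $future->get(10);
}

That way each client machine never has more than $window counter updates outstanding at once.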
Thanks!
Re: timeouts on counter tables
Posted by Rudi Bruchez <ru...@babaluga.com>.
I'm going to try different options. Do any of you have experience with
tweaking one of these configuration parameters to improve read throughput,
especially in the case of counter tables?
1/ for SSDs:
trickle_fsync: true
trickle_fsync_interval_in_kb: 1024
2/ concurrent_compactors set to the number of cores
3/ concurrent_counter_writes
4/ row cache vs. chunk cache
5/ changing the compaction strategy to leveled, specifically when using
counter columns?
Thanks!
>> On 3 September 2017 at 20:25, Rudi Bruchez <rudi@babaluga.com
>> <ma...@babaluga.com>> wrote:
>>
>> On 30/08/2017 at 05:33, Erick Ramirez wrote:
>>> Is it possible at all that you may have a data hotspot if it's
>>> not hardware-related?
>>>
>>>
>> It does not seem so, The partition key seems well distributed and
>> the queries update different keys.
>>
>> We have dropped counter_mutation messages in the log :
>>
>> COUNTER_MUTATION messages were dropped in last 5000 ms: 0
>> internal and 2 cross node. Mean internal dropped latency: 0 ms
>> and Mean cross-node dropped latency: 5960 ms
>>
>> Pool Name                  Active  Pending  Completed  Blocked  All Time Blocked
>> ReadStage                      32      503    7481787        0                 0
>> CounterMutationStage           32      221    5722101        0                 0
>>
>> The load could be too high ?
>>
>> Thanks
>>
>>
>
Re: timeouts on counter tables
Posted by Rudi Bruchez <ru...@babaluga.com>.
It can happen on any of the nodes. We can have a large number of pending
tasks on ReadStage and CounterMutationStage. We'll try increasing
concurrent_counter_writes to see how it changes things.
> Likely. I believe counter mutations are a tad more expensive than a
> normal mutation. If you're doing a lot of counter updates that
> probably doesn't help. Regardless, high amounts of pending
> reads/mutations is generally not good and indicates the node being
> overloaded. Are you just seeing this on the 1 node with IO issues or
> do other nodes have this problem as well?
>
> On 3 September 2017 at 20:25, Rudi Bruchez <rudi@babaluga.com
> <ma...@babaluga.com>> wrote:
>
> On 30/08/2017 at 05:33, Erick Ramirez wrote:
>> Is it possible at all that you may have a data hotspot if it's
>> not hardware-related?
>>
>>
> It does not seem so, The partition key seems well distributed and
> the queries update different keys.
>
> We have dropped counter_mutation messages in the log :
>
> COUNTER_MUTATION messages were dropped in last 5000 ms: 0 internal
> and 2 cross node. Mean internal dropped latency: 0 ms and Mean
> cross-node dropped latency: 5960 ms
>
> Pool Name                  Active  Pending  Completed  Blocked  All Time Blocked
> ReadStage                      32      503    7481787        0                 0
> CounterMutationStage           32      221    5722101        0                 0
>
> The load could be too high ?
>
> Thanks
>
>
Re: timeouts on counter tables
Posted by kurt greaves <ku...@instaclustr.com>.
Likely. I believe counter mutations are a tad more expensive than a normal
mutation. If you're doing a lot of counter updates that probably doesn't
help. Regardless, a high number of pending reads/mutations is generally not
good and indicates that the node is overloaded. Are you just seeing this on
the one node with IO issues, or do other nodes have this problem as well?
On 3 September 2017 at 20:25, Rudi Bruchez <ru...@babaluga.com> wrote:
> On 30/08/2017 at 05:33, Erick Ramirez wrote:
>
> Is it possible at all that you may have a data hotspot if it's not
> hardware-related?
>
>
> It does not seem so, The partition key seems well distributed and the
> queries update different keys.
>
> We have dropped counter_mutation messages in the log :
>
> COUNTER_MUTATION messages were dropped in last 5000 ms: 0 internal and 2
> cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped
> latency: 5960 ms
>
> Pool Name                  Active  Pending  Completed  Blocked  All Time Blocked
> ReadStage                      32      503    7481787        0                 0
> CounterMutationStage           32      221    5722101        0                 0
>
> The load could be too high ?
>
> Thanks
>
Re: timeouts on counter tables
Posted by Rudi Bruchez <ru...@babaluga.com>.
On 30/08/2017 at 05:33, Erick Ramirez wrote:
> Is it possible at all that you may have a data hotspot if it's not
> hardware-related?
>
>
It does not seem so. The partition key seems well distributed and the
queries update different keys.
We see dropped COUNTER_MUTATION messages in the log:
COUNTER_MUTATION messages were dropped in last 5000 ms: 0 internal and 2
cross node. Mean internal dropped latency: 0 ms and Mean cross-node
dropped latency: 5960 ms
Pool Name                  Active  Pending  Completed  Blocked  All Time Blocked
ReadStage                      32      503    7481787        0                 0
CounterMutationStage           32      221    5722101        0                 0
Could the load be too high?
Thanks
Re: timeouts on counter tables
Posted by Erick Ramirez <fl...@gmail.com>.
Is it possible at all that you may have a data hotspot if it's not
hardware-related?
On Mon, Aug 28, 2017 at 11:30 AM, kurt greaves <ku...@instaclustr.com> wrote:
> If every node is a replica it sounds like you've got hardware issues. Have
> you compared iostat to the "normal" nodes? I assume there is nothing
> different in the logs on this one node?
> Also sanity check, you are using DCAwareRoundRobinPolicy?
>
>
Re: timeouts on counter tables
Posted by kurt greaves <ku...@instaclustr.com>.
If every node is a replica it sounds like you've got hardware issues. Have
you compared iostat to the "normal" nodes? I assume there is nothing
different in the logs on this one node?
Also sanity check, you are using DCAwareRoundRobinPolicy?
Re: timeouts on counter tables
Posted by Rudi Bruchez <ru...@babaluga.com>.
On 28/08/2017 at 00:11, kurt greaves wrote:
> What is your RF?
>
> Also, as a side note RAID 1 shouldn't be necessary if you have >1 RF
> and would give you worse performance
RF 2, plus 1 on a single backup node. Consistency is ONE. You're right about
RAID 1; if disk performance is the problem, dropping it might be a way to
improve on that side. Still, it's strange that only one node suffers from IO problems.
Re: timeouts on counter tables
Posted by kurt greaves <ku...@instaclustr.com>.
What is your RF?
Also, as a side note, RAID 1 shouldn't be necessary if you have RF > 1, and
it would give you worse performance.