
Posted to user@cassandra.apache.org by onmstester onmstester <on...@zoho.com> on 2018/06/17 07:24:07 UTC

Write performance degradation

Hi, 



I was doing 500K inserts + 100K counter updates per second on my cluster of 12 nodes (20 cores / 128 GB RAM / 4 x 600 GB 10K RPM HDDs each) using batch statements, with no problem.

I saw a lot of warnings showing that most of the batches did not target a single node, so they should not have been batched. At the same time, the input load of my application increased by 50%, so I switched to non-batched async inserts and increased the number of client threads, which raised the write load on the cluster by 50%.
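The batch warning mentioned above concerns batches whose statements span several partitions. A minimal sketch of the usual alternative, assuming rows can be grouped on the client by partition key: each group touches one partition and could be sent as a single batch, while rows for other partitions become separate async inserts. The `Row` record and its fields are hypothetical (the thread does not show the schema), no driver calls are included, and records need Java 16+.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SinglePartitionBatches {

    // Hypothetical row shape: the thread does not show the real table schema.
    record Row(String partitionKey, String clusteringKey, long value) {}

    // Groups rows by partition key, preserving first-seen order. Every group
    // touches exactly one partition, so it could be sent as one batch without
    // triggering the multi-partition batch warning.
    static Map<String, List<Row>> groupByPartition(List<Row> rows) {
        Map<String, List<Row>> groups = new LinkedHashMap<>();
        for (Row row : rows) {
            groups.computeIfAbsent(row.partitionKey(), k -> new ArrayList<>()).add(row);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row("sensor-1", "t1", 10),
                new Row("sensor-2", "t1", 20),
                new Row("sensor-1", "t2", 11));
        // Prints one line per partition: sensor-1 has 2 rows, sensor-2 has 1.
        groupByPartition(rows).forEach((pk, group) ->
                System.out.println(pk + " -> " + group.size() + " row(s)"));
    }
}
```

Grouping by partition is stricter than grouping by node, but it is the case the warning is designed around, and it does not require the client to know the token ring.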

The system worked for 2 days with no problem at 750K inserts + 150K counter updates per second, but then suddenly a lot of insert timeouts appeared in the log files.

Decreasing the input load back to the previous level, or even below it, did not help.

When I restart my client (after it has been logging timeouts and errors for some hours), it works without problems for 20 minutes and then starts logging timeout errors again.
CPU load on the cluster nodes is less than 25%.

How can I solve this problem? I'm saving all of Cassandra's JMX metrics in a monitoring system. What should I check?



Sent using Zoho Mail

Re: Write performance degradation

Posted by onmstester onmstester <on...@zoho.com>.

I think I have pinpointed the problem. I have a table whose partition key is based on a timestamp, so for a whole hour a lot of data is inserted into one single node. This table creates very big partitions (300MB-600MB). Whichever node the current partition of that table is being written to reports a huge number of DroppedMutations (sometimes 6M in 5 minutes), and when the load increases, that single node in my cluster slows down.

So I think I should change my data model and add sharding to the partition key of the problematic table.
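One way to shard a timestamp-based partition key is to add a bucket column, so each hour's writes spread over N partitions instead of one. A minimal sketch, assuming the current key is the timestamp truncated to the hour and the shard is picked by hashing a hypothetical per-row id (`rowId`). N is sized from the 300MB-600MB partitions reported above against an assumed target of about 100 MB per partition (an often-quoted rule of thumb, not a hard limit): ceil(600 / 100) = 6 shards, provided the ids hash roughly evenly. All names are illustrative.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class ShardedPartitionKey {

    // Assumed current key: the write timestamp truncated to the hour.
    // A matching hypothetical CQL key would be PRIMARY KEY ((hour, shard), ts).
    static long hourBucket(Instant ts) {
        return ts.truncatedTo(ChronoUnit.HOURS).getEpochSecond();
    }

    // Picks one of shardCount buckets from a hypothetical per-row id, so the
    // same id always maps to the same shard within an hour.
    static int shard(String rowId, int shardCount) {
        return Math.floorMod(rowId.hashCode(), shardCount);
    }

    // Ceiling division: smallest shard count that keeps each partition at or
    // under the target size, assuming rows spread evenly across shards.
    static int shardsNeeded(long largestPartitionMb, long targetPartitionMb) {
        return (int) ((largestPartitionMb + targetPartitionMb - 1) / targetPartitionMb);
    }

    public static void main(String[] args) {
        int shards = shardsNeeded(600, 100); // 600 MB worst case, ~100 MB assumed target -> 6
        Instant ts = Instant.parse("2018-06-18T10:42:00Z");
        for (String id : new String[] {"id-1", "id-2", "id-3"}) {
            // Each row lands in partition (hour, shard) instead of (hour).
            System.out.println("(" + hourBucket(ts) + ", " + shard(id, shards) + ")");
        }
    }
}
```

The trade-off is on the read side: reading one hour now means querying all N `(hour, shard)` partitions and merging the results. Hashing a stable id keeps each row in a predictable shard; a random shard spreads writes equally well but makes lookups by id harder.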

---- On Mon, 18 Jun 2018 16:24:48 +0430 DuyHai Doan <doanduyhai@gmail.com> wrote ----

Maybe the disk I/O cannot keep up with the high mutation rate?

Check the number of pending compactions.

Re: Write performance degradation

Posted by DuyHai Doan <do...@gmail.com>.

Maybe the disk I/O cannot keep up with the high mutation rate?

Check the number of pending compactions.
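Both metrics discussed in this thread, pending compactions and dropped mutations, can be read over JMX. A minimal sketch, assuming each node exposes JMX on Cassandra's default port 7199 with no SSL and no authentication. The MBean names follow Cassandra's metrics documentation (a gauge read via `Value`, a meter via `Count`); the class and helper names are hypothetical. `nodetool compactionstats` reports the same pending-task count from the command line.

```java
import javax.management.MBeanServerConnection;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
import java.net.MalformedURLException;

public class WriteHealthProbe {

    // Gauge: compactions waiting to run on this node.
    static final String PENDING_COMPACTIONS =
            "org.apache.cassandra.metrics:type=Compaction,name=PendingTasks";
    // Meter: MUTATION messages this node dropped instead of applying.
    static final String DROPPED_MUTATIONS =
            "org.apache.cassandra.metrics:type=DroppedMessage,scope=MUTATION,name=Dropped";

    // Wraps the checked exception so callers can build names inline.
    static ObjectName name(String s) {
        try {
            return new ObjectName(s);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Standard RMI-based JMX URL; assumes no SSL and no authentication.
    static JMXServiceURL serviceUrl(String host, int port) {
        try {
            return new JMXServiceURL("service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        // 7199 is Cassandra's default JMX port.
        try (JMXConnector connector = JMXConnectorFactory.connect(serviceUrl(host, 7199))) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            Object pending = mbs.getAttribute(name(PENDING_COMPACTIONS), "Value");
            Object dropped = mbs.getAttribute(name(DROPPED_MUTATIONS), "Count");
            System.out.println(host + " pending compactions: " + pending);
            System.out.println(host + " dropped mutations since start: " + dropped);
        }
    }
}
```

Running it against each node in turn (e.g. `java WriteHealthProbe.java <host>`) makes a hot node stand out: a pending-compaction count that keeps growing, or a dropped-mutation count that jumps on one node only, is consistent with the single-node slowdown described in this thread.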
