Posted to user@cassandra.apache.org by Rajsekhar Mallick <ra...@gmail.com> on 2019/02/06 05:29:51 UTC

ReadStage filling up and leading to Read Timeouts

Hello Team,

Cluster Details:
1. Number of nodes in cluster: 7
2. Number of CPU cores: 48
3. Swap is enabled on all nodes
4. Memory available on all nodes: 120GB
5. Disk space available: 745GB
6. Cassandra version: 2.1
7. Active tables are using the size-tiered compaction strategy
8. Read throughput: 6000 reads/s on each node (42000 reads/s cluster-wide)
9. Read latency 99%: 300 ms
10. Write throughput: 1800 writes/s
11. Write latency 99%: 50 ms
12. Known issues in the cluster: large partitions (up to 560MB, observed when they get compacted) and tombstones
13. To reduce the impact of tombstones, gc_grace_seconds is set to 0 for the active tables
14. Heap size: 48 GB, G1GC
15. Read timeout: 5000ms, write timeout: 2000ms
16. Number of concurrent reads: 64
17. Number of connections from clients on port 9042 stays almost constant (close to 1800)
18. Cassandra thread count also stays almost constant (close to 2000)

Problem Statement:
1. ReadStage often fills up (reaches its max size of 64) on 2 to 3 nodes, and pending reads go up to 4000.
2. When this happens, Native-Transport-Stage fills up on neighbouring nodes (1024 max) and pending threads are also observed (see the JMX sketch after this list).
3. During this time, the CPU load average rises and user % for the Cassandra process reaches 90%.
4. We see reads getting dropped, and errors from the org.apache.cassandra.transport package reporting read timeouts.
5. Read latency 99% reaches 5 seconds and clients start seeing the impact.
6. No I/O wait is observed on any of the virtual cores; the sjk ttop command shows most user CPU time being consumed by “Worker Threads”.
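
For reference, a minimal sketch of polling these two pending-task counters over JMX (default port 7199), assuming remote JMX access is enabled on the node; the host and poll interval below are illustrative, and the MBean names are the ones Cassandra 2.1 exposes for its thread-pool metrics:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class PendingTasksWatcher {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi"); // illustrative host
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();

                // The same numbers nodetool tpstats reports for ReadStage and
                // Native-Transport-Requests, read directly from the metrics MBeans.
                ObjectName readStage = new ObjectName(
                        "org.apache.cassandra.metrics:type=ThreadPools,path=request,scope=ReadStage,name=PendingTasks");
                ObjectName nativeTransport = new ObjectName(
                        "org.apache.cassandra.metrics:type=ThreadPools,path=transport,scope=Native-Transport-Requests,name=PendingTasks");

                for (int i = 0; i < 12; i++) {
                    System.out.printf("ReadStage pending=%s, Native-Transport-Requests pending=%s%n",
                            mbs.getAttribute(readStage, "Value"),
                            mbs.getAttribute(nativeTransport, "Value"));
                    Thread.sleep(5000); // poll every 5 seconds (illustrative)
                }
            }
        }
    }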

I have been trying hard to zero in on the exact issue.
What I make of the above observations is that there might be some slow queries which get stuck on a few nodes.
There is then a cascading effect wherein other queries get queued up behind them.
I have been unable to identify any such slow queries so far.
As I mentioned, there are large partitions. We are using the size-tiered compaction strategy, so a large partition might be spread across multiple sstables.
Can this lead to slow queries? I also understand that data in sstables is stored in serialized form and is deserialized when read into memory. This would create a large object in memory which then needs to be transferred across the wire to the client.

Not sure what the reason might be. Kindly help me understand what the impact on read performance is when we have large partitions.
Kindly suggest ways to catch these slow queries (see the sketch below).
Also do point out any other issues you see in the above details.
We are now considering expanding our cluster. Is the cluster under-sized? Will adding nodes help resolve the issue?
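
Since Cassandra 2.1 has no server-side slow query log, one way to catch them is on the client side. A minimal sketch, assuming the application uses the DataStax Java driver 3.x; the class name and the 500 ms threshold are illustrative:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Host;
    import com.datastax.driver.core.LatencyTracker;
    import com.datastax.driver.core.Statement;
    import java.util.concurrent.TimeUnit;

    // Logs any statement that takes longer than a threshold, so slow queries
    // (and the partitions they hit) can be identified from the client side.
    public class SlowQueryLogger implements LatencyTracker {

        private static final long THRESHOLD_NANOS = TimeUnit.MILLISECONDS.toNanos(500); // illustrative

        @Override
        public void update(Host host, Statement statement, Exception exception, long newLatencyNanos) {
            if (newLatencyNanos > THRESHOLD_NANOS) {
                System.err.printf("Slow query (%d ms) on %s: %s%n",
                        TimeUnit.NANOSECONDS.toMillis(newLatencyNanos), host, statement);
            }
        }

        @Override
        public void onRegister(Cluster cluster) {
            // No setup needed for this sketch.
        }

        @Override
        public void onUnregister(Cluster cluster) {
            // No teardown needed for this sketch.
        }
    }

    // Usage: cluster.register(new SlowQueryLogger());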

Thanks,
Rajsekhar Mallick





---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
For additional commands, e-mail: user-help@cassandra.apache.org


Re: ReadStage filling up and leading to Read Timeouts

Posted by Rajsekhar Mallick <ra...@gmail.com>.
Thank you Jeff for the link.
Please do comment on the G1GC settings and whether they are OK for the cluster.
Also, please comment on reducing concurrent reads to 32 on all nodes in the cluster, as that has earlier led to reads getting dropped.
Will adding nodes to the cluster be helpful?

Thanks,
Rajsekhar Mallick




Re: ReadStage filling up and leading to Read Timeouts

Posted by Jeff Jirsa <jj...@gmail.com>.
https://docs.datastax.com/en/developer/java-driver/3.2/manual/paging/


-- 
Jeff Jirsa
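
For context, with the 3.x Java driver the page size from that link can be lowered globally or per statement. A minimal sketch; the contact point, keyspace/table and fetch sizes are illustrative:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.QueryOptions;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.Statement;

    public class PagingExample {
        public static void main(String[] args) {
            // Global default page size for all queries (the driver default is 5000 rows).
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1") // illustrative contact point
                    .withQueryOptions(new QueryOptions().setFetchSize(500))
                    .build();
            Session session = cluster.connect();

            // Or per statement, e.g. for queries known to hit wide partitions.
            Statement stmt = new SimpleStatement(
                    "SELECT * FROM my_keyspace.my_table WHERE id = ?", "some-key")
                    .setFetchSize(100);

            ResultSet rs = session.execute(stmt);
            for (Row row : rs) {
                // The driver fetches further pages transparently as iteration proceeds,
                // so only one page of rows is materialized per request.
            }
            cluster.close();
        }
    }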



Re: ReadStage filling up and leading to Read Timeouts

Posted by Rajsekhar Mallick <ra...@gmail.com>.
Hello Jeff,

Thanks for the reply.
We do have GC logs enabled.
We do observe GC pauses of up to 2 seconds, but quite often we see this issue
even when the GC log reads good and clear.

JVM Flags related to G1GC:

-Xms48G
-Xmx48G
-XX:MaxGCPauseMillis=200
-XX:ParallelGCThreads=32
-XX:ConcGCThreads=10
-XX:InitiatingHeapOccupancyPercent=50

You talked about dropping the application page size. Please do elaborate on how to change it.
Reducing concurrent reads to 32 does help, as we have tried it: the CPU load average remains under the threshold, but read timeouts keep happening.

We will definitely try increasing the key cache sizes after verifying the
current max heap usage in the cluster.
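
For reference, both numbers can be checked over JMX (default port 7199) before making the change. A minimal sketch, assuming remote JMX access is enabled on the node; the host is illustrative:

    import java.lang.management.MemoryUsage;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.openmbean.CompositeData;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class CacheAndHeapCheck {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi"); // illustrative host
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();

                // Key cache hit rate as exposed by Cassandra's cache metrics MBean.
                Object hitRate = mbs.getAttribute(new ObjectName(
                        "org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=HitRate"), "Value");
                System.out.println("Key cache hit rate: " + hitRate);

                // Current heap usage from the standard JVM memory MBean.
                CompositeData heap = (CompositeData) mbs.getAttribute(
                        new ObjectName("java.lang:type=Memory"), "HeapMemoryUsage");
                MemoryUsage usage = MemoryUsage.from(heap);
                System.out.printf("Heap used: %d MB of %d MB max%n",
                        usage.getUsed() >> 20, usage.getMax() >> 20);
            }
        }
    }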

Thanks,
Rajsekhar Mallick


Re: ReadStage filling up and leading to Read Timeouts

Posted by Jeff Jirsa <jj...@gmail.com>.
What you're potentially seeing is the GC impact of reading a large
partition - do you have GC logs or StatusLogger output indicating you're
pausing? What are the actual JVM flags you're using?

Given your heap size, the easiest mitigation may be significantly
increasing your key cache size (up to a gigabyte or two, if needed).

Yes, when you read data, it's materialized in memory (iterators from each
sstable are merged and sent to the client), so reading lots of rows from a
wide partition can cause GC pressure just from materializing the responses.
Dropping your application's paging size could help if this is the problem.

You may be able to drop concurrent reads from 64 to something lower
(potentially 48 or 32, given your core count) to mitigate GC impact from
lots of objects when you have a lot of concurrent reads, or consider
upgrading to 3.11.4 (when it's out) to take advantage of CASSANDRA-11206
(which made reading wide partitions less expensive). STCS especially won't
help here - a large partition may be larger than you think, if it's
spanning a lot of sstables.



