Posted to user@cassandra.apache.org by Karthick V <ka...@zohocorp.com> on 2017/07/03 14:51:35 UTC

Node failure Due To Very high GC pause time

Hi,

      Recently in my test cluster I faced an outrageous GC activity spike that made the node unreachable inside the cluster itself.

Scenario:

      In a partition of 5 million rows we read the first 500 (by giving the starting range) and then delete the same 500. The same is done recursively, changing only the start range. Initially I didn't see any difference in query performance (up to 50,000 rows), but later I observed a significant degradation; at about 3.3 million rows the read request failed and the node became unreachable. After analysing my GC logs it is clear that 99% of my old-generation space is occupied, and with no more room for allocation the machine stalled.

       My doubt here is: will all 3.3 million deleted rows be loaded into my on-heap memory? If not, what objects are occupying that memory?
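
Concretely, the access pattern presumably looked something like this (a sketch only; the table and column names are taken from the schema Karthick posts later in the thread, and the exact statements and key values are illustrative):

-- Read one window of 500 rows, starting just past the previous window...
SELECT emp_id, emp_details FROM EmployeeDetails
 WHERE branch_id = 'xxx' AND department_id = 'yyy' AND emp_id > 500
 LIMIT 500;

-- ...then delete those same 500 rows one key at a time (C* 2.1 has no
-- clustering-range delete), and repeat with the new start range.
DELETE FROM EmployeeDetails
 WHERE branch_id = 'xxx' AND department_id = 'yyy' AND emp_id = 501;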



PS: I am using C* 2.1.13 in the cluster.

RE: Node failure Due To Very high GC pause time

Posted by "Durity, Sean R" <SE...@homedepot.com>.
I like Bryan’s terminology of an “antagonistic use case.” If I am reading this correctly, you are putting 5 (or 10) million records in a partition and then trying to delete them in the same order they are stored. This is not a good data model for Cassandra; in fact, it is a dangerous one. That partition will reside completely on one node (and a number of replicas). Then you are forcing the reads to wade through all the tombstones to get to the undeleted records – all on the same nodes. This cannot scale to the scope you want.

For a distributed data store, you want the data distributed across your whole cluster. And you want to delete whole partitions, if at all possible (or at least keep the number of deletes within a partition reasonable).
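
As an illustration of that advice (a sketch only; this bucketed table and its names are hypothetical, not something proposed in the thread), folding a bucket into the partition key lets a whole batch of rows be removed with a single partition-level delete instead of thousands of row tombstones:

CREATE TABLE EmployeeDetailsByBucket (
    branch_id text,
    department_id text,
    bucket int,          -- e.g. emp_id / 10000, which also bounds partition size
    emp_id bigint,
    emp_details text,
    PRIMARY KEY ((branch_id, department_id, bucket), emp_id)
) WITH CLUSTERING ORDER BY (emp_id ASC);

-- One partition-level tombstone instead of up to 10,000 row tombstones:
DELETE FROM EmployeeDetailsByBucket
 WHERE branch_id = 'xxx' AND department_id = 'yyy' AND bucket = 0;

Different buckets also hash to different replica sets, so the reads and deletes are no longer pinned to one node.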


Sean Durity

RE: Node failure Due To Very high GC pause time

Posted by "ZAIDI, ASAD A" <az...@att.com>.
>> My doubt here is: will all 3.3 million deleted rows be loaded into my on-heap memory? If not, what objects are occupying that memory?

          It depends on what data your queries are fetching from your database. Assuming you’re using the CMS garbage collector and you’ve enabled GC logs with PrintGCDetails, PrintClassHistogramBeforeFullGC and PrintClassHistogramAfterFullGC, your logs should tell you which Java classes occupy most of your heap memory.
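
For example, on C* 2.1 these can be set in cassandra-env.sh (a sketch; the log path is illustrative and flag availability depends on your JVM):

# Verbose GC logging plus a class histogram around every full GC.
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintClassHistogramBeforeFullGC"
JVM_OPTS="$JVM_OPTS -XX:+PrintClassHistogramAfterFullGC"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"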

The system.log file can also give you some clues, e.g. references to your tables with [tombstone] warnings. A quick [grep -i tombstone /path/to/system.log] would tell you which tables are suffering from tombstones!


Re: Node failure Due To Very high GC pause time

Posted by Karthick V <ka...@zohocorp.com>.
Hi Bryan,

            Thanks for your quick response. We have already tuned our memory and GC based on our hardware specification, and it was working fine until yesterday, i.e. before facing the delete workload specified below. As you suggested, we will look into our GC & memory configuration once again.

FYKI: We are using memtable_allocation_type as offheap_objects.

Consider the following table:

CREATE TABLE EmployeeDetails (
    branch_id text,
    department_id text,
    emp_id bigint,
    emp_details text,
    PRIMARY KEY (branch_id, department_id, emp_id)
) WITH CLUSTERING ORDER BY (department_id ASC, emp_id ASC);

    In this table I have 10 million records for a particular branch_id and department_id. The following are the operations I performed in C*, in chronological order (a sketch of what the delete batches could look like follows the list):

  1.  Deleting 5 million records, from the start, in batches of 500 records per request, for a particular branch_id (say 'xxx') and department_id (say 'yyy').
  2.  Reading the next 500 records as soon as the above delete operation completed: SELECT * FROM EmployeeDetails WHERE branch_id = 'xxx' AND department_id = 'yyy' AND emp_id > 50000000 LIMIT 500
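
Since C* 2.1 has no range deletes on clustering columns, each 500-row request presumably deleted rows one key at a time, e.g. in an unlogged batch (an illustrative sketch, not the thread's exact statements; the emp_id values are placeholders):

-- One request's worth of deletes. Every deleted row leaves a tombstone that
-- later reads must skip over until compaction eventually purges it.
BEGIN UNLOGGED BATCH
  DELETE FROM EmployeeDetails WHERE branch_id = 'xxx' AND department_id = 'yyy' AND emp_id = 1;
  DELETE FROM EmployeeDetails WHERE branch_id = 'xxx' AND department_id = 'yyy' AND emp_id = 2;
  -- ... one DELETE per emp_id in the 500-row window ...
APPLY BATCH;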



It was only after executing the above read request that there was a spike in memory, and within a few minutes the node was marked down.

So my question here is: will the above read request load all 5 million deleted records into memory before it starts fetching, or will it jump directly to record 50000001 (since we have specified the greater-than condition)? If it is the former, then the read request will keep the data in main memory and perform a merge operation before delivering it, as per this wiki ( https://wiki.apache.org/cassandra/ReadPathForUsers ). If not, let me know how the above read request provides the data.
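
One way to see what such a read actually touches (a sketch; cqlsh request tracing is available in C* 2.1) is to trace it and look at how many live and tombstone cells the replicas report reading:

TRACING ON;
SELECT * FROM EmployeeDetails
 WHERE branch_id = 'xxx' AND department_id = 'yyy' AND emp_id > 50000000
 LIMIT 500;
-- Look for trace rows like "Read 500 live and NNN tombstone cells":
-- a large NNN means the request is wading through deleted data.
TRACING OFF;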





Note: while analysing my heap dump it is clear that the majority of the memory is being held by Tombstone objects.





Thanks in advance 

-- karthick

Re: Node failure Due To Very high GC pause time

Posted by Bryan Cheng <br...@blockcypher.com>.
This is a very antagonistic use case for Cassandra :P I assume you're familiar with Cassandra and deletes? (e.g.
http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html,
http://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_about_deletes_c.html)
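
If deletes at this scale are unavoidable, the tombstone guardrails in cassandra.yaml are worth knowing about (a sketch; the values shown are, to the best of my knowledge, the C* 2.1 defaults):

# Reads that touch more tombstones than this are logged as warnings...
tombstone_warn_threshold: 1000
# ...and reads that touch more than this abort with a TombstoneOverwhelmingException.
tombstone_failure_threshold: 100000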

That being said, are you giving enough time for your tables to flush to disk? Deletes generate markers which can and will consume memory until they have a chance to be flushed, after which they will impact query time and performance (but should relieve memory pressure). If you're saturating the capability of your nodes, your tables will have difficulty flushing. See
http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_memtable_thruput_c.html.
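
The knobs that page describes live in cassandra.yaml; a sketch with the 2.1-era option names (the sizes are illustrative, not recommendations):

# Space memtables may occupy before flushes are forced (defaults to 1/4 of the heap).
memtable_heap_space_in_mb: 2048
# Off-heap budget, relevant here given memtable_allocation_type: offheap_objects.
memtable_offheap_space_in_mb: 2048
# Dirty fraction of the above at which the largest memtable is flushed.
memtable_cleanup_threshold: 0.11
memtable_flush_writers: 2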

This could also be a heap/memory configuration or GC tuning issue (although that is unlikely if you've left those at default).

--Bryan

