Posted to user@cassandra.apache.org by Jai Bheemsen Rao Dhanwada <ja...@gmail.com> on 2017/09/20 16:49:24 UTC

Re: Debugging write timeouts on Cassandra 2.2.5

Hello Nike,

were you able to fix the issue? If so what change helped you?

On Wed, Feb 24, 2016 at 5:36 PM, Jack Krupansky <ja...@gmail.com>
wrote:

> Great that you found a specific release that triggers the problem - 2.1.x
> has a huge number of changes.
>
> How many partitions and rows do you have? What's the largest row count for
> a single partition? And all of these CQL tables are COMPACT STORAGE,
> correct? Are you writing a large number of skinny partitions or a smaller
> number of very wide partitions? It wouldn't surprise me if behavior for
> large partitions varies between releases since they can be so
> memory-intensive.
>
> I see this change in 2.1.5 that could possibly introduce some memory usage:
> Write partition size estimates into a system table (CASSANDRA-7688)
>
> At this stage it would probably help for you to try to produce a
> reasonably small repro test case that you could file as a Jira. And if you
> could run that repro test case on 3.x to verify that the problem still
> exists, that would be helpful as well.
>
> How long does it take to repro the timeout?
>
> Can you repro the timeout using a single node?
>
> What is the pattern of the timeouts - just random and occasional, or heavy
> and continuous once they start?
>
> Are they occurring uniformly on all three nodes?
>
> If you bounce the cluster and continue testing, do the timeouts commence
> immediately, fairly soon, or only after about as long as they take from a
> clean fresh start?
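>
> On the single-node question above: a minimal sketch of a throwaway
> one-node test cluster using ccm, assuming ccm is available (the cluster
> name and version are just placeholders):
>
> ccm create timeout-repro -v 2.1.5 -n 1 -s   # create and start one node
> ccm status                                  # confirm node1 is up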
>
>
>
> -- Jack Krupansky
>
> On Wed, Feb 24, 2016 at 7:04 PM, Mike Heffner <mi...@librato.com> wrote:
>
>> Nate,
>>
>> So we have run several install tests, bisecting the 2.1.x release line,
>> and we believe that the regression was introduced in version 2.1.5. This is
>> the first release that clearly hits the timeout for us.
>>
>> It looks like quite a large release, so our next step will likely be
>> bisecting the major commits to see if we can narrow it down:
>> https://github.com/apache/cassandra/blob/3c0a337ebc90b0d99349d0aa152c92b5b3494d8c/CHANGES.txt.
>> Obviously, any suggestions on potential suspects are appreciated.
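>>
>> A rough sketch of that bisect (tag names follow the upstream
>> cassandra-x.y.z convention; the build, deploy, and test steps are
>> elided):
>>
>> git clone https://github.com/apache/cassandra.git && cd cassandra
>> git bisect start cassandra-2.1.5 cassandra-2.1.4
>> # at each step: build, roll out to the test ring, run the write workload,
>> # then mark the commit before continuing
>> git bisect good    # or 'git bisect bad' if the timeouts reproduce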
>>
>> These are the memtable settings we've changed from the defaults during
>> our testing:
>>
>> memtable_allocation_type: offheap_objects
>> memtable_flush_writers: 8
>>
>>
>> Cheers,
>>
>> Mike
>>
>> On Fri, Feb 19, 2016 at 1:46 PM, Nate McCall <na...@thelastpickle.com>
>> wrote:
>>
>>> The biggest change which *might* explain your behavior has to do with
>>> the changes in memtable flushing between 2.0 and 2.1:
>>> https://issues.apache.org/jira/browse/CASSANDRA-5549
>>>
>>> However, the tpstats you posted show no dropped mutations; dropped
>>> mutations would have made me more certain of this as the cause.
>>>
>>> What values do you have right now for each of these (my recommendations
>>> for each on a c4.2xl with stock cassandra-env.sh are in parentheses):
>>>
>>> - memtable_flush_writers (2)
>>> - memtable_heap_space_in_mb  (2048)
>>> - memtable_offheap_space_in_mb (2048)
>>> - memtable_cleanup_threshold (0.11)
>>> - memtable_allocation_type (offheap_objects)
>>>
>>> The biggest win IMO will be moving to offheap_objects. By default,
>>> everything is on heap. Regardless, spending some time tuning these for your
>>> workload will pay off.
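>>>
>>> As a concrete starting point, the list above would look roughly like
>>> this in cassandra.yaml (the values are the c4.2xl suggestions from the
>>> list, not universal defaults):
>>>
>>> memtable_flush_writers: 2
>>> memtable_heap_space_in_mb: 2048
>>> memtable_offheap_space_in_mb: 2048
>>> memtable_cleanup_threshold: 0.11
>>> memtable_allocation_type: offheap_objects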
>>>
>>> You may also want to be explicit about:
>>>
>>> - native_transport_max_concurrent_connections
>>> - native_transport_max_concurrent_connections_per_ip
>>>
>>> Depending on the driver, these may now allow 32k streams per
>>> connection(!), as detailed in v3 of the native protocol:
>>> https://github.com/apache/cassandra/blob/cassandra-2.1/doc/native_protocol_v3.spec#L130-L152
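>>>
>>> Setting them explicitly in cassandra.yaml might look like this (the
>>> caps shown are purely illustrative, not recommendations):
>>>
>>> native_transport_max_concurrent_connections: 128
>>> native_transport_max_concurrent_connections_per_ip: 32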
>>>
>>>
>>>
>>> On Fri, Feb 19, 2016 at 8:48 AM, Mike Heffner <mi...@librato.com> wrote:
>>>
>>>> Anuj,
>>>>
>>>> So we originally started testing with Java 8 + G1; however, we were
>>>> able to reproduce the same results with the default CMS settings that
>>>> ship in the cassandra-env.sh from the Deb pkg. We didn't detect any
>>>> large GC pauses during the runs.
>>>>
>>>> The query pattern during our testing was 100% writes, batching (mostly
>>>> via Thrift) to 5 tables, with 6 to 1500 rows per batch.
>>>>
>>>> Mike
>>>>
>>>> On Thu, Feb 18, 2016 at 12:22 PM, Anuj Wadehra <an...@yahoo.co.in>
>>>> wrote:
>>>>
>>>>> What's the GC overhead? Can you share your GC collector and settings?
>>>>>
>>>>>
>>>>> What's your query pattern? Do you use secondary indexes, batches, IN
>>>>> clauses, etc.?
>>>>>
>>>>>
>>>>> Anuj
>>>>>
>>>>>
>>>>>
>>>>> On Thu, 18 Feb, 2016 at 8:45 pm, Mike Heffner
>>>>> <mi...@librato.com> wrote:
>>>>> Alain,
>>>>>
>>>>> Thanks for the suggestions.
>>>>>
>>>>> Sure, tpstats are here:
>>>>> https://gist.github.com/mheffner/a979ae1a0304480b052a. Looking at the
>>>>> metrics across the ring, there were no blocked tasks or dropped
>>>>> messages.
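>>>>>
>>>>> A rough way to spot-check that on each node (the host names below are
>>>>> placeholders):
>>>>>
>>>>> for h in cass-node1 cass-node2 cass-node3; do
>>>>>   echo "== $h =="; ssh "$h" nodetool tpstats | grep -iE 'blocked|dropped'
>>>>> done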
>>>>>
>>>>> Iowait metrics look fine, so it doesn't appear to be blocking on disk.
>>>>> Similarly, there are no long GC pauses.
>>>>>
>>>>> We haven't noticed latency on any particular table that is higher than
>>>>> the others or that correlates with the occurrence of a timeout. We have
>>>>> noticed with further testing that running cassandra-stress against the
>>>>> ring, while our workload is writing to the same ring, will incur
>>>>> similar 10 second timeouts. If our workload is not writing to the ring,
>>>>> cassandra-stress runs without hitting timeouts. This seems to imply
>>>>> that our workload pattern is causing something to block cluster-wide,
>>>>> since the stress tool writes to a different keyspace than our workload.
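>>>>>
>>>>> A sketch of that kind of stress run (the row count, thread count, and
>>>>> node address here are placeholders, not the exact parameters we used):
>>>>>
>>>>> cassandra-stress write n=1000000 cl=QUORUM -rate threads=50 -node 10.0.0.10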
>>>>>
>>>>> I mentioned in another reply that we've tracked it to something
>>>>> between 2.0.x and 2.1.x, so we are focusing on narrowing which point
>>>>> release it was introduced in.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Mike
>>>>>
>>>>> On Thu, Feb 18, 2016 at 3:33 AM, Alain RODRIGUEZ <ar...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Mike,
>>>>>>
>>>>>> What about the output of tpstats? I imagine you have dropped messages
>>>>>> there. Any blocked threads? Could you paste this output here?
>>>>>>
>>>>>> Could this be due to some network hiccup in accessing the disks, as
>>>>>> they are EBS? Can you think of any way of checking this? Do you have a
>>>>>> lot of GC logs, and how long are the pauses (use something like:
>>>>>> grep -i 'GCInspector' /var/log/cassandra/system.log)?
>>>>>>
>>>>>> Something else you could check is the local_writes stats, to see
>>>>>> whether only one table is affected or this is keyspace / cluster wide.
>>>>>> You can use the metrics exposed by Cassandra, or if you have no
>>>>>> dashboards I believe 'nodetool cfstats <myks> | grep -e 'Table:' -e
>>>>>> 'Local'' should give you a rough idea of local latencies.
>>>>>>
>>>>>> Those are just things I would check; I don't have a clue what is
>>>>>> happening here, but I hope this helps.
>>>>>>
>>>>>> C*heers,
>>>>>> -----------------
>>>>>> Alain Rodriguez
>>>>>> France
>>>>>>
>>>>>> The Last Pickle
>>>>>> http://www.thelastpickle.com
>>>>>>
>>>>>> 2016-02-18 5:13 GMT+01:00 Mike Heffner <mi...@librato.com>:
>>>>>>
>>>>>>> Jaydeep,
>>>>>>>
>>>>>>> No, we don't use any light weight transactions.
>>>>>>>
>>>>>>> Mike
>>>>>>>
>>>>>>> On Wed, Feb 17, 2016 at 6:44 PM, Jaydeep Chovatia <
>>>>>>> chovatia.jaydeep@gmail.com> wrote:
>>>>>>>
>>>>>>>> Are you guys using light weight transactions in your write path?
>>>>>>>>
>>>>>>>> On Thu, Feb 11, 2016 at 12:36 AM, Fabrice Facorat <
>>>>>>>> fabrice.facorat@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Are your commitlog and data on the same disk? If yes, you should
>>>>>>>>> put the commitlog on a separate disk which doesn't have a lot of IO.
>>>>>>>>>
>>>>>>>>> Other IO can have a great impact on your commitlog writes and may
>>>>>>>>> even block them.
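>>>>>>>>>
>>>>>>>>> In cassandra.yaml that separation would look roughly like this (the
>>>>>>>>> mount points are examples only):
>>>>>>>>>
>>>>>>>>> commitlog_directory: /mnt/commitlog/cassandra
>>>>>>>>> data_file_directories:
>>>>>>>>>     - /mnt/data/cassandra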
>>>>>>>>>
>>>>>>>>> An example of the impact IO can have, even for async writes:
>>>>>>>>> https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic
>>>>>>>>>
>>>>>>>>> 2016-02-11 0:31 GMT+01:00 Mike Heffner <mi...@librato.com>:
>>>>>>>>> > Jeff,
>>>>>>>>> >
>>>>>>>>> > We have both commitlog and data on a 4TB EBS with 10k IOPS.
>>>>>>>>> >
>>>>>>>>> > Mike
>>>>>>>>> >
>>>>>>>>> > On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa <
>>>>>>>>> jeff.jirsa@crowdstrike.com>
>>>>>>>>> > wrote:
>>>>>>>>> >>
>>>>>>>>> >> What disk size are you using?
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >> From: Mike Heffner
>>>>>>>>> >> Reply-To: "user@cassandra.apache.org"
>>>>>>>>> >> Date: Wednesday, February 10, 2016 at 2:24 PM
>>>>>>>>> >> To: "user@cassandra.apache.org"
>>>>>>>>> >> Cc: Peter Norton
>>>>>>>>> >> Subject: Re: Debugging write timeouts on Cassandra 2.2.5
>>>>>>>>> >>
>>>>>>>>> >> Paulo,
>>>>>>>>> >>
>>>>>>>>> >> Thanks for the suggestion; we ran some tests with CMS and saw the
>>>>>>>>> >> same timeouts. On that note, though, we are going to try doubling
>>>>>>>>> >> the instance sizes and testing with double the heap (even though
>>>>>>>>> >> current usage is low).
>>>>>>>>> >>
>>>>>>>>> >> Mike
>>>>>>>>> >>
>>>>>>>>> >> On Wed, Feb 10, 2016 at 3:40 PM, Paulo Motta <
>>>>>>>>> pauloricardomg@gmail.com>
>>>>>>>>> >> wrote:
>>>>>>>>> >>>
>>>>>>>>> >>> Are you using the same GC settings as the staging 2.0 cluster?
>>>>>>>>> >>> If not, could you try using the default GC settings (CMS) and
>>>>>>>>> >>> see if that changes anything? This is just a wild guess, but
>>>>>>>>> >>> there were reports before of G1-caused instabilities with small
>>>>>>>>> >>> heap sizes (< 16GB - see CASSANDRA-10403 for more context).
>>>>>>>>> >>> Please ignore if you already tried reverting back to CMS.
>>>>>>>>> >>>
>>>>>>>>> >>> 2016-02-10 16:51 GMT-03:00 Mike Heffner <mi...@librato.com>:
>>>>>>>>> >>>>
>>>>>>>>> >>>> Hi all,
>>>>>>>>> >>>>
>>>>>>>>> >>>> We've recently embarked on a project to update our Cassandra
>>>>>>>>> >>>> infrastructure running on EC2. We are long-time users of 2.0.x
>>>>>>>>> >>>> and are testing out a move to version 2.2.5 running on VPC with
>>>>>>>>> >>>> EBS. Our test setup is a 3 node, RF=3 cluster supporting a
>>>>>>>>> >>>> small write load (a mirror of our staging load).
>>>>>>>>> >>>>
>>>>>>>>> >>>> We are writing at QUORUM, and while p95s look good compared to
>>>>>>>>> >>>> our staging 2.0.x cluster, we are seeing frequent write
>>>>>>>>> >>>> operations that time out at the max write_request_timeout_in_ms
>>>>>>>>> >>>> (10 seconds). CPU across the cluster is < 10% and EBS write
>>>>>>>>> >>>> load is < 100 IOPS. Cassandra is running with the Oracle JDK
>>>>>>>>> >>>> 8u60, we're using G1GC, and any GC pauses are less than 500ms.
>>>>>>>>> >>>>
>>>>>>>>> >>>> We run on c4.2xl instances with GP2 EBS attached storage for
>>>>>>>>> >>>> the data and commitlog directories. The nodes are using EC2
>>>>>>>>> >>>> enhanced networking and have the latest Intel network driver
>>>>>>>>> >>>> module. We are running on HVM instances using Ubuntu 14.04.2.
>>>>>>>>> >>>>
>>>>>>>>> >>>> Our schema is 5 tables, all with COMPACT STORAGE. Each table is
>>>>>>>>> >>>> similar to the definition here:
>>>>>>>>> >>>> https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a
>>>>>>>>> >>>>
>>>>>>>>> >>>> This is our cassandra.yaml:
>>>>>>>>> >>>> https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml
>>>>>>>>> >>>>
>>>>>>>>> >>>> As I mentioned, we use 8u60 with G1GC and have used many of the
>>>>>>>>> >>>> GC settings in Al Tobey's tuning guide. This is our upstart
>>>>>>>>> >>>> config with JVM and other CPU settings:
>>>>>>>>> >>>> https://gist.github.com/mheffner/dc44613620b25c4fa46d
>>>>>>>>> >>>>
>>>>>>>>> >>>> We've used several of the sysctl settings from Al's guide as
>>>>>>>>> >>>> well: https://gist.github.com/mheffner/ea40d58f58a517028152
>>>>>>>>> >>>>
>>>>>>>>> >>>> Our client application is able to write using either Thrift
>>>>>>>>> >>>> batches via the Astyanax driver or async CQL INSERTs via the
>>>>>>>>> >>>> DataStax Java driver.
>>>>>>>>> >>>>
>>>>>>>>> >>>> For testing against Thrift (our legacy infra uses this) we
>>>>>>>>> >>>> write batches of anywhere from 6 to 1500 rows at a time. Our
>>>>>>>>> >>>> p99 for batch execution is around 45ms, and our maximum (p100)
>>>>>>>>> >>>> sits below 150ms except when it periodically spikes to the full
>>>>>>>>> >>>> 10 seconds.
>>>>>>>>> >>>>
>>>>>>>>> >>>> Testing the same write path using CQL writes instead
>>>>>>>>> >>>> demonstrates similar behavior: low p99s except for periodic
>>>>>>>>> >>>> full timeouts. We enabled tracing for several operations but
>>>>>>>>> >>>> were unable to get a trace that completed successfully --
>>>>>>>>> >>>> Cassandra started logging many messages like:
>>>>>>>>> >>>>
>>>>>>>>> >>>> INFO  [ScheduledTasks:1] - MessagingService.java:946 - _TRACE
>>>>>>>>> >>>> messages were dropped in last 5000 ms: 52499 for internal
>>>>>>>>> >>>> timeout and 0 for cross node timeout
>>>>>>>>> >>>>
>>>>>>>>> >>>> And all the traces contained rows with a null source_elapsed
>>>>>>>>> >>>> value:
>>>>>>>>> >>>> https://gist.githubusercontent.com/mheffner/1d68a70449bd6688a010/raw/0327d7d3d94c3a93af02b64212e3b7e7d8f2911b/trace.out
>>>>>>>>> >>>>
>>>>>>>>> >>>>
>>>>>>>>> >>>> We've exhausted as many configuration option permutations as
>>>>>>>>> >>>> we can think of. This cluster does not appear to be under any
>>>>>>>>> >>>> significant load, and latencies seem to largely fall in two
>>>>>>>>> >>>> bands: low normal or max timeout. This seems to imply that
>>>>>>>>> >>>> something is getting stuck and timing out at the max write
>>>>>>>>> >>>> timeout.
>>>>>>>>> >>>>
>>>>>>>>> >>>> Any suggestions on what to look for? We had debug enabled for a
>>>>>>>>> >>>> while but we didn't see any message that pointed to something
>>>>>>>>> >>>> obvious. Happy to provide any more information that may help.
>>>>>>>>> >>>>
>>>>>>>>> >>>> We are pretty much at the point of sprinkling debug around the
>>>>>>>>> >>>> code to track down what could be blocking.
>>>>>>>>> >>>>
>>>>>>>>> >>>>
>>>>>>>>> >>>> Thanks,
>>>>>>>>> >>>>
>>>>>>>>> >>>> Mike
>>>>>>>>> >>>>
>>>>>>>>> >>>> --
>>>>>>>>> >>>>
>>>>>>>>> >>>>   Mike Heffner <mi...@librato.com>
>>>>>>>>> >>>>   Librato, Inc.
>>>>>>>>> >>>>
>>>>>>>>> >>>
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >> --
>>>>>>>>> >>
>>>>>>>>> >>   Mike Heffner <mi...@librato.com>
>>>>>>>>> >>   Librato, Inc.
>>>>>>>>> >>
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > --
>>>>>>>>> >
>>>>>>>>> >   Mike Heffner <mi...@librato.com>
>>>>>>>>> >   Librato, Inc.
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Close the World, Open the Net
>>>>>>>>> http://www.linux-wizard.net
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>>   Mike Heffner <mi...@librato.com>
>>>>>>>   Librato, Inc.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>>   Mike Heffner <mi...@librato.com>
>>>>>   Librato, Inc.
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>>   Mike Heffner <mi...@librato.com>
>>>>   Librato, Inc.
>>>>
>>>>
>>>
>>>
>>> --
>>> -----------------
>>> Nate McCall
>>> Austin, TX
>>> @zznate
>>>
>>> Co-Founder & Sr. Technical Consultant
>>> Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>>
>>
>>
>>
>> --
>>
>>   Mike Heffner <mi...@librato.com>
>>   Librato, Inc.
>>
>>
>

Re: Debugging write timeouts on Cassandra 2.2.5

Posted by Jai Bheemsen Rao Dhanwada <ja...@gmail.com>.
Apologies for the typo, Mike

On Wed, Sep 20, 2017 at 9:49 AM, Jai Bheemsen Rao Dhanwada <
jaibheemsen@gmail.com> wrote:

> Hello Nike,
>
> were you able to fix the issue? If so what change helped you?