Posted to user@cassandra.apache.org by John Sanda <jo...@gmail.com> on 2016/09/20 22:30:02 UTC

Client-side timeouts after dropping table

I am deploying multiple Java web apps that connect to a Cassandra 3.7
instance. Each app creates its own schema at startup. One of the schema
changes involves dropping a table. I am seeing frequent client-side
timeouts reported by the DataStax driver after the DROP TABLE statement is
executed. I don't see this behavior in all environments. I do see it
consistently in a QA environment in which Cassandra is running in Docker
with network storage, so writes are pretty slow from the get-go. In my logs
I see a lot of tables getting flushed, which I guess are all of the dirty
column families in the respective commit log segment. Then I see a whole
bunch of flushes getting queued up. Can I reach a point where so many
table flushes get queued that writes would be blocked?

-- 

- John

Re: Client-side timeouts after dropping table

Posted by John Sanda <jo...@gmail.com>.
I was able to get metrics, but nothing stands out. When the applications
start up and a table is dropped, shortly thereafter on a subsequent write I
get a NoHostAvailableException that is caused by an
OperationTimedOutException. I am not 100% certain which write the timeout
occurs on because there are multiple apps running, but it does happen
fairly consistently almost immediately after the table is dropped. I don't
see any indication of a server-side timeout or any dropped mutations being
reported in the log.
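
Since OperationTimedOutException is raised purely client side when a host
does not answer within the driver's read timeout, one stopgap while the
underlying I/O problem gets sorted out is to raise that timeout. A minimal
sketch, assuming the 3.x DataStax Java driver (the contact point is a
placeholder; the 3.x default read timeout is 12000 ms):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.SocketOptions;

    public class ClusterFactory {
        public static Cluster build() {
            // Give slow schema changes / flush storms more headroom before
            // the driver gives up on a request (default is 12000 ms).
            return Cluster.builder()
                .addContactPoint("127.0.0.1")  // placeholder contact point
                .withSocketOptions(
                    new SocketOptions().setReadTimeoutMillis(30000))
                .build();
        }
    }

This only hides the latency, of course; it doesn't fix the slow storage.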

On Tue, Sep 20, 2016 at 11:07 PM, John Sanda <jo...@gmail.com> wrote:

> Thanks Nate. We do not have monitoring set up yet, but I should be able to
> get the deployment updated with a metrics reporter. I'll update the thread
> with my findings.
>
> On Tue, Sep 20, 2016 at 10:30 PM, Nate McCall <na...@thelastpickle.com>
> wrote:
>
>> If you can get to them in the test env, you want to look in
>> o.a.c.metrics.CommitLog for:
>> - TotalCommitLogSize: if this hovers near commitlog_total_space_in_mb and
>> never goes down, you are thrashing on segment allocation
>> - WaitingOnCommit: this is the time spent waiting on calls to sync and
>> will start to climb real fast if you can't sync within sync_interval
>> - WaitingOnSegmentAllocation: how long it took to allocate a new
>> commitlog segment; if it is all over the place, you are IO bound
>>
>> Try turning all the commit log settings way down for low-IO test
>> infrastructure like this. Maybe a total commit log size of 32MB with 4MB
>> segments (or even lower depending on test data volume) so they basically
>> flush constantly and don't try to hold any tables open. Also lower
>> concurrent_writes substantially while you are at it to add some write
>> throttling.
>>
>> On Wed, Sep 21, 2016 at 2:14 PM, John Sanda <jo...@gmail.com> wrote:
>>
>>> I have seen in various threads on the list that 3.0.x is probably best
>>> for prod. Just wondering though if there is anything in particular in 3.7
>>> to be wary of.
>>>
>>> I need to check with one of our QA engineers to get specifics on the
>>> storage. Here is what I do know. We have a blade center running lots of
>>> virtual machines for various testing. Some of those VMs are running
>>> Cassandra and the Java web apps I previously mentioned via Docker
>>> containers. The storage is shared. Beyond that I don't have any more
>>> specific details at the moment. I can also tell you that the storage can be
>>> quite slow.
>>>
>>> I have come across different threads that talk to one degree or another
>>> about the flush queue getting full. I have been looking at the code in
>>> ColumnFamilyStore.java. Is perDiskFlushExecutors the thread pool I should
>>> be interested in? It uses an unbounded queue, so I am not really sure what
>>> it means for it to get full. Is there anything I can check or look for to
>>> see if writes are getting blocked?
>>>
>>> On Tue, Sep 20, 2016 at 8:41 PM, Jonathan Haddad <jo...@jonhaddad.com>
>>> wrote:
>>>
>>>> If you haven't yet deployed to prod I strongly recommend *not* using
>>>> 3.7.
>>>>
>>>> What network storage are you using?  Outside of a handful of highly
>>>> experienced experts using EBS in very specific ways, it usually ends in
>>>> failure.
>>>>
>>>> On Tue, Sep 20, 2016 at 3:30 PM John Sanda <jo...@gmail.com>
>>>> wrote:
>>>>
>>>>> I am deploying multiple Java web apps that connect to a Cassandra 3.7
>>>>> instance. Each app creates its own schema at startup. One of the schema
>>>>> changes involves dropping a table. I am seeing frequent client-side
>>>>> timeouts reported by the DataStax driver after the DROP TABLE statement is
>>>>> executed. I don't see this behavior in all environments. I do see it
>>>>> consistently in a QA environment in which Cassandra is running in Docker
>>>>> with network storage, so writes are pretty slow from the get-go. In my logs
>>>>> I see a lot of tables getting flushed, which I guess are all of the dirty
>>>>> column families in the respective commit log segment. Then I see a whole
>>>>> bunch of flushes getting queued up. Can I reach a point where so many
>>>>> table flushes get queued that writes would be blocked?
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> - John
>>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> - John
>>>
>>
>>
>>
>> --
>> -----------------
>> Nate McCall
>> Wellington, NZ
>> @zznate
>>
>> CTO
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>
>
>
> --
>
> - John
>



-- 

- John

Re: Client-side timeouts after dropping table

Posted by John Sanda <jo...@gmail.com>.
Thanks Nate. We do not have monitoring set up yet, but I should be able to
get the deployment updated with a metrics reporter. I'll update the thread
with my findings.

On Tue, Sep 20, 2016 at 10:30 PM, Nate McCall <na...@thelastpickle.com>
wrote:

> If you can get to them in the test env, you want to look in
> o.a.c.metrics.CommitLog for:
> - TotalCommitLogSize: if this hovers near commitlog_total_space_in_mb and
> never goes down, you are thrashing on segment allocation
> - WaitingOnCommit: this is the time spent waiting on calls to sync and
> will start to climb real fast if you can't sync within sync_interval
> - WaitingOnSegmentAllocation: how long it took to allocate a new commitlog
> segment; if it is all over the place, you are IO bound
>
> Try turning all the commit log settings way down for low-IO test
> infrastructure like this. Maybe a total commit log size of 32MB with 4MB
> segments (or even lower depending on test data volume) so they basically
> flush constantly and don't try to hold any tables open. Also lower
> concurrent_writes substantially while you are at it to add some write
> throttling.
>
> On Wed, Sep 21, 2016 at 2:14 PM, John Sanda <jo...@gmail.com> wrote:
>
>> I have seen in various threads on the list that 3.0.x is probably best
>> for prod. Just wondering though if there is anything in particular in 3.7
>> to be wary of.
>>
>> I need to check with one of our QA engineers to get specifics on the
>> storage. Here is what I do know. We have a blade center running lots of
>> virtual machines for various testing. Some of those VMs are running
>> Cassandra and the Java web apps I previously mentioned via Docker
>> containers. The storage is shared. Beyond that I don't have any more
>> specific details at the moment. I can also tell you that the storage can be
>> quite slow.
>>
>> I have come across different threads that talk to one degree or another
>> about the flush queue getting full. I have been looking at the code in
>> ColumnFamilyStore.java. Is perDiskFlushExecutors the thread pool I should
>> be interested in? It uses an unbounded queue, so I am not really sure what
>> it means for it to get full. Is there anything I can check or look for to
>> see if writes are getting blocked?
>>
>> On Tue, Sep 20, 2016 at 8:41 PM, Jonathan Haddad <jo...@jonhaddad.com>
>> wrote:
>>
>>> If you haven't yet deployed to prod I strongly recommend *not* using
>>> 3.7.
>>>
>>> What network storage are you using?  Outside of a handful of highly
>>> experienced experts using EBS in very specific ways, it usually ends in
>>> failure.
>>>
>>> On Tue, Sep 20, 2016 at 3:30 PM John Sanda <jo...@gmail.com> wrote:
>>>
>>>> I am deploying multiple Java web apps that connect to a Cassandra 3.7
>>>> instance. Each app creates its own schema at startup. One of the schema
>>>> changes involves dropping a table. I am seeing frequent client-side
>>>> timeouts reported by the DataStax driver after the DROP TABLE statement is
>>>> executed. I don't see this behavior in all environments. I do see it
>>>> consistently in a QA environment in which Cassandra is running in Docker
>>>> with network storage, so writes are pretty slow from the get-go. In my logs
>>>> I see a lot of tables getting flushed, which I guess are all of the dirty
>>>> column families in the respective commit log segment. Then I see a whole
>>>> bunch of flushes getting queued up. Can I reach a point where so many
>>>> table flushes get queued that writes would be blocked?
>>>>
>>>>
>>>> --
>>>>
>>>> - John
>>>>
>>>
>>
>>
>> --
>>
>> - John
>>
>
>
>
> --
> -----------------
> Nate McCall
> Wellington, NZ
> @zznate
>
> CTO
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>



-- 

- John

Re: Client-side timeouts after dropping table

Posted by Nate McCall <na...@thelastpickle.com>.
If you can get to them in the test env, you want to look in
o.a.c.metrics.CommitLog for:
- TotalCommitLogSize: if this hovers near commitlog_total_space_in_mb and
never goes down, you are thrashing on segment allocation
- WaitingOnCommit: this is the time spent waiting on calls to sync and will
start to climb real fast if you can't sync within sync_interval
- WaitingOnSegmentAllocation: how long it took to allocate a new commitlog
segment; if it is all over the place, you are IO bound
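
If you don't have a reporter wired up yet, you can also read these directly
over JMX. A rough sketch, assuming the default JMX port (7199) and the
standard Codahale-style attribute names Cassandra exposes (TotalCommitLogSize
is a gauge, so its value lives in the "Value" attribute; the timers expose
"Mean" among others):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class CommitLogMetrics {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
            try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
                // Gauge: current on-disk size of all commit log segments.
                ObjectName size = new ObjectName(
                    "org.apache.cassandra.metrics:type=CommitLog,name=TotalCommitLogSize");
                System.out.println("TotalCommitLogSize: "
                    + mbs.getAttribute(size, "Value"));
                // Timer: time spent waiting on new segment allocation;
                // erratic values suggest you are IO bound.
                ObjectName alloc = new ObjectName(
                    "org.apache.cassandra.metrics:type=CommitLog,name=WaitingOnSegmentAllocation");
                System.out.println("WaitingOnSegmentAllocation mean: "
                    + mbs.getAttribute(alloc, "Mean"));
            }
        }
    }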

Try turning all the commit log settings way down for low-IO test
infrastructure like this. Maybe a total commit log size of 32MB with 4MB
segments (or even lower depending on test data volume) so they basically
flush constantly and don't try to hold any tables open. Also lower
concurrent_writes substantially while you are at it to add some write
throttling.
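
Concretely, something like the following in cassandra.yaml. Treat the values
as a starting point for a slow-disk test environment, not a recommendation;
how far down to go depends on your data volume:

    # default is the smaller of 8192 and 1/4 of the commitlog volume
    commitlog_total_space_in_mb: 32
    # default is 32
    commitlog_segment_size_in_mb: 4
    # default is 32; lowering it adds some write throttling
    concurrent_writes: 8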

On Wed, Sep 21, 2016 at 2:14 PM, John Sanda <jo...@gmail.com> wrote:

> I have seen in various threads on the list that 3.0.x is probably best for
> prod. Just wondering though if there is anything in particular in 3.7 to be
> wary of.
>
> I need to check with one of our QA engineers to get specifics on the
> storage. Here is what I do know. We have a blade center running lots of
> virtual machines for various testing. Some of those VMs are running
> Cassandra and the Java web apps I previously mentioned via Docker
> containers. The storage is shared. Beyond that I don't have any more
> specific details at the moment. I can also tell you that the storage can be
> quite slow.
>
> I have come across different threads that talk to one degree or another
> about the flush queue getting full. I have been looking at the code in
> ColumnFamilyStore.java. Is perDiskFlushExecutors the thread pool I should
> be interested in? It uses an unbounded queue, so I am not really sure what
> it means for it to get full. Is there anything I can check or look for to
> see if writes are getting blocked?
>
> On Tue, Sep 20, 2016 at 8:41 PM, Jonathan Haddad <jo...@jonhaddad.com>
> wrote:
>
>> If you haven't yet deployed to prod I strongly recommend *not* using 3.7.
>>
>>
>> What network storage are you using?  Outside of a handful of highly
>> experienced experts using EBS in very specific ways, it usually ends in
>> failure.
>>
>> On Tue, Sep 20, 2016 at 3:30 PM John Sanda <jo...@gmail.com> wrote:
>>
>>> I am deploying multiple Java web apps that connect to a Cassandra 3.7
>>> instance. Each app creates its own schema at startup. One of the schema
>>> changes involves dropping a table. I am seeing frequent client-side
>>> timeouts reported by the DataStax driver after the DROP TABLE statement is
>>> executed. I don't see this behavior in all environments. I do see it
>>> consistently in a QA environment in which Cassandra is running in Docker
>>> with network storage, so writes are pretty slow from the get-go. In my logs
>>> I see a lot of tables getting flushed, which I guess are all of the dirty
>>> column families in the respective commit log segment. Then I see a whole
>>> bunch of flushes getting queued up. Can I reach a point where so many
>>> table flushes get queued that writes would be blocked?
>>>
>>>
>>> --
>>>
>>> - John
>>>
>>
>
>
> --
>
> - John
>



-- 
-----------------
Nate McCall
Wellington, NZ
@zznate

CTO
Apache Cassandra Consulting
http://www.thelastpickle.com

Re: Client-side timeouts after dropping table

Posted by John Sanda <jo...@gmail.com>.
I have seen in various threads on the list that 3.0.x is probably best for
prod. Just wondering though if there is anything in particular in 3.7 to be
wary of.

I need to check with one of our QA engineers to get specifics on the
storage. Here is what I do know. We have a blade center running lots of
virtual machines for various testing. Some of those VMs are running
Cassandra and the Java web apps I previously mentioned via Docker
containers. The storage is shared. Beyond that I don't have any more
specific details at the moment. I can also tell you that the storage can be
quite slow.

I have come across different threads that talk to one degree or another
about the flush queue getting full. I have been looking at the code in
ColumnFamilyStore.java. Is perDiskFlushExecutors the thread pool I should
be interested in? It uses an unbounded queue, so I am not really sure what
it means for it to get full. Is there anything I can check or look for to
see if writes are getting blocked?
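
In the meantime, one thing I can check is nodetool tpstats, which reports
pending and blocked tasks per thread pool. If writes really are backing up
behind flushes, I would expect the flush-related pools to show it
(illustrative output below, not from my cluster):

    $ nodetool tpstats
    Pool Name            Active  Pending  Completed  Blocked  All time blocked
    MutationStage             4      120     981245        0                 0
    MemtableFlushWriter       2       37      10342        1                12
    MemtablePostFlush         1       35      20561        0                 4

A non-zero Blocked / "All time blocked" count on the flush pools would line
up with the behavior I am seeing.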

On Tue, Sep 20, 2016 at 8:41 PM, Jonathan Haddad <jo...@jonhaddad.com> wrote:

> If you haven't yet deployed to prod I strongly recommend *not* using 3.7.
>
> What network storage are you using?  Outside of a handful of highly
> experienced experts using EBS in very specific ways, it usually ends in
> failure.
>
> On Tue, Sep 20, 2016 at 3:30 PM John Sanda <jo...@gmail.com> wrote:
>
>> I am deploying multiple Java web apps that connect to a Cassandra 3.7
>> instance. Each app creates its own schema at startup. One of the schema
>> changes involves dropping a table. I am seeing frequent client-side
>> timeouts reported by the DataStax driver after the DROP TABLE statement is
>> executed. I don't see this behavior in all environments. I do see it
>> consistently in a QA environment in which Cassandra is running in Docker
>> with network storage, so writes are pretty slow from the get-go. In my logs
>> I see a lot of tables getting flushed, which I guess are all of the dirty
>> column families in the respective commit log segment. Then I see a whole
>> bunch of flushes getting queued up. Can I reach a point where so many
>> table flushes get queued that writes would be blocked?
>>
>>
>> --
>>
>> - John
>>
>


-- 

- John

Re: Client-side timeouts after dropping table

Posted by Jesse Hodges <ho...@gmail.com>.
Thanks, filing this under "things I wish I'd realized sooner" :)

On Tue, Sep 20, 2016 at 10:27 PM, Jonathan Haddad <jo...@jonhaddad.com> wrote:

> 3.7 falls under the Tick Tock release cycle, which is almost completely
> untested in production by experienced operators.  In the cases where it has
> been tested, there have been numerous bugs found which I (and I think most
> people on this list) consider to be show stoppers.  Additionally, the Tick
> Tock release cycle puts the operator in the uncomfortable position of
> having to decide between upgrading to a new version with new features
> (probably new bugs) or backporting bug fixes from future versions
> themselves. There will never be a 3.7.1 release which fixes bugs in 3.7
> without adding new features.
>
> https://github.com/apache/cassandra/blob/trunk/NEWS.txt
>
> For new projects I recommend starting with the recently released 3.0.9.
>
> Assuming the project changes its policy on releases (all signs point to
> yes), then by the time 4.0 rolls out a lot of the features which have been
> released in the 3.x series will have matured a bit, so it's very possible
> 4.0 will stabilize faster than the usual 6 months it takes for a major
> release.
>
> All that said, there's nothing wrong with doing compatibility & smoke
> tests against the latest 3.x release as well as 3.0 and reporting bugs back
> to the Apache Cassandra JIRA; I'm sure it would be greatly appreciated.
>
> https://issues.apache.org/jira/secure/Dashboard.jspa
>
> Jon
>
>
> On Tue, Sep 20, 2016 at 8:10 PM Jesse Hodges <ho...@gmail.com>
> wrote:
>
>> Can you elaborate on why not 3.7?
>>
>> On Tue, Sep 20, 2016 at 7:41 PM, Jonathan Haddad <jo...@jonhaddad.com>
>> wrote:
>>
>>> If you haven't yet deployed to prod I strongly recommend *not* using
>>> 3.7.
>>>
>>> What network storage are you using?  Outside of a handful of highly
>>> experienced experts using EBS in very specific ways, it usually ends in
>>> failure.
>>>
>>> On Tue, Sep 20, 2016 at 3:30 PM John Sanda <jo...@gmail.com> wrote:
>>>
>>>> I am deploying multiple Java web apps that connect to a Cassandra 3.7
>>>> instance. Each app creates its own schema at startup. One of the schema
>>>> changes involves dropping a table. I am seeing frequent client-side
>>>> timeouts reported by the DataStax driver after the DROP TABLE statement is
>>>> executed. I don't see this behavior in all environments. I do see it
>>>> consistently in a QA environment in which Cassandra is running in Docker
>>>> with network storage, so writes are pretty slow from the get-go. In my logs
>>>> I see a lot of tables getting flushed, which I guess are all of the dirty
>>>> column families in the respective commit log segment. Then I see a whole
>>>> bunch of flushes getting queued up. Can I reach a point where so many
>>>> table flushes get queued that writes would be blocked?
>>>>
>>>>
>>>> --
>>>>
>>>> - John
>>>>
>>>
>>

Re: Client-side timeouts after dropping table

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
3.7 falls under the Tick Tock release cycle, which is almost completely
untested in production by experienced operators.  In the cases where it has
been tested, there have been numerous bugs found which I (and I think most
people on this list) consider to be show stoppers.  Additionally, the Tick
Tock release cycle puts the operator in the uncomfortable position of
having to decide between upgrading to a new version with new features
(probably new bugs) or backporting bug fixes from future versions
themselves. There will never be a 3.7.1 release which fixes bugs in 3.7
without adding new features.

https://github.com/apache/cassandra/blob/trunk/NEWS.txt

For new projects I recommend starting with the recently released 3.0.9.

Assuming the project changes its policy on releases (all signs point to
yes), then by the time 4.0 rolls out a lot of the features which have been
released in the 3.x series will have matured a bit, so it's very possible
4.0 will stabilize faster than the usual 6 months it takes for a major
release.

All that said, there's nothing wrong with doing compatibility & smoke tests
against the latest 3.x release as well as 3.0 and reporting bugs back to
the Apache Cassandra JIRA; I'm sure it would be greatly appreciated.

https://issues.apache.org/jira/secure/Dashboard.jspa

Jon


On Tue, Sep 20, 2016 at 8:10 PM Jesse Hodges <ho...@gmail.com> wrote:

> Can you elaborate on why not 3.7?
>
> On Tue, Sep 20, 2016 at 7:41 PM, Jonathan Haddad <jo...@jonhaddad.com>
> wrote:
>
>> If you haven't yet deployed to prod I strongly recommend *not* using 3.7.
>>
>>
>> What network storage are you using?  Outside of a handful of highly
>> experienced experts using EBS in very specific ways, it usually ends in
>> failure.
>>
>> On Tue, Sep 20, 2016 at 3:30 PM John Sanda <jo...@gmail.com> wrote:
>>
>>> I am deploying multiple Java web apps that connect to a Cassandra 3.7
>>> instance. Each app creates its own schema at startup. One of the schema
>>> changes involves dropping a table. I am seeing frequent client-side
>>> timeouts reported by the DataStax driver after the DROP TABLE statement is
>>> executed. I don't see this behavior in all environments. I do see it
>>> consistently in a QA environment in which Cassandra is running in Docker
>>> with network storage, so writes are pretty slow from the get-go. In my logs
>>> I see a lot of tables getting flushed, which I guess are all of the dirty
>>> column families in the respective commit log segment. Then I see a whole
>>> bunch of flushes getting queued up. Can I reach a point where so many
>>> table flushes get queued that writes would be blocked?
>>>
>>>
>>> --
>>>
>>> - John
>>>
>>
>

Re: Client-side timeouts after dropping table

Posted by Jesse Hodges <ho...@gmail.com>.
Can you elaborate on why not 3.7?

On Tue, Sep 20, 2016 at 7:41 PM, Jonathan Haddad <jo...@jonhaddad.com> wrote:

> If you haven't yet deployed to prod I strongly recommend *not* using 3.7.
>
> What network storage are you using?  Outside of a handful of highly
> experienced experts using EBS in very specific ways, it usually ends in
> failure.
>
> On Tue, Sep 20, 2016 at 3:30 PM John Sanda <jo...@gmail.com> wrote:
>
>> I am deploying multiple Java web apps that connect to a Cassandra 3.7
>> instance. Each app creates its own schema at startup. One of the schema
>> changes involves dropping a table. I am seeing frequent client-side
>> timeouts reported by the DataStax driver after the DROP TABLE statement is
>> executed. I don't see this behavior in all environments. I do see it
>> consistently in a QA environment in which Cassandra is running in Docker
>> with network storage, so writes are pretty slow from the get-go. In my logs
>> I see a lot of tables getting flushed, which I guess are all of the dirty
>> column families in the respective commit log segment. Then I see a whole
>> bunch of flushes getting queued up. Can I reach a point where so many
>> table flushes get queued that writes would be blocked?
>>
>>
>> --
>>
>> - John
>>
>

Re: Client-side timeouts after dropping table

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
If you haven't yet deployed to prod I strongly recommend *not* using 3.7.

What network storage are you using?  Outside of a handful of highly
experienced experts using EBS in very specific ways, it usually ends in
failure.

On Tue, Sep 20, 2016 at 3:30 PM John Sanda <jo...@gmail.com> wrote:

> I am deploying multiple Java web apps that connect to a Cassandra 3.7
> instance. Each app creates its own schema at startup. One of the schema
> changes involves dropping a table. I am seeing frequent client-side
> timeouts reported by the DataStax driver after the DROP TABLE statement is
> executed. I don't see this behavior in all environments. I do see it
> consistently in a QA environment in which Cassandra is running in Docker
> with network storage, so writes are pretty slow from the get-go. In my logs
> I see a lot of tables getting flushed, which I guess are all of the dirty
> column families in the respective commit log segment. Then I see a whole
> bunch of flushes getting queued up. Can I reach a point where so many
> table flushes get queued that writes would be blocked?
>
>
> --
>
> - John
>