Posted to user@cassandra.apache.org by Roni Balthazar <ro...@gmail.com> on 2015/01/08 20:14:04 UTC

High read latency after data volume increased

Hi there,

We are using C* 2.1.2 with 2 DCs. 30 nodes DC1 and 10 nodes DC2.

While our data volume is increasing (34 TB now), we are running into
some problems:

1) Read latency is around 1000 ms when running 600 reads/sec (DC1
CL.LOCAL_ONE). At the same time the load average is about 20-30 on all
DC1 nodes (8-core CPU, 32 GB RAM). C* starts timing out connections.
In this scenario OpsCenter has some issues as well: it resets the
layout of all graphs to the default on every refresh, and it doesn't
go back to normal after the load decreases. I only managed to restore
OpsCenter to its normal behavior by reinstalling it.
Just for reference, we are using SATA HDDs on all nodes. Running
hdparm to check disk performance under this load, some nodes
report very low read rates (under 10 MB/sec), while others report above
100 MB/sec. Under low load average this rate is above 250 MB/sec.

2) Repair takes at least 4-5 days to complete. Last repair was 20 days
ago. Running repair under high loads is bringing some nodes down with
the exception: "JVMStabilityInspector.java:94 - JVM state determined
to be unstable. Exiting forcefully due to: java.lang.OutOfMemoryError:
Java heap space"
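
For a sense of scale on item 1, Little's law (requests in flight = arrival
rate x mean latency) can be applied to the figures quoted above. This is only
a rough sketch: the even spread across the 30 DC1 nodes is an assumption, and
the function name is made up for illustration.

```python
# Little's law: mean requests in flight = arrival rate * mean latency.
def in_flight(reads_per_sec: float, latency_ms: float) -> float:
    """Average number of reads outstanding at any instant."""
    return reads_per_sec * (latency_ms / 1000.0)


# Figures quoted in the post above.
READS_PER_SEC = 600
LATENCY_MS = 1000
DC1_NODES = 30

total = in_flight(READS_PER_SEC, LATENCY_MS)
# Assumption: load is spread evenly over the nodes, which real clusters
# rarely achieve exactly.
per_node = total / DC1_NODES

print(f"outstanding reads in DC1: {total:.0f}")
print(f"per node, if evenly spread: {per_node:.0f}")
```

At 1000 ms per read, roughly 600 reads are outstanding at once across DC1,
which is consistent with requests piling up and connections timing out.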

Any hints?

Regards,

Roni Balthazar

Re: High read latency after data volume increased

Posted by Jonathan Lacefield <jl...@datastax.com>.
There are likely two things occurring:

1) The cfhistograms error is due to
https://issues.apache.org/jira/browse/CASSANDRA-8028
which is resolved in 2.1.3.  It looks like voting is under way for 2.1.3. As
rcoli mentioned, you are running the latest open source release of C*, which
should be treated as beta until a few dot releases are published.

2) Compaction running all the time doesn't mean that compaction is "caught
up".  It's possible that the nodes are behind in compaction, which will
cause slow reads.  C* read performance is typically tied to disk
system performance, both to service reads from disk and to enable
fast background processing, like compaction.   You mentioned RAIDed HDDs.
What type of RAID is configured?  How fast are your disks responding?  You
may want to check iostat to see how large your queues and awaits are.  If
the await is high, then you could be experiencing disk performance issues
impacting reads.
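
One way to act on the iostat suggestion is to capture `iostat -x` output and
flag devices whose await is high. The sketch below looks columns up by header
name, since the await columns differ between sysstat versions (`await` in
older ones, `r_await`/`w_await` in newer ones). The 20 ms threshold, the
function name, and the sample numbers are all assumptions for illustration,
and the sample uses a simplified set of columns.

```python
def high_await_devices(iostat_text, threshold_ms=20.0):
    """Return {device: worst await in ms} for devices above threshold_ms."""
    header = None
    flagged = {}
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            header = None  # a blank line ends a report block
            continue
        if fields[0].rstrip(":") == "Device":
            header = fields  # remember column names for the rows below
            continue
        if header is None or len(fields) != len(header):
            continue  # not a device row (e.g. the avg-cpu section)
        # Collect whichever await columns this sysstat version prints.
        awaits = [float(value) for name, value in zip(header, fields)
                  if name in ("await", "r_await", "w_await")]
        if awaits and max(awaits) > threshold_ms:
            device = fields[0]
            flagged[device] = max(flagged.get(device, 0.0), max(awaits))
    return flagged


# Made-up sample, not taken from the cluster discussed in this thread.
SAMPLE = """\
Device:  r/s   w/s  rkB/s  wkB/s  avgqu-sz  await  %util
sda     95.0  12.0  820.0  300.0      9.80  142.5   99.2
sdb      4.0   3.0   64.0   48.0      0.02    3.1    1.4
"""

print(high_await_devices(SAMPLE))
```

What counts as a "high" await depends on the hardware; spinning SATA disks
under random reads will sit far above what SSDs show, so the threshold should
be tuned per setup.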

Hope this helps


On Jan 9, 2015, at 9:29 AM, Roni Balthazar <ro...@gmail.com> wrote:

Hi there,

Compaction keeps running continuously under our workload.
We are using SATA HDDs in RAID.

When trying to run cfhistograms on our user_data table, we are getting
this message:
nodetool: Unable to compute when histogram overflowed

Please see what happens when running some queries on this cf:
http://pastebin.com/jbAgDzVK

Thanks,

Roni Balthazar

On Fri, Jan 9, 2015 at 12:03 PM, datastax <jl...@datastax.com> wrote:

Hello

You may not be experiencing versioning issues.   Do you know if compaction
is keeping up with your workload?  The behavior described in the subject is
typically associated with compaction falling behind or having a suboptimal
compaction strategy configured.   What does the output of nodetool
cfhistograms <keyspace> <table> look like for a table that is experiencing
this issue?  Also, what type of disks are you using on the nodes?

Sent from my iPad

On Jan 9, 2015, at 8:55 AM, Brian Tarbox <br...@gmail.com> wrote:

C* seems to have more than its share of "version x doesn't work, use version
y " type issues....

On Thu, Jan 8, 2015 at 2:23 PM, Robert Coli <rc...@eventbrite.com> wrote:

On Thu, Jan 8, 2015 at 11:14 AM, Roni Balthazar <ro...@gmail.com>
wrote:

We are using C* 2.1.2 with 2 DCs. 30 nodes DC1 and 10 nodes DC2.

https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/

2.1.2 in particular is known to have significant issues. You'd be better
off running 2.1.1 ...

=Rob

--
http://about.me/BrianTarbox

Re: High read latency after data volume increased

Posted by Roni Balthazar <ro...@gmail.com>.
Hi there,

Compaction keeps running continuously under our workload.
We are using SATA HDDs in RAID.

When trying to run cfhistograms on our user_data table, we are getting
this message:
nodetool: Unable to compute when histogram overflowed

Please see what happens when running some queries on this cf:
http://pastebin.com/jbAgDzVK

Thanks,

Roni Balthazar

On Fri, Jan 9, 2015 at 12:03 PM, datastax <jl...@datastax.com> wrote:
> Hello
>
>   You may not be experiencing versioning issues.   Do you know if compaction
> is keeping up with your workload?  The behavior described in the subject is
> typically associated with compaction falling behind or having a suboptimal
> compaction strategy configured.   What does the output of nodetool
> cfhistograms <keyspace> <table> look like for a table that is experiencing
> this issue?  Also, what type of disks are you using on the nodes?
>
> Sent from my iPad
>
> On Jan 9, 2015, at 8:55 AM, Brian Tarbox <br...@gmail.com> wrote:
>
> C* seems to have more than its share of "version x doesn't work, use version
> y " type issues....
>
> On Thu, Jan 8, 2015 at 2:23 PM, Robert Coli <rc...@eventbrite.com> wrote:
>>
>> On Thu, Jan 8, 2015 at 11:14 AM, Roni Balthazar <ro...@gmail.com>
>> wrote:
>>>
>>> We are using C* 2.1.2 with 2 DCs. 30 nodes DC1 and 10 nodes DC2.
>>
>>
>> https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/
>>
>> 2.1.2 in particular is known to have significant issues. You'd be better
>> off running 2.1.1 ...
>>
>> =Rob
>>
>
>
>
>
> --
> http://about.me/BrianTarbox

Re: High read latency after data volume increased

Posted by datastax <jl...@datastax.com>.
Hello

  You may not be experiencing versioning issues.   Do you know if compaction is keeping up with your workload?  The behavior described in the subject is typically associated with compaction falling behind or having a suboptimal compaction strategy configured.   What does the output of nodetool cfhistograms <keyspace> <table> look like for a table that is experiencing this issue?  Also, what type of disks are you using on the nodes?
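
A quick way to check whether compaction is keeping up is to watch the pending
task count reported by `nodetool compactionstats` over time. The sketch below
assumes the output contains a line of the form `pending tasks: N`, which is
how I recall it looking on 2.x; verify against your own nodes. The function
names and the sample text are illustrative only.

```python
import re
import subprocess


def pending_compactions(compactionstats_text):
    """Return the integer after 'pending tasks:', or None if absent."""
    match = re.search(r"^pending tasks:\s*(\d+)", compactionstats_text,
                      re.MULTILINE)
    return int(match.group(1)) if match else None


def read_pending(nodetool="nodetool"):
    """Run `nodetool compactionstats` on this host and parse its output."""
    result = subprocess.run([nodetool, "compactionstats"],
                            capture_output=True, text=True, check=True)
    return pending_compactions(result.stdout)


# Illustrative text only; on a real node call read_pending() instead.
sample = "pending tasks: 137\n"
print(pending_compactions(sample))
```

A count that stays high or keeps growing between samples suggests compaction
is falling behind, even if a compaction is always shown as running.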

Sent from my iPad

> On Jan 9, 2015, at 8:55 AM, Brian Tarbox <br...@gmail.com> wrote:
> 
> C* seems to have more than its share of "version x doesn't work, use version y " type issues....
> 
>> On Thu, Jan 8, 2015 at 2:23 PM, Robert Coli <rc...@eventbrite.com> wrote:
>>> On Thu, Jan 8, 2015 at 11:14 AM, Roni Balthazar <ro...@gmail.com> wrote:
>>> We are using C* 2.1.2 with 2 DCs. 30 nodes DC1 and 10 nodes DC2.
>> 
>> https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/
>> 
>> 2.1.2 in particular is known to have significant issues. You'd be better off running 2.1.1 ...
>> 
>> =Rob
> 
> 
> 
> -- 
> http://about.me/BrianTarbox

RE: High read latency after data volume increased

Posted by Jason Kushmaul | WDA <ja...@wda.com>.
I was about to say I thought 2.1 was a development version, but when I went to prove that to myself:
http://cassandra.apache.org/download/
“The latest stable release of Apache Cassandra is 2.1.2 (released on 2014-11-10). If you're just starting out, download this one.”

But then, after visiting Planet Cassandra (this is what I was thinking of, I had just read it)
http://planetcassandra.org/cassandra/
“
v2.0.11 (Stable & Recommended)
v2.1.2 (Latest Development Release)
v1.2.19 (Archive)
”

There seems to be a mixed message about what is stable between the two sites…

Jason

From: Brian Tarbox [mailto:briantarbox@gmail.com]
Sent: Friday, January 9, 2015 8:56 AM
To: user@cassandra.apache.org
Subject: Re: High read latency after data volume increased

C* seems to have more than its share of "version x doesn't work, use version y " type issues....

On Thu, Jan 8, 2015 at 2:23 PM, Robert Coli <rc...@eventbrite.com> wrote:
On Thu, Jan 8, 2015 at 11:14 AM, Roni Balthazar <ro...@gmail.com> wrote:
We are using C* 2.1.2 with 2 DCs. 30 nodes DC1 and 10 nodes DC2.

https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/

2.1.2 in particular is known to have significant issues. You'd be better off running 2.1.1 ...

=Rob




--
http://about.me/BrianTarbox

Re: High read latency after data volume increased

Posted by Brian Tarbox <br...@gmail.com>.
C* seems to have more than its share of "version x doesn't work, use
version y " type issues....

On Thu, Jan 8, 2015 at 2:23 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Thu, Jan 8, 2015 at 11:14 AM, Roni Balthazar <ro...@gmail.com>
> wrote:
>
>> We are using C* 2.1.2 with 2 DCs. 30 nodes DC1 and 10 nodes DC2.
>>
>
> https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/
>
> 2.1.2 in particular is known to have significant issues. You'd be better
> off running 2.1.1 ...
>
> =Rob
>
>



-- 
http://about.me/BrianTarbox

Re: High read latency after data volume increased

Posted by Robert Coli <rc...@eventbrite.com>.
On Thu, Jan 8, 2015 at 6:38 PM, Roni Balthazar <ro...@gmail.com>
wrote:

> We downgraded to 2.1.1, but got the very same result. The read latency is
> still high, but we figured out that it happens only using a specific
> keyspace.
>

Note that downgrading is officially unsupported, but is probably safe
between those two versions.

Enable tracing and paste results for a high latency query.

Also, how much RAM is used for heap?

=Rob
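
On the heap question: if the heap size was never set explicitly, the stock
cassandra-env.sh derives it from system memory and core count. The sketch
below reproduces that rule as I remember it from the 2.x script
(max heap = max(min(1/2 RAM, 1024 MB), min(1/4 RAM, 8192 MB)), new generation
= min(100 MB per core, 1/4 of heap)). Treat the formula as an assumption and
check your own cassandra-env.sh; the function name is made up.

```python
def default_heap_mb(system_ram_mb, cpu_cores):
    """Return (max_heap_mb, new_gen_mb) following the stock script's rule."""
    half = min(system_ram_mb // 2, 1024)
    quarter = min(system_ram_mb // 4, 8192)
    max_heap = max(half, quarter)
    # Young generation: 100 MB per core, capped at a quarter of the heap.
    new_gen = min(100 * cpu_cores, max_heap // 4)
    return max_heap, new_gen


# Nodes described in this thread: 8 cores, nominally 32 GB RAM.
heap, new_gen = default_heap_mb(32 * 1024, 8)
print(heap, new_gen)
```

For a nominal 32 GB, 8-core node this gives an 8192 MB heap with an 800 MB
new generation. The OS usually reports slightly less than the nominal RAM, so
the real default can come out a little lower.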

Re: High read latency after data volume increased

Posted by Roni Balthazar <ro...@gmail.com>.
Hi Robert,

We downgraded to 2.1.1, but got the very same result. The read latency is
still high, but we figured out that it happens only using a specific
keyspace.
Please see the graphs below...

Trying another keyspace with 600+ reads/sec, we are getting the acceptable
~30ms read latency.

Let me know if I need to provide more information.

Thanks,

Roni Balthazar

On Thu, Jan 8, 2015 at 5:23 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Thu, Jan 8, 2015 at 11:14 AM, Roni Balthazar <ro...@gmail.com>
> wrote:
>
>> We are using C* 2.1.2 with 2 DCs. 30 nodes DC1 and 10 nodes DC2.
>>
>
> https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/
>
> 2.1.2 in particular is known to have significant issues. You'd be better
> off running 2.1.1 ...
>
> =Rob
>
>

Re: High read latency after data volume increased

Posted by Robert Coli <rc...@eventbrite.com>.
On Thu, Jan 8, 2015 at 11:14 AM, Roni Balthazar <ro...@gmail.com>
wrote:

> We are using C* 2.1.2 with 2 DCs. 30 nodes DC1 and 10 nodes DC2.
>

https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/

2.1.2 in particular is known to have significant issues. You'd be better
off running 2.1.1 ...

=Rob