Posted to user@cassandra.apache.org by Klaus Brunner <kl...@gmail.com> on 2013/12/06 11:01:56 UTC

OOMs during high (read?) load in Cassandra 1.2.11

We're getting fairly reproducible OOMs on a 2-node cluster using
Cassandra 1.2.11, typically in situations with a heavy read load. A
sample of some stack traces is at
https://gist.github.com/KlausBrunner/7820902 - they're all failing
somewhere down from table.getRow(), though I don't know if that's part
of query processing, compaction, or something else.

- The main CFs contain some 100k rows, none of them particularly wide.
- Heap dumps invariably show a single huge byte array (~1.6 GiB
associated with the OOM'ing thread) hogging > 80% of the Java heap.
The array seems to contain all/many rows of one CF.
- We're moderately certain there's no "killer query" with a huge
result set involved here, but we can't see exactly what triggers this.
- We've tried to switch to LeveledCompaction, to no avail.
- Xms/Xmx are both set to about 4 GB.
- The logs show the usual signs of panic ("flushing memtables") before
actually OOMing. It seems this scenario often, perhaps always, occurs
after a compaction, but that's not quite conclusive.
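
For reference, heap sizing in Cassandra 1.2 normally lives in
conf/cassandra-env.sh; a fixed 4 GB heap would look something like this
(values illustrative only, not a recommendation):

```shell
# conf/cassandra-env.sh -- illustrative values only
MAX_HEAP_SIZE="4G"   # used for both -Xms and -Xmx
HEAP_NEWSIZE="800M"  # young-generation size (-Xmn)
```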

I'm somewhat worried that Cassandra would read so much data into a
single contiguous byte[] at any point. Could this be related to
compaction? Any ideas what we could do about this?

Thanks

Klaus

Re: OOMs during high (read?) load in Cassandra 1.2.11

Posted by Aaron Morton <aa...@thelastpickle.com>.
Do you have the back trace from the heap dump, so we can see what the array was and what was using it?
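
For anyone following along: assuming a standard JDK on the node, a dump and
the array's GC-root path can be obtained with the stock tools (the pid and
file path below are placeholders):

```shell
# Dump live objects from the running Cassandra JVM (find <pid> with jps).
jmap -dump:live,format=b,file=/tmp/cassandra-heap.hprof <pid>

# Then open the .hprof in Eclipse MAT and run "Path to GC Roots" on the
# dominating byte[], or browse it with the JDK's bundled jhat:
jhat -J-Xmx6g /tmp/cassandra-heap.hprof
```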

Cheers

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 10/12/2013, at 4:41 am, Klaus Brunner <kl...@gmail.com> wrote:

> 2013/12/9 Nate McCall <na...@thelastpickle.com>:
>> Do you have any secondary indexes defined in the schema? That could lead to
>> a 'mega row' pretty easily depending on the cardinality of the value.
> 
> That's an interesting point - but no, we don't have any secondary
> indexes anywhere. From the heap dump, it's fairly evident that it's
> not a single huge row but actually many rows.
> 
> I'll keep watching to see if this occurs again, or whether the compaction fixed it for good.
> 
> Thanks,
> 
> Klaus


Re: OOMs during high (read?) load in Cassandra 1.2.11

Posted by Klaus Brunner <kl...@gmail.com>.
2013/12/9 Nate McCall <na...@thelastpickle.com>:
> Do you have any secondary indexes defined in the schema? That could lead to
> a 'mega row' pretty easily depending on the cardinality of the value.

That's an interesting point - but no, we don't have any secondary
indexes anywhere. From the heap dump, it's fairly evident that it's
not a single huge row but actually many rows.

I'll keep watching to see if this occurs again, or whether the compaction fixed it for good.

Thanks,

Klaus

Re: OOMs during high (read?) load in Cassandra 1.2.11

Posted by Nate McCall <na...@thelastpickle.com>.
Do you have any secondary indexes defined in the schema? That could lead to
a 'mega row' pretty easily depending on the cardinality of the value.
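
To put rough numbers on the cardinality point (a back-of-the-envelope
sketch, not measured data): a secondary index keeps one internal index row
per distinct value, so low cardinality funnels many entries into a few very
wide rows.

```shell
# Back-of-the-envelope: entries per index row ~= total rows / cardinality.
total_rows=100000   # ~100k rows, as in the original post
cardinality=2       # e.g. a boolean-like indexed column
echo $(( total_rows / cardinality ))   # prints 50000
```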


On Mon, Dec 9, 2013 at 3:02 AM, Klaus Brunner <kl...@gmail.com> wrote:

> We're running largely default settings, apart from the shard (1) and
> replica (0-n) counts, the EC2-related snitch, etc. No row caching at
> all. The logs never showed the same kind of entries pre-OOM; it
> basically occurred out of the blue.
>
> However, it seems that the problem has now subsided after forcing a
> compaction (using LeveledCompaction) that took several hours. Not sure
> if that's a permanent solution yet, but things are looking good so
> far.
>
> Klaus
>
>
> 2013/12/6 Vicky Kak <vi...@gmail.com>:
> > I am not sure if you got a chance to take a look at these:
> > http://www.datastax.com/docs/1.1/troubleshooting/index#oom
> > http://www.datastax.com/docs/1.1/install/recommended_settings
> >
> > Can you attach the Cassandra logs and cassandra.yaml? They should give
> > us more details about the issue.
> >
> > Thanks,
> > -VK
> >
>



-- 
-----------------
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

Re: OOMs during high (read?) load in Cassandra 1.2.11

Posted by Klaus Brunner <kl...@gmail.com>.
We're running largely default settings, apart from the shard (1) and
replica (0-n) counts, the EC2-related snitch, etc. No row caching at
all. The logs never showed the same kind of entries pre-OOM; it
basically occurred out of the blue.

However, it seems that the problem has now subsided after forcing a
compaction (using LeveledCompaction) that took several hours. Not sure
if that's a permanent solution yet, but things are looking good so
far.
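
For the record, the forced compaction mentioned above can be run with
nodetool (the keyspace and column family names below are placeholders):

```shell
# Force a major compaction of one column family -- placeholder names.
nodetool -h localhost compact my_keyspace my_cf
```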

Klaus


2013/12/6 Vicky Kak <vi...@gmail.com>:
> I am not sure if you got a chance to take a look at these:
> http://www.datastax.com/docs/1.1/troubleshooting/index#oom
> http://www.datastax.com/docs/1.1/install/recommended_settings
>
> Can you attach the Cassandra logs and cassandra.yaml? They should give us
> more details about the issue.
>
> Thanks,
> -VK
>

Re: OOMs during high (read?) load in Cassandra 1.2.11

Posted by Vicky Kak <vi...@gmail.com>.
I am not sure if you got a chance to take a look at these:
http://www.datastax.com/docs/1.1/troubleshooting/index#oom
http://www.datastax.com/docs/1.1/install/recommended_settings

Can you attach the Cassandra logs and cassandra.yaml? They should give us
more details about the issue.

Thanks,
-VK


On Fri, Dec 6, 2013 at 3:31 PM, Klaus Brunner <kl...@gmail.com> wrote:

> We're getting fairly reproducible OOMs on a 2-node cluster using
> Cassandra 1.2.11, typically in situations with a heavy read load. A
> sample of some stack traces is at
> https://gist.github.com/KlausBrunner/7820902 - they're all failing
> somewhere down from table.getRow(), though I don't know if that's part
> of query processing, compaction, or something else.
>
> - The main CFs contain some 100k rows, none of them particularly wide.
> - Heap dumps invariably show a single huge byte array (~1.6 GiB
> associated with the OOM'ing thread) hogging > 80% of the Java heap.
> The array seems to contain all/many rows of one CF.
> - We're moderately certain there's no "killer query" with a huge
> result set involved here, but we can't see exactly what triggers this.
> - We've tried to switch to LeveledCompaction, to no avail.
> - Xms/Xmx are both set to about 4 GB.
> - The logs show the usual signs of panic ("flushing memtables") before
> actually OOMing. It seems this scenario often, perhaps always, occurs
> after a compaction, but that's not quite conclusive.
>
> I'm somewhat worried that Cassandra would read so much data into a
> single contiguous byte[] at any point. Could this be related to
> compaction? Any ideas what we could do about this?
>
> Thanks
>
> Klaus
>

Re: OOMs during high (read?) load in Cassandra 1.2.11

Posted by Jason Wee <pe...@gmail.com>.
Hi,

Just taking a wild shot here, sorry if it does not help. Could it be thrown
while reading an SSTable? If so, try finding the configuration parameters
for read operations and tuning those settings down a little. Also check
the chunk_length_kb setting.

http://www.datastax.com/documentation/cql/3.1/webhelp/cql/cql_reference/cql_storage_options_c.html
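
To make the chunk_length_kb suggestion concrete: compressed SSTables are
read one chunk at a time, so each in-flight read holds roughly one
decompressed chunk on the heap. A rough sketch with default-ish numbers
(illustrative only):

```shell
# Rough upper bound on decompressed-chunk memory held by readers.
chunk_length_kb=64    # default compression chunk size
concurrent_reads=32   # default concurrent_reads in cassandra.yaml
echo "$(( chunk_length_kb * concurrent_reads )) KB"   # prints "2048 KB"
```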

/Jason


On Fri, Dec 6, 2013 at 6:01 PM, Klaus Brunner <kl...@gmail.com> wrote:

> We're getting fairly reproducible OOMs on a 2-node cluster using
> Cassandra 1.2.11, typically in situations with a heavy read load. A
> sample of some stack traces is at
> https://gist.github.com/KlausBrunner/7820902 - they're all failing
> somewhere down from table.getRow(), though I don't know if that's part
> of query processing, compaction, or something else.
>
> - The main CFs contain some 100k rows, none of them particularly wide.
> - Heap dumps invariably show a single huge byte array (~1.6 GiB
> associated with the OOM'ing thread) hogging > 80% of the Java heap.
> The array seems to contain all/many rows of one CF.
> - We're moderately certain there's no "killer query" with a huge
> result set involved here, but we can't see exactly what triggers this.
> - We've tried to switch to LeveledCompaction, to no avail.
> - Xms/Xmx are both set to about 4 GB.
> - The logs show the usual signs of panic ("flushing memtables") before
> actually OOMing. It seems this scenario often, perhaps always, occurs
> after a compaction, but that's not quite conclusive.
>
> I'm somewhat worried that Cassandra would read so much data into a
> single contiguous byte[] at any point. Could this be related to
> compaction? Any ideas what we could do about this?
>
> Thanks
>
> Klaus
>