Posted to user@cassandra.apache.org by Kevin Burton <bu...@spinn3r.com> on 2014/06/30 18:53:59 UTC

"Compacting large row … incrementally" … with HUGE values.

I'm running a full compaction now and noticed this:

Compacting large row … incrementally

… and the values were in the 300-500MB range.

I'm storing NOTHING anywhere near that large.  Max is about 200k...

However, I've designed my schema so that I can do efficient time/range
scans of the data, placing things into buckets.

So my schema looks like:

bucket,
timestamp

… and the partition key is 'bucket', with 'timestamp' as a clustering
column.  Does that mean that EVERYTHING is stored in one physical "row"
under 'bucket'?

So even though my INSERTs are only about 200k each, they're all pooling
under the same 'bucket' partition key, so cassandra is going to have a hard
time compacting them.
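
To make the layout concrete, a table shaped like that would be declared in
CQL roughly as follows (the table and value column names here are just
hypothetical placeholders):

    CREATE TABLE events (
        bucket    text,
        timestamp timestamp,
        value     blob,          -- the ~200k payload
        PRIMARY KEY ((bucket), timestamp)
    );

Everything inserted with the same 'bucket' value is stored under the same
partition, i.e. the same physical row on disk.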

Part of the problem here is the serious abuse of vocabulary.  The
thrift/CQL impedance mismatch means that things have slightly different
names and not-so-straightforward nomenclature, so it's confusing as to
what's actually happening under the hood.

….

Then I saw:

http://mail-archives.apache.org/mod_mbox/cassandra-user/201106.mbox/%3CBANLkTik0g+ePQ4CtW28ty+dpexprtiSwLQ@mail.gmail.com%3E


look for in_memory_compaction_limit_in_mb in cassandra.yaml
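
In a 2.0-era cassandra.yaml that setting looks like this (64MB being, if
memory serves, the shipped default):

    # Size limit for rows being compacted in memory.  Larger rows spill
    # to disk and are compacted incrementally in a slower two-pass manner.
    in_memory_compaction_limit_in_mb: 64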


… so this seems like it will be a problem and slow me down moving forward,
unless I figure out a workaround.

-- 

Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>

Re: "Compacting large row … incrementally" … with HUGE values.

Posted by Kevin Burton <bu...@spinn3r.com>.
yup… that's what I was thinking… but good point on the physical vs. logical
row.  Cassandra should be more rigorous about this term; it just says "large
row", not "large physical row".

…

Any idea how much this is going to slow me down?


On Mon, Jun 30, 2014 at 10:10 AM, DuyHai Doan <do...@gmail.com> wrote:

> Hello Kevin.
>
>  With CQL3 there are some important terms to define:
>
>  a. Row: a logical row in the CQL3 semantics; a logical row is what
> is displayed as a row in the cqlsh client
>  b. Partition: a physical row on disk in the CQL3 semantics
>
> Even if you have tiny logical rows, if you store a lot of them under the
> same partition (physical row on disk) it can add up to a lot.
>
> Quick maths: 200k per logical row * 1000 logical rows = roughly 200MB
> for the partition



Re: "Compacting large row … incrementally" … with HUGE values.

Posted by DuyHai Doan <do...@gmail.com>.
Hello Kevin.

 With CQL3 there are some important terms to define:

 a. Row: a logical row in the CQL3 semantics; a logical row is what is
displayed as a row in the cqlsh client
 b. Partition: a physical row on disk in the CQL3 semantics

Even if you have tiny logical rows, if you store a lot of them under the
same partition (physical row on disk) it can add up to a lot.

Quick maths: 200k per logical row * 1000 logical rows = roughly 200MB for
the partition
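
Concretely (a sketch, with hypothetical table/column names): every INSERT
that shares a bucket value lands in the same partition, so the per-row
sizes accumulate there:

    -- all of these land in the single partition 'b1'
    INSERT INTO events (bucket, timestamp, value)
        VALUES ('b1', '2014-06-30 10:00:00', 0xCAFE);
    INSERT INTO events (bucket, timestamp, value)
        VALUES ('b1', '2014-06-30 10:00:01', 0xBABE);
    -- ... 1000 such inserts of ~200k each => one ~200MB physical row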

