Posted to user@cassandra.apache.org by Kevin Burton <bu...@spinn3r.com> on 2014/06/30 18:53:59 UTC
"Compacting large row … incrementally" … with HUGE values.
I'm running a full compaction now and noticed this:
Compacting large row … incrementally
… and the values were in the 300-500MB range.
I'm storing NOTHING anywhere near that large. Max is about 200k...
However, I'm storing my schema in a way so that I can do efficient
time/range scans of the data and placing things into buckets.
So my schema looks like:
bucket,
timestamp
… and the partition key is bucket. Since this is a clustering row, does
that mean that EVERYTHING is in one "row" under 'bucket' ?
So even though my INSERTs are only around 200k each, they're all pooling
under the same 'bucket', which is the partition key, so Cassandra is going
to have a hard time compacting them.
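For reference, the schema described would look roughly like this in CQL (table and column names are invented here, not taken from the thread); the single partition key is what pins everything under one physical row:

```cql
-- Sketch only: names are placeholders, not the actual schema from the thread.
CREATE TABLE content (
    bucket    text,       -- partition key: all entries for a bucket share
                          -- one physical row on disk
    timestamp timeuuid,   -- clustering column: orders entries within the bucket
    body      blob,       -- the ~200k payload
    PRIMARY KEY ((bucket), timestamp)
);
```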
Part of the problem here is the serious abuse of vocabulary. The
thrift/CQL impedance mismatch means that things have slightly different
names and not-so-straightforward nomenclature, which makes it confusing to
understand what's actually happening under the hood.
….
Then I saw:
http://mail-archives.apache.org/mod_mbox/cassandra-user/201106.mbox/%3CBANLkTik0g+ePQ4CtW28ty+dpexprtiSwLQ@mail.gmail.com%3E
look for in_memory_compaction_limit_in_mb in cassandra.yaml
… so this seems like it will be a problem and slow me down moving forward.
Unless I figure out a workaround.
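One common workaround (a sketch under assumed names, not something proposed in this thread) is to fold a coarse time window into the partition key, so a logical bucket is spread across many bounded physical rows; range scans then walk the windows in order:

```python
from datetime import datetime, timezone

def partition_key(bucket: str, ts: datetime, window_hours: int = 1) -> str:
    """Combine the logical bucket with a coarse time window so no single
    physical row grows without bound (hypothetical helper)."""
    window = int(ts.timestamp()) // (window_hours * 3600)
    return f"{bucket}:{window}"

# Writes in the same hour land in the same partition; the next hour
# starts a fresh one, keeping each physical row's size bounded.
a = partition_key("crawl", datetime(2014, 6, 30, 10, 5, tzinfo=timezone.utc))
b = partition_key("crawl", datetime(2014, 6, 30, 10, 55, tzinfo=timezone.utc))
c = partition_key("crawl", datetime(2014, 6, 30, 11, 5, tzinfo=timezone.utc))
print(a == b, a == c)  # True False
```

The trade-off is that a time-range query spanning several windows has to issue one query per window, but each partition stays small enough to compact cheaply.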
--
Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>
Re: "Compacting large row … incrementally" … with HUGE values.
Posted by Kevin Burton <bu...@spinn3r.com>.
yup… that's what I was thinking… but good point on the physical vs.
logical row. Cassandra should be more rigorous about this term… the log
just says "large row", not "large physical row".
…
Any idea how much this is going to slow me down?
On Mon, Jun 30, 2014 at 10:10 AM, DuyHai Doan <do...@gmail.com> wrote:
> Hello Kevin.
>
> With CQL3 there are some important terms to define:
>
> a. Row : means a logical row in the CQL3 semantics, logical row is what
> is displayed as a row in cqlsh client
> b. Partition: means a physical row on disk in the CQL3 semantics
>
> Even if you have tiny logical rows, if you store a lot of them under the
> same partition (physical row on disk) then it can add up a lot.
>
> Quick maths: 200k per logical row * 1,000 logical rows = roughly 200 MB
> for the partition
>
>
Re: "Compacting large row … incrementally" … with HUGE values.
Posted by DuyHai Doan <do...@gmail.com>.
Hello Kevin.
With CQL3 there are some important terms to define:
a. Row: a logical row in the CQL3 semantics; the logical row is what is
displayed as a row in the cqlsh client
b. Partition: a physical row on disk in the CQL3 semantics
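To make that concrete, here's a hypothetical table (names invented for illustration) with PRIMARY KEY ((bucket), timestamp):

```cql
-- cqlsh displays three logical rows:
SELECT bucket, timestamp FROM content WHERE bucket = 'crawl';
--  bucket | timestamp
-- --------+-----------
--  crawl  | t1
--  crawl  | t2
--  crawl  | t3
-- ...but on disk all three are cells inside ONE partition (one physical
-- row), because they share the partition key value 'crawl'.
```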
Even if you have tiny logical rows, if you store a lot of them under the
same partition (physical row on disk) then it can add up a lot.
Quick maths: 200k per logical row * 1,000 logical rows = roughly 200 MB for
the partition
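The same back-of-the-envelope arithmetic in code (these are DuyHai's illustrative numbers, not measurements):

```python
avg_logical_row_bytes = 200_000      # ~200k per logical row
logical_rows = 1_000                 # logical rows under one partition key

partition_bytes = avg_logical_row_bytes * logical_rows
print(partition_bytes // 1_000_000, "MB")  # 200 MB for one physical row
```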