Posted to user@cassandra.apache.org by Kevin Burton <bu...@spinn3r.com> on 2011/09/04 01:00:39 UTC

Not all data structures need timestamps (and don't require wasted memory).

I was thinking more about the excessive (IMO) use of memory in Cassandra due
to the 8-byte timestamp stored with every column (cell).

Any operation that is idempotent does not require a timestamp.

For example, set membership.

A link adjacency list is a good example.

If you have a list of source->targets, adding a new member to 'targets'
shouldn't require another timestamp because multiple additions end up with
the same result (it is idempotent.)

This can be modeled by just adding another column.

The results of ETL jobs that are bulk loaded back into Cassandra don't
require timestamps either.  You could create a long-running ZooKeeper (ZK)
lock to represent each load to prevent multiple writers per key.

In these scenarios, timestamps are just a waste of memory, and a
significant one as well. For our usage we would need 3-4x more memory to
deploy Cassandra… I'm not really champing at the bit to pay an extra
$120-150k per month in hosting costs… though I'm sure my hosting provider
would love it :)

Kevin

-- 

Founder/CEO Spinn3r.com

Location: *San Francisco, CA*
Skype: *burtonator*

Skype-in: *(415) 871-0687*

Re: Not all data structures need timestamps (and don't require wasted memory).

Posted by Colin <co...@gmail.com>.
Kevin,

You will find that many of us using Cassandra are already doing what you suggest (custom serializer/deserializer).

We call it JSON.
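
A minimal sketch of that approach, assuming a row is modeled as a plain
dict and the column name "targets" and the example URLs are hypothetical.
It packs a whole set of link targets into one JSON string, so the row
carries a single column (and a single timestamp) instead of one per target.

```python
import json


def pack_targets(targets):
    """Serialize a set of link targets into one JSON string (one fat column value)."""
    # Sorting makes the encoding deterministic: equal sets give equal strings.
    return json.dumps(sorted(targets))


def unpack_targets(blob):
    """Decode the JSON column value back into a set of targets."""
    return set(json.loads(blob))


# Hypothetical row: a single column holds every target for one source.
row = {"targets": pack_targets({"http://b.example", "http://c.example"})}

# Adding a member means reading, modifying, and rewriting the whole value.
targets = unpack_targets(row["targets"])
targets.add("http://d.example")
row["targets"] = pack_targets(targets)
```

The trade-off is that conflict resolution now applies to the whole value:
two writers that update the same row concurrently can overwrite each
other's additions, which separate columns would have kept.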

--
Colin

*Sent from Star Trek like flat panel device, which although larger than my Star Trek like communicator device, may have typo's and exhibit improper grammar due to haste and less than perfect use of the virtual keyboard*
 

On Sep 4, 2011, at 12:11 AM, Kevin Burton <bu...@spinn3r.com> wrote:

> 
> 
> On Sat, Sep 3, 2011 at 8:53 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> I strongly suspect that you're optimizing prematurely.  What evidence
> do you have that timestamps are producing unacceptable overhead for
> your workload?  
> 
> It's possible … this is back of the envelope at the moment as right now it's a nonstarter.  
>  
> You do realize that the sparse data model means that
> we spend a lot more than 8 bytes storing column names in-line with
> each column too, right?
> 
> Yeah… this can be mitigated if the column names are your data.
>  
> 
> If disk space is really the limiting factor for your workload, I would
> recommend testing the compression code in trunk.  That will get you a
> lot farther than adding extra options for a very niche scenario.
> 
> 
> Another thing I've been considering is building a serializer/deserializer in front of Cassandra and running my own protocol to talk to it which builds its own encoding per row to avoid using excessive columns.
> 
> Kevin 
> 
> -- 
> Founder/CEO Spinn3r.com
> 
> Location: San Francisco, CA
> Skype: burtonator
> Skype-in: (415) 871-0687
> 

Re: Not all data structures need timestamps (and don't require wasted memory).

Posted by Kevin Burton <bu...@spinn3r.com>.
On Sat, Sep 3, 2011 at 8:53 PM, Jonathan Ellis <jb...@gmail.com> wrote:

> I strongly suspect that you're optimizing prematurely.  What evidence
> do you have that timestamps are producing unacceptable overhead for
> your workload?


It's possible … this is back of the envelope at the moment as right now it's
a nonstarter.


> You do realize that the sparse data model means that
> we spend a lot more than 8 bytes storing column names in-line with
> each column too, right?
>

Yeah… this can be mitigated if the column names are your data.


>
> If disk space is really the limiting factor for your workload, I would
> recommend testing the compression code in trunk.  That will get you a
> lot farther than adding extra options for a very niche scenario.
>
>
Another thing I've been considering is building a serializer/deserializer in
front of Cassandra and running my own protocol to talk to it, with its own
per-row encoding, to avoid using excessive columns.

Kevin

-- 

Founder/CEO Spinn3r.com

Location: *San Francisco, CA*
Skype: *burtonator*

Skype-in: *(415) 871-0687*

Re: Not all data structures need timestamps (and don't require wasted memory).

Posted by Jonathan Ellis <jb...@gmail.com>.
I strongly suspect that you're optimizing prematurely.  What evidence
do you have that timestamps are producing unacceptable overhead for
your workload?  You do realize that the sparse data model means that
we spend a lot more than 8 bytes storing column names in-line with
each column too, right?

If disk space is really the limiting factor for your workload, I would
recommend testing the compression code in trunk.  That will get you a
lot farther than adding extra options for a very niche scenario.
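
To put rough numbers on the comparison between the timestamp and the
column name, here is a back-of-the-envelope sketch. It assumes a
serialized column is laid out as a 2-byte name length, the name, a 1-byte
flags field, the 8-byte timestamp, a 4-byte value length, and the value.
That layout is an assumption for illustration and should be checked
against the version you run; it also says nothing about JVM heap overhead.

```python
# Assumed fixed-size fields of one serialized column, in bytes.
NAME_LEN_PREFIX = 2
FLAGS = 1
TIMESTAMP = 8
VALUE_LEN_PREFIX = 4


def column_size(name_len, value_len):
    """Total serialized size of one column under the assumed layout."""
    return (NAME_LEN_PREFIX + name_len + FLAGS + TIMESTAMP
            + VALUE_LEN_PREFIX + value_len)


def timestamp_share(name_len, value_len):
    """Fraction of the serialized column taken up by the timestamp."""
    return TIMESTAMP / column_size(name_len, value_len)


# Example: a 20-byte column name with an empty value (name-as-data style).
size = column_size(20, 0)
share = timestamp_share(20, 0)
print(f"{size} bytes per column, timestamp is {share:.0%} of it")
```

Under these assumptions the name and the length prefixes outweigh the
timestamp as soon as names are longer than a few bytes.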

On Sat, Sep 3, 2011 at 10:26 PM, Kevin Burton <bu...@spinn3r.com> wrote:
> Sure… I'm willing to concede that Cassandra isn't for everyone but why make
> it worse than it has to be?
> Why 8 bytes?  Why not 64 bytes?
> I imagine even in your situation an 8x increase in storage would not be nice ;)
> The point is that replication in Cassandra only needs timestamps to handle
> out of order writes … for values that are idempotent, this isn't necessary.
>  The order doesn't matter.
> Adding support to Cassandra for variable-width timestamp resolution (1ms is
> probably finer than most uses need) and for turning timestamps off on a
> per-tablespace basis could be really handy.
> Kevin
> On Sat, Sep 3, 2011 at 7:56 PM, Stephen Connolly
> <st...@gmail.com> wrote:
>>
>> Maybe not all NoSQL applications fit Cassandra.
>>
>> The core logic of how Cassandra achieves eventual consistency depends on
>> the per-column timestamps... if they are a pain for you, consider storing
>> your data as a small number of fat columns rather than many skinny ones...
>> either that or look at a different database for your use case. ;-)
>>
>> - Stephen
>>
>> ---
>> Sent from my Android phone, so random spelling mistakes, random nonsense
>> words and other nonsense are a direct result of using swype to type on the
>> screen
>>
>> On 3 Sep 2011 16:01, "Kevin Burton" <bu...@spinn3r.com> wrote:
>
>
>
> --
>
> Founder/CEO Spinn3r.com
>
> Location: San Francisco, CA
> Skype: burtonator
>
> Skype-in: (415) 871-0687
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: Not all data structures need timestamps (and don't require wasted memory).

Posted by David Jeske <da...@gmail.com>.
After writing my message, I recognized a scenario you might be referring to,
Kevin.

If I understand correctly, you're not referring to set-membership in the
general sense, where one could add and remove entries. General
set-membership, in the context of eventual-consistency, requires timestamps.
The timestamps distinguish between the two values "present" and
"not-present". (not-present being represented by timestamped tombstones in
the case of deletion/removal).
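
A minimal sketch of that general case, assuming each replica is a plain
dict that maps a key to a (present, timestamp) pair, where present=False
plays the role of a tombstone. The function names are hypothetical and
timestamp ties are ignored for brevity.

```python
def write(replica, key, present, ts):
    """Apply an add (present=True) or a removal tombstone (present=False)."""
    current = replica.get(key)
    # Keep the incoming write only if it is newer than what the replica holds.
    if current is None or ts > current[1]:
        replica[key] = (present, ts)


def members(replica):
    """Keys whose newest write is an add rather than a tombstone."""
    return {key for key, (present, _) in replica.items() if present}


# Two replicas receive the same add and removal in opposite orders.
a, b = {}, {}
write(a, "x", True, 1)
write(a, "x", False, 2)
write(b, "x", False, 2)
write(b, "x", True, 1)
```

Both replicas end with the tombstone at timestamp 2, so both agree that
"x" is not a member. Without the timestamps, replica b would have no way
to know that the add it saw last was actually the older write.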

So I suppose you're referring to "additive-only set membership", where there
is no need to distinguish between two different states (such as present or
not present in a set), because items can only be added, never changed or
removed. If entries are not allowed to be deleted or modified, then
cassandra-style eventual consistency replication could occur without any
timestamp, because you're simply replicating the existence of keys to all
replicas.
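
A minimal sketch of the additive-only case, assuming a replica is just a
Python set of keys and the edge names are made up for illustration. It
checks that every delivery order of the same adds, including a duplicate,
ends in the same state, and that two replicas can be repaired with a
plain union.

```python
import itertools


def apply_adds(adds):
    """Build a replica's state by applying add operations in the given order."""
    replica = set()
    for item in adds:
        replica.add(item)  # adding an existing member changes nothing
    return replica


# The same adds, with one delivered twice, in every possible order.
adds = ["a->b", "a->c", "a->b", "a->d"]
states = {frozenset(apply_adds(order)) for order in itertools.permutations(adds)}


def repair(r1, r2):
    """Bring two replicas in sync; with add-only data a union is enough."""
    merged = r1 | r2
    return set(merged), set(merged)
```

Here `states` contains exactly one element, which is the sense in which
order does not matter. The sketch has no way to express a removal, which
is the limitation discussed next.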

To me this seems a particularly narrow use-case. Any inadvertent write (even
one from a bug or data corruption) would require very frustrating manual
intervention to remove. (You'd have to manually shut down all nodes, manually
purge bad values out of the dataset, then bring the nodes back online.) I'm
not a Cassandra developer, but this seems like a path which is very
specialized and not very in line with Cassandra's design.

You might have better luck with a distributed store that is not based on
timestamp eventual consistency. I don't know if you can explicitly turn off
timestamps in HBase, but AFAIK the client is allowed to supply them, so you
can just supply zero and they should be compressed out quite well.

Re: Not all data structures need timestamps (and don't require wasted memory).

Posted by David Jeske <da...@gmail.com>.
On Sat, Sep 3, 2011 at 8:26 PM, Kevin Burton <bu...@spinn3r.com> wrote:

> The point is that replication in Cassandra only needs timestamps to handle
> out of order writes … for values that are idempotent, this isn't necessary.
>  The order doesn't matter.
>

I believe this is a misunderstanding of how idempotency applies to
Cassandra replication. If there were no timestamps stored, how would
read-repair work? There would be two different values with no way to tell
which was written second.
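
As a rough illustration, here is a sketch of timestamp-based
reconciliation during a read, assuming each replica is a dict that maps a
key to a (value, timestamp) pair. The names are hypothetical, and real
read repair involves more than this (for example tie-breaking and
consistency levels).

```python
def reconcile(versions):
    """Pick the winning (value, timestamp) pair: the highest timestamp wins."""
    return max(versions, key=lambda version: version[1])


def read_repair(replicas, key):
    """Read a key from all replicas, then push the newest version to each."""
    versions = [replica[key] for replica in replicas if key in replica]
    if not versions:
        return None
    winner = reconcile(versions)
    for replica in replicas:
        replica[key] = winner
    return winner[0]


# Three replicas disagree: one is stale, one is current, one missed the write.
r1 = {"color": ("red", 100)}
r2 = {"color": ("blue", 105)}
r3 = {}
value = read_repair([r1, r2, r3], "color")
```

Drop the second element of each pair and `reconcile` has nothing to
compare: "red" and "blue" are simply two different values.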

Re: Not all data structures need timestamps (and don't require wasted memory).

Posted by Kevin Burton <bu...@spinn3r.com>.
Sure… I'm willing to concede that Cassandra isn't for everyone but why make
it worse than it has to be?

Why 8 bytes?  Why not 64 bytes?

I imagine even in your situation an 8x increase in storage would not be nice ;)

The point is that replication in Cassandra only needs timestamps to handle
out of order writes … for values that are idempotent, this isn't necessary.
 The order doesn't matter.

Adding support to Cassandra for variable-width timestamp resolution (1ms is
probably finer than most uses need) and for turning timestamps off on a
per-tablespace basis could be really handy.

Kevin

On Sat, Sep 3, 2011 at 7:56 PM, Stephen Connolly <
stephen.alan.connolly@gmail.com> wrote:

> Maybe not all NoSQL applications fit Cassandra.
>
> The core logic of how Cassandra achieves eventual consistency depends on
> the per-column timestamps... if they are a pain for you, consider storing
> your data as a small number of fat columns rather than many skinny ones...
> either that or look at a different database for your use case. ;-)
>
> - Stephen
>
> ---
> Sent from my Android phone, so random spelling mistakes, random nonsense
> words and other nonsense are a direct result of using swype to type on the
> screen
> On 3 Sep 2011 16:01, "Kevin Burton" <bu...@spinn3r.com> wrote:
>



-- 

Founder/CEO Spinn3r.com

Location: *San Francisco, CA*
Skype: *burtonator*

Skype-in: *(415) 871-0687*

Re: Not all data structures need timestamps (and don't require wasted memory).

Posted by Stephen Connolly <st...@gmail.com>.
Maybe not all NoSQL applications fit Cassandra.

The core logic of how Cassandra achieves eventual consistency depends on
the per-column timestamps... if they are a pain for you, consider storing
your data as a small number of fat columns rather than many skinny ones...
either that or look at a different database for your use case. ;-)

- Stephen

---
Sent from my Android phone, so random spelling mistakes, random nonsense
words and other nonsense are a direct result of using swype to type on the
screen
On 3 Sep 2011 16:01, "Kevin Burton" <bu...@spinn3r.com> wrote: