Posted to user@cassandra.apache.org by Denis Haskin <de...@haskinferguson.net> on 2010/05/05 06:50:31 UTC

Appropriate use for Cassandra?

I've been reading everything I can get my hands on about Cassandra and
it sounds like a possibly very good framework for our data needs; I'm
about to take the plunge and do some prototyping, but I thought I'd
see if I can get a reality check here on whether it makes sense.

Our schema should be fairly simple; we may only keep our original data
in Cassandra, and the rollups and analyzed results in a relational db
(although this is still open for discussion).

We have fairly small records: 120-150 bytes, in maybe 18 columns.
Data is additive only; we would rarely, if ever, be deleting data.

Our core data set will accumulate at somewhere between 14 and 27
million rows per day; we'll be starting with about a year and a half
of data (7.5 - 15 billion rows) and eventually would like to keep 5
years online (25 to 50 billion rows).  (So that's maybe 1.3TB or so
per year, data only.  Not sure about the overhead yet.)
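
For what it's worth, here's my quick back-of-envelope on that number
(payload bytes only; it ignores replication and any per-column or
SSTable overhead, so treat it as a floor):

    # Rough raw-data volume check; figures are the estimates above.
    rows_per_day  = (14e6, 27e6)    # low / high rows per day
    bytes_per_row = (120, 150)      # low / high payload bytes per row

    for rows, size in zip(rows_per_day, bytes_per_row):
        tb_per_year = rows * size * 365 / 1e12
        print("%.0fM rows/day x %d B/row -> %.2f TB/year" % (rows / 1e6, size, tb_per_year))

    # low:  14M * 120 B * 365 ~= 0.61 TB/year
    # high: 27M * 150 B * 365 ~= 1.48 TB/year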

Ideally we'd like to also have a cluster with our complete data set,
which is maybe 38 billion rows per year (we could live with less than
5 years of that).

I haven't really thought through what the schema's going to be; our
primary key is an entity's ID plus a timestamp.  But there's 2 or 3
other retrieval paths we'll need to support as well.

Thoughts?  Pitfalls?  Gotchas? Are we completely whacked?

Thanks,

-- 
dwh

RE: Appropriate use for Cassandra?

Posted by "Dr. Martin Grabmüller" <Ma...@eleven.de>.
>    http://www.youtube.com/watch?v=eaCCkfjPm0o
 
Thank you. You saved my day.
 
Martin

Re: Appropriate use for Cassandra?

Posted by philip andrew <ph...@gmail.com>.
http://www.youtube.com/watch?v=eaCCkfjPm0o
3:30 song begins
4:00 starfish loves you and Cassandra loves you!

On Thu, May 6, 2010 at 11:03 AM, Denis Haskin <de...@haskinferguson.net> wrote:

> i can haz hints pleez?
>
> On Wed, May 5, 2010 at 9:28 PM, philip andrew <ph...@gmail.com>
> wrote:
> > Starfish loves you.
> >
> > On Wed, May 5, 2010 at 1:16 PM, David Strauss <da...@fourkitchens.com>
> > wrote:
> >>
> >> On 2010-05-05 04:50, Denis Haskin wrote:
> >> > I've been reading everything I can get my hands on about Cassandra and
> >> > it sounds like a possibly very good framework for our data needs; I'm
> >> > about to take the plunge and do some prototyping, but I thought I'd
> >> > see if I can get a reality check here on whether it makes sense.
> >> >
> >> > Our schema should be fairly simple; we may only keep our original data
> >> > in Cassandra, and the rollups and analyzed results in a relational db
> >> > (although this is still open for discussion).
> >>
> >> This is what we do on some projects. This is a particularly nice
> >> strategy if the raw : aggregated ratio is really high or the raw data is
> >> bursty or highly volatile.
> >>
> >> Consider Hadoop integration for your aggregation needs.
> >>
> >> > We have fairly small records: 120-150 bytes, in maybe 18 columns.
> >> > Data is additive only; we would rarely, if ever, be deleting data.
> >>
> >> Cassandra loves you.
> >>
> >> > Our core data set will accumulate at somewhere between 14 and 27
> >> > million rows per day; we'll be starting with about a year and a half
> >> > of data (7.5 - 15 billion rows) and eventually would like to keep 5
> >> > years online (25 to 50 billion rows).  (So that's maybe 1.3TB or so
> >> > per year, data only.  Not sure about the overhead yet.)
> >> >
> >> > Ideally we'd like to also have a cluster with our complete data set,
> >> > which is maybe 38 billion rows per year (we could live with less than
> >> > 5 years of that).
> >> >
> >> > I haven't really thought through what the schema's going to be; our
> >> > primary key is an entity's ID plus a timestamp.  But there's 2 or 3
> >> > other retrieval paths we'll need to support as well.
> >>
> >> Generally, you do multiple retrieval paths through denormalization in
> >> Cassandra.
> >>
> >> > Thoughts?  Pitfalls?  Gotchas? Are we completely whacked?
> >>
> >> Does the random partitioner support what you need?
> >>
> >> --
> >> David Strauss
> >>   | david@fourkitchens.com
> >> Four Kitchens
> >>   | http://fourkitchens.com
> >>   | +1 512 454 6659 [office]
> >>   | +1 512 870 8453 [direct]
> >>
> >
> >
>
>
>
> --
> dwh
>

Re: Appropriate use for Cassandra?

Posted by Denis Haskin <de...@haskinferguson.net>.
i can haz hints pleez?

On Wed, May 5, 2010 at 9:28 PM, philip andrew <ph...@gmail.com> wrote:
> Starfish loves you.
>
> On Wed, May 5, 2010 at 1:16 PM, David Strauss <da...@fourkitchens.com>
> wrote:
>>
>> On 2010-05-05 04:50, Denis Haskin wrote:
>> > I've been reading everything I can get my hands on about Cassandra and
>> > it sounds like a possibly very good framework for our data needs; I'm
>> > about to take the plunge and do some prototyping, but I thought I'd
>> > see if I can get a reality check here on whether it makes sense.
>> >
>> > Our schema should be fairly simple; we may only keep our original data
>> > in Cassandra, and the rollups and analyzed results in a relational db
>> > (although this is still open for discussion).
>>
>> This is what we do on some projects. This is a particularly nice
>> strategy if the raw : aggregated ratio is really high or the raw data is
>> bursty or highly volatile.
>>
>> Consider Hadoop integration for your aggregation needs.
>>
>> > We have fairly small records: 120-150 bytes, in maybe 18 columns.
>> > Data is additive only; we would rarely, if ever, be deleting data.
>>
>> Cassandra loves you.
>>
>> > Our core data set will accumulate at somewhere between 14 and 27
>> > million rows per day; we'll be starting with about a year and a half
>> > of data (7.5 - 15 billion rows) and eventually would like to keep 5
>> > years online (25 to 50 billion rows).  (So that's maybe 1.3TB or so
>> > per year, data only.  Not sure about the overhead yet.)
>> >
>> > Ideally we'd like to also have a cluster with our complete data set,
>> > which is maybe 38 billion rows per year (we could live with less than
>> > 5 years of that).
>> >
>> > I haven't really thought through what the schema's going to be; our
>> > primary key is an entity's ID plus a timestamp.  But there's 2 or 3
>> > other retrieval paths we'll need to support as well.
>>
>> Generally, you do multiple retrieval paths through denormalization in
>> Cassandra.
>>
>> > Thoughts?  Pitfalls?  Gotchas? Are we completely whacked?
>>
>> Does the random partitioner support what you need?
>>
>> --
>> David Strauss
>>   | david@fourkitchens.com
>> Four Kitchens
>>   | http://fourkitchens.com
>>   | +1 512 454 6659 [office]
>>   | +1 512 870 8453 [direct]
>>
>
>



-- 
dwh

Re: Appropriate use for Cassandra?

Posted by philip andrew <ph...@gmail.com>.
Starfish loves you.

On Wed, May 5, 2010 at 1:16 PM, David Strauss <da...@fourkitchens.com> wrote:

> On 2010-05-05 04:50, Denis Haskin wrote:
> > I've been reading everything I can get my hands on about Cassandra and
> > it sounds like a possibly very good framework for our data needs; I'm
> > about to take the plunge and do some prototyping, but I thought I'd
> > see if I can get a reality check here on whether it makes sense.
> >
> > Our schema should be fairly simple; we may only keep our original data
> > in Cassandra, and the rollups and analyzed results in a relational db
> > (although this is still open for discussion).
>
> This is what we do on some projects. This is a particularly nice
> strategy if the raw : aggregated ratio is really high or the raw data is
> bursty or highly volatile.
>
> Consider Hadoop integration for your aggregation needs.
>
> > We have fairly small records: 120-150 bytes, in maybe 18 columns.
> > Data is additive only; we would rarely, if ever, be deleting data.
>
> Cassandra loves you.
>
> > Our core data set will accumulate at somewhere between 14 and 27
> > million rows per day; we'll be starting with about a year and a half
> > of data (7.5 - 15 billion rows) and eventually would like to keep 5
> > years online (25 to 50 billion rows).  (So that's maybe 1.3TB or so
> > per year, data only.  Not sure about the overhead yet.)
> >
> > Ideally we'd like to also have a cluster with our complete data set,
> > which is maybe 38 billion rows per year (we could live with less than
> > 5 years of that).
> >
> > I haven't really thought through what the schema's going to be; our
> > primary key is an entity's ID plus a timestamp.  But there's 2 or 3
> > other retrieval paths we'll need to support as well.
>
> Generally, you do multiple retrieval paths through denormalization in
> Cassandra.
>
> > Thoughts?  Pitfalls?  Gotchas? Are we completely whacked?
>
> Does the random partitioner support what you need?
>
> --
> David Strauss
>   | david@fourkitchens.com
> Four Kitchens
>   | http://fourkitchens.com
>   | +1 512 454 6659 [office]
>   | +1 512 870 8453 [direct]
>
>

Re: Appropriate use for Cassandra?

Posted by Denis Haskin <de...@haskinferguson.net>.
Hmm... I was actually thinking of the inverse of that: 20K rows (one
per entity), with one supercolumn per time-series sample... it would
be something like 700,000 supercolumns (1.5 years, to start with)
growing to maybe 2,400,000 supercolumns.

That may be an issue for our access path needs, however... and may not
even be possible at all: seems to me I've been reading that Cassandra
needs to be able to have an entire supercolumn in memory at once for
deserialization?

Thanks,

dwh


On Wed, May 5, 2010 at 7:47 AM, David Strauss <da...@fourkitchens.com> wrote:
> Given that your current schema has ~18 small columns per row, adding a
> level by using supercolumns may make sense for you because the
> limitation of unserializing a whole supercolumn at once isn't going to
> be a problem for you.
>
> 20K supercolumns per row with ~18 small subcolumns each is completely
> reasonable. The (super)columns within each row will be ordered, and you
> can use the much-easier-to-administer RandomPartitioner.
>
> On 2010-05-05 11:22, Denis Haskin wrote:
>> David -- thanks for the thoughts.
>>
>> In re: your question
>>> Does the random partitioner support what you need?
>>
>> I guess my answer is "I'm not sure yet", but also my initial thought
>> was that we'd use the (or a) OrderPreservingPartitioner so that we
>> could use range scans and that rows for a given entity would be
>> co-located (if I'm understanding Cassandra's storage architecture
>> properly).  But that may be a naive approach.
>>
>> In our core data set, we have maybe 20,000 entities about which we are
>> storing time-series data (and it's fairly well distributed across these
>> entities).  It occurs to me it's also possible to store an entity per row,
>> with the time-series data as (or in?) super columns (and maybe it
>> would make sense to break those out into column families by date range).
>> I'd have to think through a little more what that might mean for our
>> secondary indexing needs.
>>
>> Thanks,
>>
>> dwh
>>
>>
>>
>> On Wed, May 5, 2010 at 1:16 AM, David Strauss <da...@fourkitchens.com> wrote:
>>> On 2010-05-05 04:50, Denis Haskin wrote:
>>>> I've been reading everything I can get my hands on about Cassandra and
>>>> it sounds like a possibly very good framework for our data needs; I'm
>>>> about to take the plunge and do some prototyping, but I thought I'd
>>>> see if I can get a reality check here on whether it makes sense.
>>>>
>>>> Our schema should be fairly simple; we may only keep our original data
>>>> in Cassandra, and the rollups and analyzed results in a relational db
>>>> (although this is still open for discussion).
>>>
>>> This is what we do on some projects. This is a particularly nice
>>> strategy if the raw : aggregated ratio is really high or the raw data is
>>> bursty or highly volatile.
>>>
>>> Consider Hadoop integration for your aggregation needs.
>>>
>>>> We have fairly small records: 120-150 bytes, in maybe 18 columns.
>>>> Data is additive only; we would rarely, if ever, be deleting data.
>>>
>>> Cassandra loves you.
>>>
>>>> Our core data set will accumulate at somewhere between 14 and 27
>>>> million rows per day; we'll be starting with about a year and a half
>>>> of data (7.5 - 15 billion rows) and eventually would like to keep 5
>>>> years online (25 to 50 billion rows).  (So that's maybe 1.3TB or so
>>>> per year, data only.  Not sure about the overhead yet.)
>>>>
>>>> Ideally we'd like to also have a cluster with our complete data set,
>>>> which is maybe 38 billion rows per year (we could live with less than
>>>> 5 years of that).
>>>>
>>>> I haven't really thought through what the schema's going to be; our
>>>> primary key is an entity's ID plus a timestamp.  But there's 2 or 3
>>>> other retrieval paths we'll need to support as well.
>>>
>>> Generally, you do multiple retrieval paths through denormalization in
>>> Cassandra.
>>>
>>>> Thoughts?  Pitfalls?  Gotchas? Are we completely whacked?
>>>
>>> Does the random partitioner support what you need?
>>>
>>> --
>>> David Strauss
>>>   | david@fourkitchens.com
>>> Four Kitchens
>>>   | http://fourkitchens.com
>>>   | +1 512 454 6659 [office]
>>>   | +1 512 870 8453 [direct]
>>>
>>>
>
>
> --
> David Strauss
>   | david@fourkitchens.com
>   | +1 512 577 5827 [mobile]
> Four Kitchens
>   | http://fourkitchens.com
>   | +1 512 454 6659 [office]
>   | +1 512 870 8453 [direct]
>
>



-- 
dwh

Re: Appropriate use for Cassandra?

Posted by David Strauss <da...@fourkitchens.com>.
Given that your current schema has ~18 small columns per row, adding a
level by using supercolumns may make sense for you because the
limitation of unserializing a whole supercolumn at once isn't going to
be a problem for you.

20K supercolumns per row with ~18 small subcolumns each is completely
reasonable. The (super)columns within each row will be ordered, and you
can use the much-easier-to-administer RandomPartitioner.
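
Roughly, in pycassa-style Python (just a sketch; the keyspace, column
family, and field names are placeholders, and I'm assuming a super column
family whose row key is a time bucket with one supercolumn per entity):

    import pycassa

    pool = pycassa.ConnectionPool('YourKeyspace', ['localhost:9160'])
    readings = pycassa.ColumnFamily(pool, 'Readings')  # defined with ColumnType=Super

    # One row per day, one supercolumn per entity, ~18 small subcolumns each.
    row_key = '2010-05-05'
    entity_id = 'entity-00042'
    readings.insert(row_key, {entity_id: {'speed': '87', 'heading': '271', 'temp': '14'}})

    # Pull back all ~18 subcolumns for a single entity on that day:
    sample = readings.get(row_key, super_column=entity_id)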

On 2010-05-05 11:22, Denis Haskin wrote:
> David -- thanks for the thoughts.
> 
> In re: your question
>> Does the random partitioner support what you need?
> 
> I guess my answer is "I'm not sure yet", but also my initial thought
> was that we'd use the (or a) OrderPreservingPartitioner so that we
> could use range scans and that rows for a given entity would be
> co-located (if I'm understanding Cassandra's storage architecture
> properly).  But that may be a naive approach.
> 
> In our core data set, we have maybe 20,000 entities about which we are
> storing time-series data (and it's fairly well distributed across these
> entities).  It occurs to me it's also possible to store an entity per row,
> with the time-series data as (or in?) super columns (and maybe it
> would make sense to break those out into column families by date range).
> I'd have to think through a little more what that might mean for our
> secondary indexing needs.
> 
> Thanks,
> 
> dwh
> 
> 
> 
> On Wed, May 5, 2010 at 1:16 AM, David Strauss <da...@fourkitchens.com> wrote:
>> On 2010-05-05 04:50, Denis Haskin wrote:
>>> I've been reading everything I can get my hands on about Cassandra and
>>> it sounds like a possibly very good framework for our data needs; I'm
>>> about to take the plunge and do some prototyping, but I thought I'd
>>> see if I can get a reality check here on whether it makes sense.
>>>
>>> Our schema should be fairly simple; we may only keep our original data
>>> in Cassandra, and the rollups and analyzed results in a relational db
>>> (although this is still open for discussion).
>>
>> This is what we do on some projects. This is a particularly nice
>> strategy if the raw : aggregated ratio is really high or the raw data is
>> bursty or highly volatile.
>>
>> Consider Hadoop integration for your aggregation needs.
>>
>>> We have fairly small records: 120-150 bytes, in maybe 18 columns.
>>> Data is additive only; we would rarely, if ever, be deleting data.
>>
>> Cassandra loves you.
>>
>>> Our core data set will accumulate at somewhere between 14 and 27
>>> million rows per day; we'll be starting with about a year and a half
>>> of data (7.5 - 15 billion rows) and eventually would like to keep 5
>>> years online (25 to 50 billion rows).  (So that's maybe 1.3TB or so
>>> per year, data only.  Not sure about the overhead yet.)
>>>
>>> Ideally we'd like to also have a cluster with our complete data set,
>>> which is maybe 38 billion rows per year (we could live with less than
>>> 5 years of that).
>>>
>>> I haven't really thought through what the schema's going to be; our
>>> primary key is an entity's ID plus a timestamp.  But there's 2 or 3
>>> other retrieval paths we'll need to support as well.
>>
>> Generally, you do multiple retrieval paths through denormalization in
>> Cassandra.
>>
>>> Thoughts?  Pitfalls?  Gotchas? Are we completely whacked?
>>
>> Does the random partitioner support what you need?
>>
>> --
>> David Strauss
>>   | david@fourkitchens.com
>> Four Kitchens
>>   | http://fourkitchens.com
>>   | +1 512 454 6659 [office]
>>   | +1 512 870 8453 [direct]
>>
>>


-- 
David Strauss
   | david@fourkitchens.com
   | +1 512 577 5827 [mobile]
Four Kitchens
   | http://fourkitchens.com
   | +1 512 454 6659 [office]
   | +1 512 870 8453 [direct]


Re: Appropriate use for Cassandra?

Posted by Denis Haskin <de...@haskinferguson.net>.
David -- thanks for the thoughts.

In re: your question
> Does the random partitioner support what you need?

I guess my answer is "I'm not sure yet", but also my initial thought
was that we'd use the (or a) OrderPreservingPartitioner so that we
could use range scans and that rows for a given entity would be
co-located (if I'm understanding Cassandra's storage architecture
properly).  But that may be a naive approach.

In our core data set, we have maybe 20,000 entities about which we are
storing time-series data (and it's fairly well distributed across these
entities).  It occurs to me it's also possible to store an entity per row,
with the time-series data as (or in?) super columns (and maybe it
would make sense to break those out into column families by date range).
I'd have to think through a little more what that might mean for our
secondary indexing needs.
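
To make that concrete, here's the rough shape of what I'm picturing
(pycassa-style Python; the column family and field names are made up, and
I'm using sortable string timestamps as supercolumn names):

    import pycassa

    pool = pycassa.ConnectionPool('YourKeyspace', ['localhost:9160'])
    # Super column family: row key = entity ID, supercolumn name = timestamp,
    # subcolumns = the ~18 small fields of one sample.
    samples = pycassa.ColumnFamily(pool, 'SamplesByEntity')

    entity = 'entity-00042'
    ts = '2010-05-05T06:50:31'
    samples.insert(entity, {ts: {'speed': '87', 'heading': '271'}})

    # Time-range query within one entity's row: (super)columns are ordered
    # by name, so a column slice covers this even under RandomPartitioner;
    # OrderPreservingPartitioner would only be needed for range scans
    # *across* row keys.
    may_samples = samples.get(entity,
                              column_start='2010-05-01',
                              column_finish='2010-05-31',
                              column_count=1000)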

Thanks,

dwh



On Wed, May 5, 2010 at 1:16 AM, David Strauss <da...@fourkitchens.com> wrote:
> On 2010-05-05 04:50, Denis Haskin wrote:
>> I've been reading everything I can get my hands on about Cassandra and
>> it sounds like a possibly very good framework for our data needs; I'm
>> about to take the plunge and do some prototyping, but I thought I'd
>> see if I can get a reality check here on whether it makes sense.
>>
>> Our schema should be fairly simple; we may only keep our original data
>> in Cassandra, and the rollups and analyzed results in a relational db
>> (although this is still open for discussion).
>
> This is what we do on some projects. This is a particularly nice
> strategy if the raw : aggregated ratio is really high or the raw data is
> bursty or highly volatile.
>
> Consider Hadoop integration for your aggregation needs.
>
>> We have fairly small records: 120-150 bytes, in maybe 18 columns.
>> Data is additive only; we would rarely, if ever, be deleting data.
>
> Cassandra loves you.
>
>> Our core data set will accumulate at somewhere between 14 and 27
>> million rows per day; we'll be starting with about a year and a half
>> of data (7.5 - 15 billion rows) and eventually would like to keep 5
>> years online (25 to 50 billion rows).  (So that's maybe 1.3TB or so
>> per year, data only.  Not sure about the overhead yet.)
>>
>> Ideally we'd like to also have a cluster with our complete data set,
>> which is maybe 38 billion rows per year (we could live with less than
>> 5 years of that).
>>
>> I haven't really thought through what the schema's going to be; our
>> primary key is an entity's ID plus a timestamp.  But there's 2 or 3
>> other retrieval paths we'll need to support as well.
>
> Generally, you do multiple retrieval paths through denormalization in
> Cassandra.
>
>> Thoughts?  Pitfalls?  Gotchas? Are we completely whacked?
>
> Does the random partitioner support what you need?
>
> --
> David Strauss
>   | david@fourkitchens.com
> Four Kitchens
>   | http://fourkitchens.com
>   | +1 512 454 6659 [office]
>   | +1 512 870 8453 [direct]
>
>

Re: Appropriate use for Cassandra?

Posted by David Strauss <da...@fourkitchens.com>.
On 2010-05-05 04:50, Denis Haskin wrote:
> I've been reading everything I can get my hands on about Cassandra and
> it sounds like a possibly very good framework for our data needs; I'm
> about to take the plunge and do some prototyping, but I thought I'd
> see if I can get a reality check here on whether it makes sense.
> 
> Our schema should be fairly simple; we may only keep our original data
> in Cassandra, and the rollups and analyzed results in a relational db
> (although this is still open for discussion).

This is what we do on some projects. This is a particularly nice
strategy if the raw : aggregated ratio is really high or the raw data is
bursty or highly volatile.

Consider Hadoop integration for your aggregation needs.

> We have fairly small records: 120-150 bytes, in maybe 18 columns.
> Data is additive only; we would rarely, if ever, be deleting data.

Cassandra loves you.

> Our core data set will accumulate at somewhere between 14 and 27
> million rows per day; we'll be starting with about a year and a half
> of data (7.5 - 15 billion rows) and eventually would like to keep 5
> years online (25 to 50 billion rows).  (So that's maybe 1.3TB or so
> per year, data only.  Not sure about the overhead yet.)
> 
> Ideally we'd like to also have a cluster with our complete data set,
> which is maybe 38 billion rows per year (we could live with less than
> 5 years of that).
> 
> I haven't really thought through what the schema's going to be; our
> primary key is an entity's ID plus a timestamp.  But there's 2 or 3
> other retrieval paths we'll need to support as well.

Generally, you do multiple retrieval paths through denormalization in
Cassandra.
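
As a rough sketch of what that looks like in practice (pycassa-style
Python; the column family names and the second retrieval path are
invented for illustration), you simply write each record once per path:

    import pycassa

    pool = pycassa.ConnectionPool('YourKeyspace', ['localhost:9160'])
    # One (super) column family per retrieval path -- illustrative names.
    by_entity = pycassa.ColumnFamily(pool, 'SamplesByEntity')  # row key: entity ID
    by_region = pycassa.ColumnFamily(pool, 'SamplesByRegion')  # row key: region code

    def record_sample(entity_id, region, ts, fields):
        # Denormalized write: the same sample goes into both column families,
        # keyed the way each query path wants to read it.
        by_entity.insert(entity_id, {ts: fields})
        by_region.insert(region, {'%s:%s' % (ts, entity_id): fields})

    record_sample('entity-00042', 'region-07', '2010-05-05T06:50:31',
                  {'speed': '87', 'heading': '271'})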

> Thoughts?  Pitfalls?  Gotchas? Are we completely whacked?

Does the random partitioner support what you need?

-- 
David Strauss
   | david@fourkitchens.com
Four Kitchens
   | http://fourkitchens.com
   | +1 512 454 6659 [office]
   | +1 512 870 8453 [direct]