Posted to user@cassandra.apache.org by Jim Ancona <ji...@anconafamily.com> on 2017/04/04 20:58:57 UTC

Re: Effective partition key for time series data, which allows range queries?

The typical recommendation for maximum partition size is on the order of
100 MB and/or 100,000 rows. That's not a hard limit, but you may be setting
yourself up for issues as you approach or exceed those numbers.

If you need to reduce partition size, the typical way to do this is by
"bucketing," that is, adding a synthetic column to the partition key to
split the data into separate buckets. In your example above, I assume
that your current primary key is (user, timestamp), where user is the
partition key and timestamp is the clustering key. Say that you want to
store a maximum of a year's worth of data in a partition. You would make
the primary key be ((user, year), timestamp). The partition key is now
(user, year), where year is the year part of the timestamp. Now if you
want to query the data for the last month, you would do:

select * from data where user_id = 'foo' and year = 2017 and timestamp >=
'<1 month ago>' and timestamp <= '<today>'


If you wanted to get the data for the last 6 months, you'd do something
like:

select * from data where user_id = 'foo' and year in (2016, 2017) and
timestamp >= '<6 months ago>' and timestamp <= '<today>'  (Notice that
because the query spans two years, you have to include both years in the
select criteria so that C* knows which partitions to query.)
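
In application code, the list of years for the IN clause can be derived from
the query's time range. A quick sketch in Python (the function name is just
illustrative, not something from this thread):

from datetime import date

def years_between(start, end):
    # Every year bucket the query's time range touches, for the IN clause.
    return list(range(start.year, end.year + 1))

# e.g. years_between(date(2016, 10, 4), date(2017, 4, 4)) == [2016, 2017]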


You can make the buckets smaller (e.g. weeks, days, or hours instead of
years), but of course querying multiple buckets is less efficient, so it's
worth making your buckets as large as you can while staying within the size
guidelines above.
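
For concreteness, the table behind the queries above might be defined roughly
like this (a sketch only; the foo and bar value columns and their types are
assumptions based on the original question):

create table data (
    user_id text,
    year int,
    timestamp timestamp,
    foo int,
    bar int,
    primary key ((user_id, year), timestamp)
) with clustering order by (timestamp desc);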

Hope this helps!

Jim




On Mon, Mar 27, 2017 at 8:47 PM, Ali Akhtar <al...@gmail.com> wrote:

> I have a use case where the data for individual users is being tracked,
> and every 15 minutes or so, the data for the past 15 minutes is inserted
> into the table.
>
> The table schema looks like:
> user id, timestamp, foo, bar, etc.
>
> Where foo, bar, etc are the items being tracked, and their values over the
> past 15 minutes.
>
> I initially planned to use the user id as the primary key of the table.
> But I realized that this may cause really wide rows (tracking for 24
> hours means 96 records inserted (1 for each 15 min window); over 1 year
> this means about 35k records per user, over 2 years about 70k, etc).
>
> I know the theoretical limit on wide rows is in the billions of records,
> but I've heard that the practical limit is much lower.
>
> So I considered using a composite primary key: (user, timestamp)
>
> If I'm correct, the above should create a new row for each user &
> timestamp logged.
>
> However, will I still be able to do range queries on the timestamp, to
> e.g. return the data for the last week?
> 
> E.g. select * from data where user_id = 'foo' and timestamp >= '<1 month
> ago>' and timestamp <= '<today>' ?
>
>

Re: Effective partition key for time series data, which allows range queries?

Posted by Jim Ancona <ji...@anconafamily.com>.
That's an interesting refinement! I'll keep it in mind the next time this
sort of thing comes up.

Jim


Re: Effective partition key for time series data, which allows range queries?

Posted by Eric Stevens <mi...@gmail.com>.
Jim's basic model is similar to how we've solved this exact kind of problem
many times.  From my own experience, I strongly recommend that you make a
`bucket` field in the partition key, and a `time` field in the clustering
key.  Make both of these of data type `timestamp`.  Then use application
logic to floor the bucket to an appropriate interval according to your
chosen bucket size.
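
As a concrete illustration of that layout, the table might look something like
this (a sketch only; the table name, the user_id column and the value columns
are assumptions carried over from the earlier messages, not something Eric
specified):

create table data (
    user_id text,
    bucket timestamp,   -- event time floored to the current bucket interval
    time timestamp,     -- the actual event time
    foo int,
    bar int,
    primary key ((user_id, bucket), time)
) with clustering order by (time desc);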

The reason is that as your data needs grow, the one thing you can be pretty
confident in is that your assumptions about data density per partition will
turn out to be eventually wrong.  This is either because of expanding
requirements (you're adding new fields to this table), because of increased
application usage (you're being successful!), or because a use case turns
out to have a different data density per bucket than you anticipated (you're
not prescient).

It's easy in application code to adjust your timestamp interval if your
keying allows for arbitrary adjustments.  Most often you're going to end up
making smaller buckets as your needs progress.  The upshot is that with a
little careful selection of bucketing strategy, partition key range
iterations are still correct if you adjust from, say, a 24 hour bucket to a
12 hour, 6 hour, 3 hour, 1 hour, 30 minute, 15 minute, or 1 minute bucket.
The data written under the larger bucket size still lands on a smaller
bucket interval, so you don't really even need to use complex logic in the
application to adapt to the new bucket size.  You definitely don't want to
paint yourself into a corner where you need a smaller bucket size but your
data model didn't leave room for it.
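
To make the flooring step concrete, here is a minimal sketch in Python (the
bucket size and names are illustrative, not from this thread):

from datetime import datetime, timezone

BUCKET_SECONDS = 6 * 60 * 60  # current bucket size (6 hours); can shrink later

def bucket_for(event_time):
    # event_time should be a timezone-aware datetime.  Floor it to the start
    # of its bucket: the floored value goes in the bucket partition key
    # column, while the raw event_time goes in the time clustering column.
    epoch = int(event_time.timestamp())
    return datetime.fromtimestamp(epoch - epoch % BUCKET_SECONDS, tz=timezone.utc)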
