Posted to user@cassandra.apache.org by Robert Wille <rw...@fold3.com> on 2014/11/23 16:41:36 UTC

Getting the counters with the highest values

I’m working on moving a bunch of counters out of our relational database to Cassandra. For the most part, Cassandra is a very nice fit, except for one feature on our website. We manage a time series of view counts for each document, and display a list of the most popular documents in the last seven days. This seems like a pretty strong anti-pattern for Cassandra, but also seems like something a lot of people would want to do. If you’re keeping counters, it’s pretty likely that you’d want to know which ones have the highest counts.

Here’s what I came up with to implement this feature. Create a counter table with primary key (doc_id, day) and a single counter. Whenever a document is viewed, increment the counter for the document for today and the previous six days. Sometime after midnight each day, compile the counters into a table with primary key (day, count, doc_id) and no additional columns. For each partition in the counter table, I would sum up the counters, delete any counters that are over a week old, and put the sum into the second table with day = today. When I query the table, I would ask for data where day = yesterday. During the compilation process, I would delete old partitions. In theory I’d only need two partitions: one that is being built, and one for querying.
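
To make that concrete, here is a rough sketch of the two tables and the
write path, using the DataStax Python driver. All names are made up, and
day is stored as a plain ISO date string for simplicity:

    # Sketch of the scheme above; table/column names are illustrative.
    from datetime import date, timedelta
    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('views')  # assumed host/keyspace

    # Raw counters: one partition per document, one counter row per day.
    session.execute("""
        CREATE TABLE IF NOT EXISTS doc_view_counts (
            doc_id bigint,
            day    text,
            views  counter,
            PRIMARY KEY ((doc_id), day)
        )""")

    # Compiled nightly: one partition per day, clustered by count so the
    # most-viewed documents come back first.
    session.execute("""
        CREATE TABLE IF NOT EXISTS popular_docs (
            day    text,
            count  bigint,
            doc_id bigint,
            PRIMARY KEY ((day), count, doc_id)
        ) WITH CLUSTERING ORDER BY (count DESC, doc_id ASC)""")

    def record_view(doc_id):
        # Per the scheme above, each view increments seven day-counters:
        # today's and each of the previous six days'.
        today = date.today()
        for offset in range(7):
            day = (today - timedelta(days=offset)).isoformat()
            session.execute(
                "UPDATE doc_view_counts SET views = views + 1 "
                "WHERE doc_id = %s AND day = %s",
                (doc_id, day))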

I’d be interested to hear critiques on this strategy, as well as hearing how other people have implemented a "most-popular" feature using Cassandra counters.

Robert


Re: Getting the counters with the highest values

Posted by Eric Stevens <mi...@gmail.com>.
> We have too many documents per day to materialize in memory, so querying
> per day and aggregating the results isn’t really possible.

You don't really need to; that's part of the point.  You can paginate
across a partition with most client drivers, and materializing this view
is just copying data from one table to another with a different layout.
So you end up just having to read and then write a few thousand records
at a time.
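
For example (a minimal sketch with the DataStax Python driver, reusing a
session and the illustrative table names from elsewhere in this thread;
the driver pages transparently as you iterate):

    from cassandra.query import SimpleStatement

    day = '2014-11-23'  # illustrative
    # Page through one day's partition; the driver fetches fetch_size rows
    # at a time, so the whole day never sits in client memory.
    stmt = SimpleStatement(
        "SELECT doc_id, views FROM running_totals WHERE day = %s",
        fetch_size=5000)
    for row in session.execute(stmt, (day,)):
        # Copy into the layout ordered by count (the "materialized" view).
        session.execute(
            "INSERT INTO doc_counts (day, count, doc_id) VALUES (%s, %s, %s)",
            (day, row.views, row.doc_id))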

Using doc_id as the partitioning key and day as the clustering key means
that you have to iterate over documents from some outside knowledge (a
definitive source for what the set of doc_ids is), and reading so many
separate partitions (one per doc_id) will produce memory pressure in your
cluster.  Compare that against ((day), doc_id), where you can simply
SELECT * WHERE day=?.  Your approach would, though, give you a very nice
time series of views per document, which itself might be useful elsewhere
in your application.

That said, there are physical and practical limits on the number of
columns you can have in a partition (2 billion physical; depending on the
data, sometimes just on the order of a few hundred thousand practical
before causing trouble in areas such as compaction, repair, and
bootstrap).

However, I still suspect you may benefit from keying the counters table
primarily by date, but with another key rotator in there, like ((day,
subpartition), doc_id).  Compute your subpartition deterministically but
in an evenly distributed manner from your doc_id (e.g., doc_id mod 16, or
md5(doc_id).take(2), etc., depending on the data type of your doc_id); a
sketch follows.
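
A sketch of such a bucket function (the bucket count of 16 is an
arbitrary choice):

    import hashlib

    NUM_SUBPARTITIONS = 16  # illustrative; size to your write volume

    def subpartition(doc_id):
        # Deterministic, evenly distributed bucket for a doc_id.
        if isinstance(doc_id, int):
            return doc_id % NUM_SUBPARTITIONS
        # Non-numeric ids: hash first (md5 here is for distribution only).
        digest = hashlib.md5(str(doc_id).encode('utf-8')).digest()
        return digest[0] % NUM_SUBPARTITIONS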

This will split your single logical partition up across n physical
partitions and opens you up to parallel processing of those partitions
(materializing can get a lot faster, as you can very easily have a single
worker working on each physical partition, with reduced hotspotting).
Each worker can be assigned a day and subpartition and work exclusively
on that data set for materialization (or a single worker can iterate the
subpartitions for the same effect); see the sketch below.
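
Continuing the sketches above, one worker per (day, subpartition) might
look like this.  The doc_views table is assumed to be keyed ((day,
subpart), doc_id), and ThreadPoolExecutor is just one way to run the
workers:

    from concurrent.futures import ThreadPoolExecutor
    from cassandra.query import SimpleStatement

    def compile_subpartition(day, subpart):
        # Each worker owns exactly one physical partition for the day.
        stmt = SimpleStatement(
            "SELECT doc_id, views FROM doc_views "
            "WHERE day = %s AND subpart = %s",
            fetch_size=5000)
        for row in session.execute(stmt, (day, subpart)):
            session.execute(
                "INSERT INTO doc_counts (day, count, doc_id) "
                "VALUES (%s, %s, %s)",
                (day, row.views, row.doc_id))

    # The driver session is thread-safe, so workers can share it.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for sp in range(NUM_SUBPARTITIONS):
            pool.submit(compile_subpartition, '2014-11-23', sp)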

Now, to make a too-long email even longer: if you're approaching
practical limits on using doc_id as part of a clustering key, your
materialized view is going to have similar issues.  So you may have to
either materialize only documents with a view count over a certain
threshold, or engage in a similar subpartitioning scheme there.


Re: Getting the counters with the highest values

Posted by Robert Wille <rw...@fold3.com>.
We do get a large number of documents getting counts each day, which is why I’m thinking the running totals table should be ((doc_id), day) rather than ((day), doc_id). We have too many documents per day to materialize in memory, so querying per day and aggregating the results isn’t really possible.

I’m planning on bucketing the materialized ordering because we get enough unique document views per day that the rows will be quite large. Not so large as to be unmanageable, but pushing the limits. If we were so lucky as to get a significant increase in traffic, I might start having issues. I didn’t include bucketing in my post because I didn’t want to complicate my question. I hadn’t considered that I could bucket by hour and then use a local midnight instead of a global midnight. Interesting idea.

Thanks for your response.

Robert


Re: Getting the counters with the highest values

Posted by Eric Stevens <mi...@gmail.com>.
You're right that there's no way to use the counter data type to
materialize a view ordered by the counter.  Computing this post hoc is the
way to go if your needs allow for it (if not, something like Summingbird or
vanilla Storm may be necessary).

I might suggest that you make the primary key of your running totals by
day table ((day), doc_id), because it will make it easy to compute the
materialized ordered view (SELECT doc_id, count FROM running_totals WHERE
day=?), unless you expect to have a really large number of documents
getting counts each day.

For your materialized ordering, I'd suggest a primary key of ((day),
count), as then for a given day you'll be able to select the top
documents by count (SELECT count, doc_id FROM doc_counts WHERE day=?
ORDER BY count DESC).  A sketch of both tables follows.
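
Hypothetical DDL for those two layouts (doc_id is included in the
materialized key so two documents with the same count for a day don't
overwrite each other; session as in the sketches above):

    session.execute("""
        CREATE TABLE IF NOT EXISTS running_totals (
            day    text,
            doc_id bigint,
            views  counter,
            PRIMARY KEY ((day), doc_id)
        )""")

    session.execute("""
        CREATE TABLE IF NOT EXISTS doc_counts (
            day    text,
            count  bigint,
            doc_id bigint,
            PRIMARY KEY ((day), count, doc_id)
        ) WITH CLUSTERING ORDER BY (count DESC, doc_id ASC)""")

    # Top ten documents for a day, highest counts first:
    top = session.execute(
        "SELECT count, doc_id FROM doc_counts WHERE day = %s LIMIT 10",
        ('2014-11-23',))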

One more thing to consider, if your users are not all in a single
timezone, is making your time bucket an hour instead of a day, so that
you can answer the by-day question relative to local midnight (except for
the handful of locations that use half-hour timezone offsets) instead of
a single global midnight.  You can then either include just each hour in
each row (and aggregate at read time), or make each row a rolling 24
hours (aggregating at write time), depending on which use case fits your
needs better.
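
For the aggregate-at-read-time variant, a sketch (running_totals_by_hour
is a hypothetical hourly table keyed ((hour), doc_id); this reads 24
partitions, so it only suits data sets small enough to aggregate
client-side):

    from collections import Counter
    from datetime import datetime, timedelta

    def top_docs_last_24h(limit=10, now=None):
        # Sum 24 hourly buckets at read time, then rank client-side.
        now = now or datetime.utcnow()
        totals = Counter()
        for h in range(24):
            bucket = (now - timedelta(hours=h)).strftime('%Y-%m-%dT%H')
            rows = session.execute(
                "SELECT doc_id, views FROM running_totals_by_hour "
                "WHERE hour = %s",
                (bucket,))
            for row in rows:
                totals[row.doc_id] += row.views
        return totals.most_common(limit)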
