Posted to user@cassandra.apache.org by "Christopher J. Bottaro" <cj...@academicworks.com> on 2013/06/19 19:05:36 UTC

Date range queries

Hello,

We are considering using Cassandra and I want to make sure our use case
fits Cassandra's strengths.  We have a table like:

answers
-------
user_id | question_id | result | created_at

Where our most common query will be something like:

SELECT * FROM answers WHERE user_id = 123 AND created_at > '01/01/2012' AND
created_at < '01/01/2013'

Sometimes we will also limit by a question_id or a list of question_ids.

Secondary indexes will be created on user_id and question_id.  We expect
the upper bound of number of answers for a given user to be around 10,000.

Now my understanding of how Cassandra will run the aforementioned query is
that it will load all the answers for a given user into memory using the
secondary index, then scan over that set filtering based on the dates.

Considering that this will be our most common query and it will run very
often, is this a bad use case for Cassandra?

Thanks for the help.

Re: Date range queries

Posted by Oleksandr Petrov <ol...@gmail.com>.
Maybe I'm a bit late to the party, but this may still be useful for future
reference.

We've tried to keep the documentation for the Clojure Cassandra driver as
thorough and generic as possible, and it contains raw CQL examples, so you
can refer to the docs even if you're using another driver.

Here's a range query guide:
http://clojurecassandra.info/articles/kv.html#toc_8 which also covers
ordering a result set.  One more thing that may be useful is the data
modelling guide:
http://clojurecassandra.info/articles/data_modelling.html#toc_2 which
describes the use of compound keys (directly related to range queries,
too).





-- 
alex p

Re: Date range queries

Posted by Colin Blower <cb...@barracuda.com>.
You could separate the history data from the current data: when a user's
result is updated, just write into both tables.

CREATE TABLE all_answers (
  user_id uuid,
  created timeuuid,
  result text,
  question_id varint,
  PRIMARY KEY (user_id, created)
);

CREATE TABLE current_answers (
  user_id uuid,
  question_id varint,
  created timeuuid,
  result text,
  PRIMARY KEY (user_id, question_id)
);


> select * FROM current_answers;
 user_id                              | question_id | result | created
--------------------------------------+-------------+--------+--------------------------------------
 11b1e59c-ddfa-11e2-a28f-0800200c9a66 |           1 |     no | f9893ee0-ddfa-11e2-b74c-35d7be46b354
 11b1e59c-ddfa-11e2-a28f-0800200c9a66 |           2 |   blah | f7af75d0-ddfa-11e2-b74c-35d7be46b354

> select * FROM all_answers;
 user_id                              | created                              | question_id | result
--------------------------------------+--------------------------------------+-------------+--------
 11b1e59c-ddfa-11e2-a28f-0800200c9a66 | f0141234-ddfa-11e2-b74c-35d7be46b354 |           1 |    yes
 11b1e59c-ddfa-11e2-a28f-0800200c9a66 | f7af75d0-ddfa-11e2-b74c-35d7be46b354 |           2 |   blah
 11b1e59c-ddfa-11e2-a28f-0800200c9a66 | f9893ee0-ddfa-11e2-b74c-35d7be46b354 |           1 |     no

This way you can get the history of answers if you want and there is a
simple way to get the most current answers.
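
A minimal Python sketch of the dual-write pattern above, with plain dicts
and lists standing in for the two tables (not a real driver API):

```python
# Sketch: simulate all_answers (full history) and current_answers
# (one row per user/question, overwritten on each update).
from uuid import uuid1

all_answers = []       # stands in for the all_answers table
current_answers = {}   # stands in for current_answers, keyed by its PK

def save_answer(user_id, question_id, result):
    created = uuid1()  # timeuuid-style version marker
    # write 1: append a new version to the history
    all_answers.append((user_id, created, question_id, result))
    # write 2: overwrite the (user_id, question_id) row
    current_answers[(user_id, question_id)] = (created, result)

save_answer("u1", 1, "yes")
save_answer("u1", 2, "blah")
save_answer("u1", 1, "no")

assert current_answers[("u1", 1)][1] == "no"  # latest answer wins
assert len(all_answers) == 3                  # full history is kept
```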

Just a thought.
-Colin B.




Re: Date range queries

Posted by "Christopher J. Bottaro" <cj...@academicworks.com>.
Yes, that makes sense and that article helped a lot, but I still have a few
questions...

The created_at in our answers table is basically used as a version id.
 When a user updates his answer, we don't overwrite the old answer, but
rather insert a new answer with a more recent timestamp (the version).

answers
-------
user_id | created_at | question_id | result
-------------------------------------------
      1 | 2013-01-01 | 1           | yes
      1 | 2013-01-01 | 2           | blah
      1 | 2013-01-02 | 1           | no

So the queries we really want to run are "find me all the answers for a
given user at a given time."  So given the date of 2013-01-02 and user_id
1, we would want rows 2 and 3 returned (since row 3 obsoletes row 1).  Is
it possible to do this with CQL given the current schema?

As an aside, we can do this in PostgreSQL using window functions; not
standard SQL, but pretty neat.
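
A small Python sketch (hypothetical helper, mirroring the versioned table
above) of the "answers as of a date" selection that a window function would
perform, i.e. the latest version per question at or before the given date:

```python
# Sketch: pick, per question_id, the most recent answer whose
# created_at is at or before the as-of date.
from datetime import date

answers = [
    (1, date(2013, 1, 1), 1, "yes"),
    (1, date(2013, 1, 1), 2, "blah"),
    (1, date(2013, 1, 2), 1, "no"),
]

def answers_as_of(rows, user_id, as_of):
    latest = {}  # question_id -> (created_at, result)
    for uid, created_at, qid, result in rows:
        if uid != user_id or created_at > as_of:
            continue
        if qid not in latest or created_at > latest[qid][0]:
            latest[qid] = (created_at, result)
    return latest

result = answers_as_of(answers, 1, date(2013, 1, 2))
assert result[1] == (date(2013, 1, 2), "no")   # row 3 obsoletes row 1
assert result[2] == (date(2013, 1, 1), "blah")
```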

We can alter our schema like so...

answers
-------
user_id | start_at | end_at | question_id | result

Where the start_at and end_at denote when an answer is active.  So the
example above would become:

answers
-------
user_id | start_at   | end_at     | question_id | result
--------------------------------------------------------
      1 | 2013-01-01 | 2013-01-02 | 1           | yes
      1 | 2013-01-01 | null       | 2           | blah
      1 | 2013-01-02 | null       | 1           | no

Now we can query "SELECT * FROM answers WHERE user_id = 1 AND start_at <=
'2013-01-02' AND (end_at > '2013-01-02' OR end_at IS NULL)".
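
As a sanity check, here's a small Python sketch (hypothetical data) of the
interval predicate; note that to return rows 2 and 3 from the example, the
filter needs start_at on the lower side (start_at <= date) and end_at on
the upper side (end_at > date, or null):

```python
# Sketch: an answer is "active at" a date if its validity interval
# [start_at, end_at) covers that date (a null end_at means still open).
from datetime import date

rows = [
    (1, date(2013, 1, 1), date(2013, 1, 2), 1, "yes"),
    (1, date(2013, 1, 1), None,             2, "blah"),
    (1, date(2013, 1, 2), None,             1, "no"),
]

def active_at(rows, user_id, at):
    return [r for r in rows
            if r[0] == user_id
            and r[1] <= at
            and (r[2] is None or r[2] > at)]

hits = active_at(rows, 1, date(2013, 1, 2))
assert [(r[3], r[4]) for r in hits] == [(2, "blah"), (1, "no")]
```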

How would one define the partition key and clustering columns in CQL to
accomplish this?  Is it as simple as PRIMARY KEY (user_id, start_at,
end_at, question_id) (remembering that we sometimes want to limit by
question_id)?

Also, we are a bit worried about race conditions.  Consider two separate
processes updating an answer for a given user_id / question_id.  There will
be a race condition between the two to update the correct row's end_at
field.  Does that make sense?  I can draw it out with ASCII tables, but I
feel like this email is already too long... :P
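
The race can be sketched in Python by interleaving the reads and writes of
two simulated processes (run sequentially here, purely for illustration):

```python
# Sketch of the lost-update race: two processes both read the current
# open row (end_at is None) before either one writes, then both close
# it and insert their own new version.
rows = [{"user_id": 1, "question_id": 1, "start_at": "2013-01-02",
         "end_at": None, "result": "no"}]

def read_open_row(question_id):
    return next(r for r in rows
                if r["question_id"] == question_id and r["end_at"] is None)

# Both processes read before either writes:
seen_by_a = read_open_row(1)
seen_by_b = read_open_row(1)

# Process A closes the row and inserts its new version:
seen_by_a["end_at"] = "2013-01-03"
rows.append({"user_id": 1, "question_id": 1, "start_at": "2013-01-03",
             "end_at": None, "result": "maybe"})

# Process B does the same, unaware of A's write; it re-closes the old
# row instead of A's new one:
seen_by_b["end_at"] = "2013-01-03"
rows.append({"user_id": 1, "question_id": 1, "start_at": "2013-01-03",
             "end_at": None, "result": "yes"})

# Two rows are now simultaneously "open" for the same question:
open_rows = [r for r in rows if r["end_at"] is None]
assert len(open_rows) == 2  # the history is now inconsistent
```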

Thanks for the help.




Re: Date range queries

Posted by David McNelis <dm...@gmail.com>.
So, if you want to grab by the created_at and occasionally limit by
question id, that is why you'd use created_at.

The way primary keys work is that the first part of the primary key is the
partition key; that field essentially defines a single Cassandra row.  The
second key is the order-preserving key, so you can sort by that key.  If
you have a third piece, then that is the secondary order-preserving key.

The reason you'd want (user_id, created_at, question_id) is that when you
query on the keys, you MUST use the preceding pieces of the primary key.
So in your case, with the user-created-question key you could not do a
query with just user_id and question_id.  Alternatively, if you went with
(user_id, question_id, created_at), you would not be able to include a
range on created_at unless you were also filtering on question_id.
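
One way to picture the prefix rule: within a partition, cells are stored
sorted by the clustering columns, so a range on the first clustering column
is a contiguous slice, while a filter on a later column alone would have to
touch every row.  A small Python sketch:

```python
# Sketch: rows in one partition, sorted by (created_at, question_id)
# as Cassandra would store them with that clustering order.
rows = sorted([
    ("2013-01-01", 1), ("2013-01-01", 2),
    ("2013-01-02", 1), ("2013-01-03", 2),
])

# Range on the first clustering column: one contiguous slice.
slice_ = [r for r in rows if "2013-01-01" <= r[0] <= "2013-01-02"]
assert slice_ == [("2013-01-01", 1), ("2013-01-01", 2), ("2013-01-02", 1)]

# Filtering on question_id alone would hit scattered positions, which
# is why CQL rejects it without the preceding key piece.
```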

Does that make sense?

As for the large rows, 10k is unlikely to cause you too many issues (unless
the answer is potentially a big blob of text).  Newer versions of Cassandra
handle a lot of things in far, far superior ways to < 1.0.

For a really good primer on keys in CQL and how to avoid hot rows, read
this article: http://thelastpickle.com/2013/01/11/primary-keys-in-cql/
Aaron did a great job of laying out the subtleties of primary keys in CQL.



Re: Date range queries

Posted by "Christopher J. Bottaro" <cj...@academicworks.com>.
Interesting, thank you for the reply.

Two questions though...

Why should created_at come before question_id in the primary key?  In other
words, why (user_id, created_at, question_id) instead of (user_id,
question_id, created_at)?

Given this setup, all a user's answers (all 10k) will be stored in a single
C* (internal, not CQL) row?  I thought having "fat" or "big" rows was bad.
I worked with Cassandra 0.6 at my previous job and given the nature of our
work, we would sometimes generate these "fat" rows... at which point
Cassandra would basically shit the bed.

Thanks for the help.



Re: Date range queries

Posted by David McNelis <dm...@gmail.com>.
I think you'd be better served with a slightly different primary key.

If your primary key was (user_id, created_at)  or (user_id, created_at,
question_id), then you'd be able to run the above query without a problem.

This will mean that the entire history for a specific user_id will be
stored as a 'row' (in the old-style C* vernacular), and the information
will be ordered by the 2nd piece of the primary key (or the 2nd, then 3rd
if you include question_id).

You would certainly want to include any field that makes a record unique in
the primary key.  Another thing to note is that if a field is part of the
primary key you can not create a secondary index on that field.  You can
work around that by storing the field twice, but you might want to rethink
your structure if you find yourself doing that often.

