
Posted to user@cassandra.apache.org by Voytek Jarnot <vo...@gmail.com> on 2016/12/27 15:42:17 UTC

Read efficiency question

Wondering if there's a difference when querying by primary key between the
two definitions below:

primary key ((key1, key2, key3))
primary key ((key1, key2), key3)

In terms of read speed/efficiency... I don't have much of a reason
otherwise to prefer one setup over the other, so would prefer the most
efficient for querying.

Thanks.

Re: Read efficiency question

Posted by Oskar Kjellin <os...@gmail.com>.

Yes, sorry, I missed the double parentheses in the first case.

I may be a bit off here, but I don't think the coordinator pinpoints the row, just the node it needs to go to.
It's more a case of creating smaller partitions, which spreads load more evenly across the cluster and means the node does not have to read a whole lot of data into memory only to GC it later.

Think of Cassandra as a hash map (which it kind of is): you want the key to be as unique as possible, so that you don't have to go to a bucket and filter there, or create hot spots.
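
As a rough illustration of the hash-map analogy, here is a small Python sketch. It is not how Cassandra is implemented: the md5-plus-modulo placement in node_for is a stand-in for the real partitioner and token ring, and NUM_NODES, node_for and the generated rows are all made up for the example. It only shows how many rows end up sharing a partition under each of the two primary key definitions.

```python
import hashlib
from collections import Counter

NUM_NODES = 4  # hypothetical cluster size, chosen only for this sketch


def node_for(partition_key: tuple) -> int:
    """Map a partition key to a node index.

    Stand-in for Cassandra's partitioner: real clusters place tokens on a
    ring rather than taking a hash modulo the node count.
    """
    digest = hashlib.md5(repr(partition_key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES


# Mock data: 3 values of key1, 3 of key2, 100 of key3 -> 900 rows.
rows = [(k1, k2, k3) for k1 in range(3) for k2 in range(3) for k3 in range(100)]

# Option 1: primary key ((key1, key2, key3)) -> the whole tuple is the partition key.
partitions_opt1 = Counter((k1, k2, k3) for k1, k2, k3 in rows)

# Option 2: primary key ((key1, key2), key3) -> rows sharing (key1, key2) share a partition.
partitions_opt2 = Counter((k1, k2) for k1, k2, _ in rows)

print(len(partitions_opt1), max(partitions_opt1.values()))  # 900 1
print(len(partitions_opt2), max(partitions_opt2.values()))  # 9 100

# Partitions per node under each option (exact split depends on the hash).
print(Counter(node_for(pk) for pk in partitions_opt1))
print(Counter(node_for(pk) for pk in partitions_opt2))
```

With the three-part partition key every row is its own small partition; with the two-part key each partition holds all key3 values for a given (key1, key2), so partition size grows with the cardinality of key3.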


> On 27 Dec 2016, at 17:12, Voytek Jarnot <vo...@gmail.com> wrote:
> 
> Thank you Oskar.  I think you may be missing the double parentheses in the first example - difference is between partition key of (key1, key2, key3) and (key1, key2).  With that in mind, I believe your answer would be that the first example is more efficient?
> 
> Is this essentially a case of the coordinator node being able to exactly pinpoint a row (first example) vs the coordinator node pinpointing the partition and letting the partition-owning node refine down to the right row using the clustering key (key3 in the second example)?

Re: Read efficiency question

Posted by Voytek Jarnot <vo...@gmail.com>.

Thank you Oskar.  I think you may be missing the double parentheses in the
first example - difference is between partition key of (key1, key2, key3)
and (key1, key2).  With that in mind, I believe your answer would be that
the first example is more efficient?

Is this essentially a case of the coordinator node being able to exactly
pinpoint a row (first example) vs the coordinator node pinpointing the
partition and letting the partition-owning node refine down to the right
row using the clustering key (key3 in the second example)?

On Tue, Dec 27, 2016 at 10:06 AM, Oskar Kjellin <os...@gmail.com>
wrote:

> The second one will be the most efficient.
> How much depends on how unique key1 is.
>
> In the first case everything for the same key1 will be on the same
> partition.  If it's not unique at all that will be very bad.
> In the second case the combo of key1 and key2 will decide what partition.
>
> If you don't ever have to find all key2 for a given key1 I don't see any
> reason to do case 1

Re: Read efficiency question

Posted by Oskar Kjellin <os...@gmail.com>.

The second one will be the most efficient. 
How much depends on how unique key1 is. 

In the first case everything for the same key1 will be on the same partition.  If it's not unique at all that will be very bad. 
In the second case the combo of key1 and key2 will decide what partition. 

If you don't ever have to find all key2 for a given key1 I don't see any reason to do case 1

Re: Read efficiency question

Posted by Voytek Jarnot <vo...@gmail.com>.

Thank you Janne.  Yes, these are random-access (scatter) reads - I've
decided on option 1; having also considered (as you wrote) that it will
never make sense to look at ranges of key3.

On Fri, Dec 30, 2016 at 3:40 AM, Janne Jalkanen <ja...@ecyrd.com>
wrote:

> In practice, the performance you’re getting is likely to be impacted by
> your read patterns.  If you do a lot of sequential reads where key1 and
> key2 stay the same, and only key3 varies, then you may get better
> performance out of the second option due to hitting the row and disk caches
> more often. If you are doing a lot of scatter reads, then you’re likely to
> get better performance out of the first option, because the reads will be
> distributed more evenly across multiple nodes.  It also depends on how large
> your rows will be, as this directly impacts things like
> compaction, which affects the speed of the entire cluster.  For
> just a few values of key3, I doubt there would be much difference in
> performance, but if key3 has a cardinality of, say, a million, you might be
> better off with option 1.
>
> As always, the advice is: benchmark your intended use case. Put a few
> hundred gigs of mock data into a cluster, trigger compactions and do perf
> tests for different kinds of read/write loads. :-)
>
> (Though if I didn’t know what my read pattern would be, I’d probably go
> for option 1 purely on a gut feeling if I was sure I would never need range
> queries on key3; shorter rows *usually* are a bit better for performance,
> compaction, etc.  Really wide rows can sometimes be a headache
> operationally.)
>
> May you have energy and success!
> /Janne
>
>
>
> On 28 Dec 2016, at 16:44, Manoj Khangaonkar <kh...@gmail.com> wrote:
>
> In the first case, the partitioning is based on key1, key2, and key3.
>
> In the second case, partitioning is based on key1 and key2. Additionally, you
> have a clustering key, key3. This means that within a partition you can do range
> queries on key3 efficiently. That is the difference.
>
> regards

Re: Read efficiency question

Posted by Manoj Khangaonkar <kh...@gmail.com>.

In the first case, the partitioning is based on key1, key2, and key3.

In the second case, partitioning is based on key1 and key2. Additionally, you
have a clustering key, key3. This means that within a partition you can do range
queries on key3 efficiently. That is the difference.
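
To make the range-query point concrete, here is a minimal Python sketch of why a clustering key helps. It assumes a toy model, not Cassandra's storage engine: a single partition for one fixed (key1, key2) is represented as a list of rows kept sorted by key3, and the names partition, clustering_keys and range_query are invented for the example.

```python
import bisect

# One partition for a fixed (key1, key2); rows are kept sorted by the
# clustering key key3, as in the second definition.
partition = sorted((k3, f"value-{k3}") for k3 in [42, 7, 19, 3, 88, 56])
clustering_keys = [k3 for k3, _ in partition]


def range_query(low: int, high: int) -> list:
    """Return rows with low <= key3 <= high via binary search on the sorted keys."""
    start = bisect.bisect_left(clustering_keys, low)
    end = bisect.bisect_right(clustering_keys, high)
    return partition[start:end]


print(range_query(10, 60))
# [(19, 'value-19'), (42, 'value-42'), (56, 'value-56')]
```

In the first definition each key3 value hashes to its own partition, so there is no comparable sorted run of rows to slice.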

regards

-- 
http://khangaonkar.blogspot.com/