Posted to user@cassandra.apache.org by Ben Gambley <be...@intoscience.com> on 2011/10/27 04:35:49 UTC

Read Performance / Schema Design

Hi Everyone

I have a question regarding read performance and schema design, if someone could help please.


Our requirement is to store many unique results per user (each result is basically an attempt at some questions), so I had thought of using the userid as the row key and the result ids as columns.

The keys for the result ids are maintained in a separate location, so they are known without having to perform any additional lookups.

My concern is that, over time, reading a single result would incur the overhead of reading the entire row from disk, and so gradually slow things down.


So I was wondering whether changing the row key to userid + result id would be a better solution?



Cheers
Ben


Re: Read Performance / Schema Design

Posted by David Jeske <da...@gmail.com>.
On Wed, Oct 26, 2011 at 7:35 PM, Ben Gambley <be...@intoscience.com> wrote:

> Our requirement is to store per user, many unique results (which is
> basically an attempt at some questions ..) so I had thought of having the
> userid as the row key and the result id as columns.
>
> The keys  for the result ids are maintained in a separate location so are
> known without having to perform any additional lookups.
>
> My concern is that over time reading a single result would incur the
> overhead of reading the entire row from disk so gradually slow things down.
>
> So I was considering if changing the row key to *userid + result id* would
> be a better solution ?
>

This is a clustering choice. Assuming your dataset is too big to fit in
system memory, some considerations which should drive your decision are
locality of access, cache efficiency, and worst-case performance.

1) If you access many results for a user-id around the same time, then
putting them close together will get you better lookup throughput: once a
disk seek has brought one result into memory, that user's neighbouring
results will also be available in memory (until they fall out of cache).
This is generally going to be somewhat true whether you use a key
(userid/resultid) or a key (userid) and column (resultid).

2) If some (or one) result-id is accessed much more frequently than others
across all users, then you might want that result-id to get a good cache
hit rate, while less frequently accessed result-ids do not need to be
cached as much. To get this behavior, you'd want the data for the hottest
result-id to be packed together, and thus you'd want to either use a key
such as "result-id/user-id", or use a separate column family for "hot"
result-ids.

3) How many results can a user have? If every user can have an effectively
unbounded number of results (say > 40k), but you generally only need one
result at a time, then you probably want to use a compound key
(userid/resultid) rather than a key (userid) + column (resultid), because
you don't want to have to deal with large data just to get a small piece.
That said, it seems that Cassandra's handling of wide rows is improving, so
perhaps in the future this will not be as large an issue.
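
As a rough illustration of the key layouts discussed in the points above,
here is a toy in-memory sketch in Python. It is not Cassandra client code:
the dictionaries, the make_key helper, the "/" separator and the sample data
are all made up for illustration, and it assumes keys are kept in sorted
order (which depends on the partitioner you use).

```python
from bisect import bisect_left

SEP = "/"  # hypothetical separator; assumes ids never contain it


def make_key(*parts):
    """Join id parts into one compound row key (illustrative helper)."""
    return SEP.join(parts)


# Layout A: row key = userid, one column per resultid (a "wide row").
wide_rows = {
    "user1": {"r1": "score=7", "r2": "score=9"},
    "user2": {"r1": "score=4"},
}

# Layout B: row key = userid/resultid, one small row per result.
compound_rows = {
    make_key(user, result): value
    for user, columns in wide_rows.items()
    for result, value in columns.items()
}


def results_for_prefix(sorted_keys, prefix):
    """Return the keys starting with prefix + SEP, relying on sort order."""
    start = bisect_left(sorted_keys, prefix + SEP)
    found = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix + SEP):
            break  # sorted input: no later key can match
        found.append(key)
    return found


# userid/resultid ordering keeps one user's results adjacent (point 1).
by_user = sorted(compound_rows)

# resultid/userid ordering keeps one result's users adjacent (point 2).
by_result = sorted(
    make_key(result, user)
    for user, columns in wide_rows.items()
    for result in columns
)

print(results_for_prefix(by_user, "user1"))
print(results_for_prefix(by_result, "r1"))
```

With the compound key, a single result is one direct lookup
(compound_rows["user2/r1"]), while the order of the key parts decides which
results end up next to each other.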

Re: Read Performance / Schema Design

Posted by Tyler Hobbs <ty...@datastax.com>.
On Wed, Oct 26, 2011 at 9:35 PM, Ben Gambley <be...@intoscience.com> wrote:

>
> Hi Everyone
>
> I have a question with regards read performance and schema design if
> someone could help please.
>
>
> Our requirement is to store per user, many unique results (which is
> basically an attempt at some questions ..) so I had thought of having the
> userid as the row key and the result id as columns.
>
> The keys  for the result ids are maintained in a separate location so are
> known without having to perform any additional lookups.
>
> My concern is that over time reading a single result would incur the
> overhead of reading the entire row from disk so gradually slow things down.
>
>
> So I was considering if changing the row key to *userid + result id* would
> be a better solution ?
>
>

Do you regularly need to read all of the results for a given userid?  If
not, go with the user_id + result_id approach. It will be more efficient for
single-result lookups.

-- 
Tyler Hobbs
DataStax <http://datastax.com/>