You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@cassandra.apache.org by José Guilherme Vanz <gu...@gmail.com> on 2014/12/11 12:00:54 UTC

Good partition key doubt

Hello folks

I am studying Cassandra for a short a period of time and now I am modeling
a database for study purposes. During my modeling I have faced a doubt,
what is a good partition key? Is partition key direct related with my query
performance? What is the best practices?

Just to study case, let's suppose I have a column family where is inserted
all kind of logs ( http server, application server, application logs, etc )
data from different servers. In this column family I have server_id (
unique identifier for each server ) column, log_type ( http server,
application server, application log ) column and log_info column. Is a good
ideia create a partition key using server_id and log_type columns to store
all logs data from a specific type and server in a physical row? And if do
I want a physical row for each day? Is a good idea add a third column with
the date in the partition key? And if I want to query all logs in a period
of time how can I select I range o rows? Do I have to duplicate date column
( considering I have to use = operator with partition key ) ?

All the best
-- 
Att. José Guilherme Vanz
br.linkedin.com/pub/josé-guilherme-vanz/51/b27/58b/
<http://br.linkedin.com/pub/jos%C3%A9-guilherme-vanz/51/b27/58b/>
"O sofrimento é passageiro, desistir é para sempre" - Bernardo Fonseca,
recordista da Antarctic Ice Marathon.

Re: Good partition key doubt

Posted by José Guilherme Vanz <gu...@gmail.com>.

Nice, I got it. =]
If I have more questions I'll send other emails. xD
Thank you

On Thu, Dec 11, 2014 at 12:17 PM, DuyHai Doan <do...@gmail.com> wrote:
>
> "what is a good partition key? Is partition key direct related with my
> query performance? What is the best practices?"
>
> A good partition key is a partition key that will scale with your data. An
> example: if you have a business involving individuals, it is likely that
> your business will scale as soon as the number of users will grow. In this
> case user_id is a good partition key because all the users will
> be uniformly distributed over all the Cassandra nodes.
>
> For your log example, using only server_id for partition key is clearly
> not enough because what will scale is the log lines, not the number of
> server.
>
> From the point of view of scalability (not taking about query-ability),
> adding the log_type will not scale either, because the number of different
> log types is likely to be a small set. For great scalability (not taking
> about query-ability), the couple (server_id,log_timestamp) is likely a good
> combination.
>
>  Now for query, as you should know, it is not possible to have range query
> (using <, ≤, ≥, >) over partition key, you must always use equality (=) so
> you won't be able to leverage the log_timestamp component in the partition
> key for your query.
>
> Bucketing by date is a good idea though, and the date resolution will
> depends on the log generation rate. If logs are generated very often, maybe
> a bucket by hour. If the generation rate is smaller, maybe a day or a week
> bucket is fine.
>
> Talking about log_type, putting it into the partition key will help
> partitioning further, in addition of the date bucket. However it forces you
> to always provide a log_type whenever you want to query, be aware of this.
>
> An example of data model for your logs could be
>
> CREATE TABLE logs_by_server_and_type_and_date(
>    server_id int,
>    log_type text,
>    date_bucket int, //Date bucket using format YYYYMMDD or YYYYMMDDHH or
> ...
>    log_timestamp timeuuid,
>    log_info text,
>    PRIMARY KEY((server_id,log_type,date_bucket),log_timestamp)
> );
>
>
> "And if I want to query all logs in a period of time how can I select I
> range o rows?" --> New query path = new table
>
> CREATE TABLE logs_by_date(
>    date_bucket int, //Date bucket using format YYYYMMDD or YYYYMMDDHH or
> ...
>    log_timestamp timeuuid,
>    server_id int,
>    log_type text,
>    log_info text,
>    PRIMARY KEY((date_bucket),log_timestamp) // you may add server_id or
> log_type as clustering column optionally
> );
>
> For this table, the date_bucket should be chosen very carefully because
> for the same bucket, we're going to store logs of ALL servers and all types
> ...
>
> For the query, you should provide the date bucket as partition key, and
> then use (<, ≤, ≥, >) on the log_timestamp column
>
>
> On Thu, Dec 11, 2014 at 12:00 PM, José Guilherme Vanz <
> guilherme.sft@gmail.com> wrote:
>
>> Hello folks
>>
>> I am studying Cassandra for a short a period of time and now I am
>> modeling a database for study purposes. During my modeling I have faced a
>> doubt, what is a good partition key? Is partition key direct related with
>> my query performance? What is the best practices?
>>
>> Just to study case, let's suppose I have a column family where is
>> inserted all kind of logs ( http server, application server, application
>> logs, etc ) data from different servers. In this column family I have
>> server_id ( unique identifier for each server ) column, log_type ( http
>> server,  application server, application log ) column and log_info column.
>> Is a good ideia create a partition key using server_id and log_type columns
>> to store all logs data from a specific type and server in a physical row?
>> And if do I want a physical row for each day? Is a good idea add a third
>> column with the date in the partition key? And if I want to query all logs
>> in a period of time how can I select I range o rows? Do I have to duplicate
>> date column ( considering I have to use = operator with partition key ) ?
>>
>> All the best
>> --
>> Att. José Guilherme Vanz
>> br.linkedin.com/pub/josé-guilherme-vanz/51/b27/58b/
>> <http://br.linkedin.com/pub/jos%C3%A9-guilherme-vanz/51/b27/58b/>
>> "O sofrimento é passageiro, desistir é para sempre" - Bernardo Fonseca,
>> recordista da Antarctic Ice Marathon.
>>
>
>

-- 
Att. José Guilherme Vanz
br.linkedin.com/pub/josé-guilherme-vanz/51/b27/58b/
<http://br.linkedin.com/pub/jos%C3%A9-guilherme-vanz/51/b27/58b/>
"O sofrimento é passageiro, desistir é para sempre" - Bernardo Fonseca,
recordista da Antarctic Ice Marathon.

Re: Good partition key doubt

Posted by DuyHai Doan <do...@gmail.com>.

"what is a good partition key? Is partition key direct related with my
query performance? What is the best practices?"

A good partition key is a partition key that will scale with your data. An
example: if you have a business involving individuals, it is likely that
your business will scale as soon as the number of users will grow. In this
case user_id is a good partition key because all the users will
be uniformly distributed over all the Cassandra nodes.

For your log example, using only server_id for partition key is clearly not
enough because what will scale is the log lines, not the number of server.

>From the point of view of scalability (not taking about query-ability),
adding the log_type will not scale either, because the number of different
log types is likely to be a small set. For great scalability (not taking
about query-ability), the couple (server_id,log_timestamp) is likely a good
combination.

 Now for query, as you should know, it is not possible to have range query
(using <, ≤, ≥, >) over partition key, you must always use equality (=) so
you won't be able to leverage the log_timestamp component in the partition
key for your query.

Bucketing by date is a good idea though, and the date resolution will
depends on the log generation rate. If logs are generated very often, maybe
a bucket by hour. If the generation rate is smaller, maybe a day or a week
bucket is fine.

Talking about log_type, putting it into the partition key will help
partitioning further, in addition of the date bucket. However it forces you
to always provide a log_type whenever you want to query, be aware of this.

An example of data model for your logs could be

CREATE TABLE logs_by_server_and_type_and_date(
   server_id int,
   log_type text,
   date_bucket int, //Date bucket using format YYYYMMDD or YYYYMMDDHH or ...
   log_timestamp timeuuid,
   log_info text,
   PRIMARY KEY((server_id,log_type,date_bucket),log_timestamp)
);

"And if I want to query all logs in a period of time how can I select I
range o rows?" --> New query path = new table

CREATE TABLE logs_by_date(
   date_bucket int, //Date bucket using format YYYYMMDD or YYYYMMDDHH or ...
   log_timestamp timeuuid,
   server_id int,
   log_type text,
   log_info text,
   PRIMARY KEY((date_bucket),log_timestamp) // you may add server_id or
log_type as clustering column optionally
);

For this table, the date_bucket should be chosen very carefully because for
the same bucket, we're going to store logs of ALL servers and all types ...

For the query, you should provide the date bucket as partition key, and
then use (<, ≤, ≥, >) on the log_timestamp column

On Thu, Dec 11, 2014 at 12:00 PM, José Guilherme Vanz <
guilherme.sft@gmail.com> wrote:

> Hello folks
>
> I am studying Cassandra for a short a period of time and now I am modeling
> a database for study purposes. During my modeling I have faced a doubt,
> what is a good partition key? Is partition key direct related with my query
> performance? What is the best practices?
>
> Just to study case, let's suppose I have a column family where is inserted
> all kind of logs ( http server, application server, application logs, etc )
> data from different servers. In this column family I have server_id (
> unique identifier for each server ) column, log_type ( http server,
> application server, application log ) column and log_info column. Is a good
> ideia create a partition key using server_id and log_type columns to store
> all logs data from a specific type and server in a physical row? And if do
> I want a physical row for each day? Is a good idea add a third column with
> the date in the partition key? And if I want to query all logs in a period
> of time how can I select I range o rows? Do I have to duplicate date column
> ( considering I have to use = operator with partition key ) ?
>
> All the best
> --
> Att. José Guilherme Vanz
> br.linkedin.com/pub/josé-guilherme-vanz/51/b27/58b/
> <http://br.linkedin.com/pub/jos%C3%A9-guilherme-vanz/51/b27/58b/>
> "O sofrimento é passageiro, desistir é para sempre" - Bernardo Fonseca,
> recordista da Antarctic Ice Marathon.
>