
Posted to user@cassandra.apache.org by Bamoqi <ba...@gmail.com> on 2016/01/08 08:57:35 UTC

Is it good for performance to put rows that are of different types but are always queried together in the same table partition?

My question is whether doing so will result in better
memory/disk cache locality.

Suppose I need to query 2 different types of rows for a frequent
user request. I can use 2 tables or 1 table:

2 tables:

   create table t1(
     partitionkey int primary key,
     col1 int, col2 int, ...
   )
   create table t2(
     partitionkey int primary key,
     col3 int, col4 int, ...
   )

query-2table:
   select col1,col2 from t1 where partitionkey = ?
   select col3,col4 from t2 where partitionkey = ?

1 table:

   create table t(
     partitionkey int,
     rowtype tinyint,
     col1 int, col2 int, ...
     col3 int, col4 int, ...
     primary key( partitionkey, rowtype )
   )

query-1table-a:
   select col1,col2 from t where partitionkey = ? and rowtype = 1
   select col3,col4 from t where partitionkey = ? and rowtype = 2

or alternatively, query-1table-b:
   select rowtype,col1,col2,col3,col4 from t where partitionkey = ?
   // switch on `rowtype` in the app code

Is there a significant performance difference between query-2table,
query-1table-a, and query-1table-b?
Is the Cassandra client/coordinator smart enough to direct subsequent
queries of the same (table, partitionkey) to the same node so they can
reuse a cached page?

Regards & Thanks

Re: [Typo correction] Is it good for performance to put rows that are of different types but are always queried together in the same table partition?

Posted by Carlos Alonso <in...@mrcalonso.com>.

Why can't you have something like this?

CREATE TABLE t (
  p INT,
  q1 INT,
  q2 UUID,
  c1 INT,
  c2 TEXT,
  PRIMARY KEY (p, q1, q2)
)

Sounds like the simplest solution.

Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>

On 12 January 2016 at 18:27, Bamoqi <ba...@gmail.com> wrote:

> I over-simplified the original example. In the real model I cannot just
> merge the row types. Suppose
>     create table t1(
>         p int,
>         q1 int,
>         c1 int,
>         primary key( p, q1 )
>     )
>     create table t2(
>         p int,
>         q2 uuid,
>         c2 text,
>         primary key( p, q2 )
>     )
>
> Merging the tables will be slightly ugly and waste some storage in the
> clustering columns:
>     create table t(
>         p int,
>         rowtype tinyint, // t1 or t2
>         q1 int, q2 uuid, // depending on rowtype, either q1 or q2 is unused
>         c1 int, c2 text, // depending on rowtype, either c1 or c2 is null
>         primary key( p, rowtype, q1, q2)
>     )
>
> Nevertheless, putting them into one table seems faster as we only need one
> query to get both types, and have better cache locality. Am I correct?

Re: [Typo correction] Is it good for performance to put rows that are of different types but are always queried together in the same table partition?

Posted by Bamoqi <ba...@gmail.com>.

I over-simplified the original example. In the real model I cannot just 
merge the row types. Suppose
     create table t1(
         p int,
         q1 int,
         c1 int,
         primary key( p, q1 )
     )
     create table t2(
         p int,
         q2 uuid,
         c2 text,
         primary key( p, q2 )
     )

Merging the tables will be slightly ugly and waste some storage in the 
clustering columns:
     create table t(
         p int,
         rowtype tinyint, // t1 or t2
         q1 int, q2 uuid, // depending on rowtype, either q1 or q2 is unused
         c1 int, c2 text, // depending on rowtype, either c1 or c2 is null
         primary key( p, rowtype, q1, q2)
     )

Nevertheless, putting them into one table seems faster as we only need 
one query to get both types, and have better cache locality. Am I correct?
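A minimal Python sketch of how an application might build and split rows
for the merged table above. Cassandra generally requires a value for every
clustering column on insert, so the "unused" one of q1/q2 would need a
placeholder. The placeholder constants, the dict row shape, and the helper
names below are assumptions for illustration, not part of the schema:

```python
import uuid

# Placeholder values for the clustering column a row type does not use.
# These constants are assumptions for illustration only.
UNUSED_Q1 = 0
UNUSED_Q2 = uuid.UUID(int=0)

T1, T2 = 1, 2  # rowtype values: 1 = former t1 row, 2 = former t2 row


def t1_row(p, q1, c1):
    """Build a merged-table row that carries t1 data."""
    return {"p": p, "rowtype": T1, "q1": q1, "q2": UNUSED_Q2, "c1": c1, "c2": None}


def t2_row(p, q2, c2):
    """Build a merged-table row that carries t2 data."""
    return {"p": p, "rowtype": T2, "q1": UNUSED_Q1, "q2": q2, "c1": None, "c2": c2}


def split_by_type(rows):
    """Split rows read from one partition into (t1_rows, t2_rows)."""
    t1 = [(r["q1"], r["c1"]) for r in rows if r["rowtype"] == T1]
    t2 = [(r["q2"], r["c2"]) for r in rows if r["rowtype"] == T2]
    return t1, t2
```

Because rowtype is the first clustering column, all rows of one type sit
next to each other within the partition, so a single partition read can be
split with one pass like this.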


On Saturday, January 09, 2016 06:47 AM, Jack Krupansky wrote:
> A simple denormalization is probably all that is called for - just
> merge the two tables into one (their union). No need for this row type.

Re: [Typo correction] Is it good for performance to put rows that are of different types but are always queried together in the same table partition?

Posted by Jack Krupansky <ja...@gmail.com>.

A simple denormalization is probably all that is called for - just merge
the two tables into one (their union). No need for this row type.


-- Jack Krupansky

On Fri, Jan 8, 2016 at 9:30 AM, Jeff Jirsa <je...@crowdstrike.com>
wrote:

> You’ll see better performance using a slice (which is effectively what
> will happen if you put them into the same table and use query-1table-b), as
> each node will only need to merge cells/results once. It may not be twice
> as fast, but it’ll be fast enough to make it worthwhile.

Re: [Typo correction] Is it good for performance to put rows that are of different types but are always queried together in the same table partition?

Posted by Jeff Jirsa <je...@crowdstrike.com>.

You’ll see better performance using a slice (which is effectively what will happen if you put them into the same table and use query-1table-b), as each node will only need to merge cells/results once. It may not be twice as fast, but it’ll be fast enough to make it worthwhile.
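The difference in the number of reads can be illustrated with a toy
in-memory model in Python. This is only a counting sketch under stated
assumptions: ToyTable, its partition_reads counter, and the sample data are
hypothetical, and nothing here models SSTables, caching, or real latency.

```python
class ToyTable:
    """In-memory stand-in for a table: partition key -> {clustering key: row}."""

    def __init__(self):
        self.partitions = {}
        self.partition_reads = 0  # how many times a partition was located and read

    def insert(self, pk, ck, row):
        self.partitions.setdefault(pk, {})[ck] = row

    def read(self, pk, ck=None):
        """Read one row (ck given) or slice the whole partition (ck omitted)."""
        self.partition_reads += 1
        rows = self.partitions.get(pk, {})
        if ck is None:
            # Slice: every row of the partition, in clustering order.
            return [rows[k] for k in sorted(rows)]
        return [rows[ck]] if ck in rows else []


t = ToyTable()
t.insert(7, 1, {"col1": 10, "col2": 20})  # rowtype = 1
t.insert(7, 2, {"col3": 30, "col4": 40})  # rowtype = 2

# query-1table-a: two statements, so the partition is read twice.
a = t.read(7, 1) + t.read(7, 2)
reads_a = t.partition_reads

# query-1table-b: one statement slices the partition in a single read.
t.partition_reads = 0
b = t.read(7)
reads_b = t.partition_reads
```

Both approaches return the same rows; the slice just gets them with one
partition read instead of two.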



On 1/8/16, 12:13 AM, "Bamoqi" <ba...@gmail.com> wrote:

>[Correction of the original message which contains typos in code.]
>
>Is it good for performance to put rows that are of different types but 
>are always queried together in the same table partition?
>
>My question is whether doing so will result in better
>memory/disk cache locality.
>
>Suppose I need to query 2 different types of rows for a frequent
>user request. I can use 2 tables or 1 table:
>
>2 tables:
>
>   create table t1(
>     partitionkey int primary key,
>     col1 int, col2 int, ...
>   )
>   create table t2(
>     partitionkey int primary key,
>     col3 int, col4 int, ...
>   )
>
>query-2table:
>   select col1,col2 from t1 where partitionkey = ?
>   select col3,col4 from t2 where partitionkey = ?
>
>1 table:
>
>   create table t(
>     partitionkey int,
>     rowtype tinyint,
>     col1 int, col2 int, ...
>     col3 int, col4 int, ...
>     primary key( partitionkey, rowtype )
>   )
>
>query-1table-a:
>   select col1,col2 from t where partitionkey = ? and rowtype = 1
>   select col3,col4 from t where partitionkey = ? and rowtype = 2
>
>or alternatively, query-1table-b:
>   select rowtype,col1,col2,col3,col4 from t where partitionkey = ?
>   // Unused columns are `null`. Switch on `rowtype` in the app code
>
>Is there a significant performance difference between query-2table,
>query-1table-a, and query-1table-b?
>Is the Cassandra client/coordinator smart enough to direct subsequent
>queries of the same (table, partitionkey) to the same node so they can
>reuse a cached page?
>
>Regards & Thanks

[Typo correction] Is it good for performance to put rows that are of different types but are always queried together in the same table partition?

Posted by Bamoqi <ba...@gmail.com>.

[Correction of the original message which contains typos in code.]

Is it good for performance to put rows that are of different types but 
are always queried together in the same table partition?

My question is whether doing so will result in better
memory/disk cache locality.

Suppose I need to query 2 different types of rows for a frequent
user request. I can use 2 tables or 1 table:

2 tables:

   create table t1(
     partitionkey int primary key,
     col1 int, col2 int, ...
   )
   create table t2(
     partitionkey int primary key,
     col3 int, col4 int, ...
   )

query-2table:
   select col1,col2 from t1 where partitionkey = ?
   select col3,col4 from t2 where partitionkey = ?

1 table:

   create table t(
     partitionkey int,
     rowtype tinyint,
     col1 int, col2 int, ...
     col3 int, col4 int, ...
     primary key( partitionkey, rowtype )
   )

query-1table-a:
   select col1,col2 from t where partitionkey = ? and rowtype = 1
   select col3,col4 from t where partitionkey = ? and rowtype = 2

or alternatively, query-1table-b:
   select rowtype,col1,col2,col3,col4 from t where partitionkey = ?
   // Unused columns are `null`. Switch on `rowtype` in the app code
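
The "switch on `rowtype`" step for query-1table-b could look roughly like
this in Python. The dict row shape, the sample values, and the handler
names are assumptions for illustration; a real driver may return rows in a
different form:

```python
# Hypothetical rows as a driver might return them for query-1table-b.
ROWS = [
    {"rowtype": 1, "col1": 10, "col2": 20, "col3": None, "col4": None},
    {"rowtype": 2, "col1": None, "col2": None, "col3": 30, "col4": 40},
]


def handle_type1(row):
    # Rows of type 1 only populate col1 and col2.
    return ("type1", row["col1"], row["col2"])


def handle_type2(row):
    # Rows of type 2 only populate col3 and col4.
    return ("type2", row["col3"], row["col4"])


# Dispatch table: rowtype value -> handler function.
HANDLERS = {1: handle_type1, 2: handle_type2}


def dispatch(rows):
    """Route each row to the handler registered for its rowtype."""
    results = []
    for row in rows:
        handler = HANDLERS.get(row["rowtype"])
        if handler is None:
            raise ValueError(f"unknown rowtype: {row['rowtype']!r}")
        results.append(handler(row))
    return results
```

A dispatch table keeps the per-type logic separate and fails loudly on a
rowtype the application does not know about.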

Is there a significant performance difference between query-2table,
query-1table-a, and query-1table-b?
Is the Cassandra client/coordinator smart enough to direct subsequent
queries of the same (table, partitionkey) to the same node so they can
reuse a cached page?
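
For context on the routing question: Cassandra places data by hashing the
partition key to a token, and drivers with a token-aware load-balancing
policy can send a statement to a replica that owns that token when the
routing key is known. The Python sketch below is a toy model of that idea
only. The ring, the node names, and the use of MD5 as the hash are all
assumptions for illustration (Cassandra's default partitioner is
Murmur3-based), and it assumes one replica per token range; with a
replication factor above 1 a driver may pick any of the replicas.

```python
import bisect
import hashlib

# Hypothetical ring: (token, node) pairs sorted by token. Names are made up.
RING = [(0, "node-a"), (2**62, "node-b"), (2**63, "node-c"), (3 * 2**62, "node-d")]
TOKENS = [token for token, _ in RING]


def token_for(partition_key: int) -> int:
    """Map a partition key to a 64-bit token (MD5 here is a stand-in hash)."""
    digest = hashlib.md5(str(partition_key).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def owner(partition_key: int) -> str:
    """Return the node owning the first ring token >= the key's token, wrapping around."""
    idx = bisect.bisect_left(TOKENS, token_for(partition_key))
    return RING[idx % len(RING)][1]
```

In this model the table name plays no part in placement: the same
partition key always maps to the same owner, whichever table is queried.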

Regards & Thanks