Posted to user@cassandra.apache.org by Zoltan Lorincz <zo...@gmail.com> on 2017/03/26 20:16:55 UTC

Help with data modelling (from MySQL to Cassandra)

Dear Cassandra users,

We have the following structure in MySQL:

documents->[doc_id(primary key), title, description]
elements->[element_id(primary key), doc_id(index), title, description]

Notation: table name->[column1(key or index), column2, …]

We want to transfer the data to Cassandra.

Each document can contain a large number of elements (between 1 and 100k+).

We have two requirements:
a) Load all elements for a given doc_id quickly
b) Update the value of one individual element quickly


We were thinking of the following Cassandra schemas:

Option A

documents->[doc_id(primary key), title, description, elements] (elements
could be a SET or a TEXT column; each time new elements are added (they are
never removed) we would append their ids to this column)
elements->[element_id(primary key), title, description]

Loading a document:

 a) Load the document with the given <doc_id> and get all element ids:
    SELECT * FROM documents WHERE doc_id='id';

 b) Load all elements with the given ids:
    SELECT * FROM elements WHERE element_id IN (ids loaded from query a);


Option B

documents->[doc_id(primary key), title, description]
elements->[element_id(primary key), doc_id(secondary index), title,
description]

Loading a document:
 a) SELECT * FROM elements WHERE doc_id='id';


Neither solution seems good: in Option A, even though we are querying by
primary keys, the second query will have 100k+ primary key ids in the WHERE
clause, and Option B relies on a secondary index, which looks like an
anti-pattern in Cassandra.

Could anyone give us advice on how to model our use case?

Thank you in advance,
Zoltan.

Re: Help with data modelling (from MySQL to Cassandra)

Posted by Zoltan Lorincz <zo...@gmail.com>.
Great suggestion! Thanks Avi!

Re: Help with data modelling (from MySQL to Cassandra)

Posted by Avi Kivity <av...@scylladb.com>.
You can use static columns and just one table:


CREATE TABLE documents (
    doc_id uuid,
    element_id uuid,
    description text static,
    doc_title text static,
    element_title text,
    PRIMARY KEY (doc_id, element_id)
);


The static columns are stored once per unique doc_id (once per partition),
so the document metadata is not repeated for every element row.
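
For example (a sketch reusing the uuids from Zoltan's queries; the titles
and description are made up), the metadata is written once per document and
each element is its own clustering row, so a single query returns both:

-- write the document metadata once (static columns, partition key only)
INSERT INTO documents (doc_id, doc_title, description)
VALUES (131cfa55-181e-431e-7956-fe449139d613, 'Doc title', 'Doc description');

-- add an element as a clustering row in the same partition
INSERT INTO documents (doc_id, element_id, element_title)
VALUES (131cfa55-181e-431e-7956-fe449139d613,
        a5e41c5d-fd69-45d1-959b-2fe7a1578949, 'Element title');

-- one query returns the static document columns alongside every element row
SELECT doc_title, description, element_id, element_title
FROM documents
WHERE doc_id=131cfa55-181e-431e-7956-fe449139d613;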


Re: Help with data modelling (from MySQL to Cassandra)

Posted by Zoltan Lorincz <zo...@gmail.com>.
Thank you Matija! Being a newbie, it was not clear to me that I can query
by the partition key alone (without providing the clustering key), sorry
about that!
Zoltan.

Re: Help with data modelling (from MySQL to Cassandra)

Posted by Matija Gobec <ma...@gmail.com>.
That's exactly what I described. IN queries can be used sometimes, but I
usually run parallel async queries as Alexander explained.

Re: Help with data modelling (from MySQL to Cassandra)

Posted by Zoltan Lorincz <zo...@gmail.com>.
Hi Alexander,

Thank you for your help! I think we found the answer:

CREATE TABLE documents (
    doc_id uuid,
    description text,
    title text,
    PRIMARY KEY (doc_id)
);

CREATE TABLE elements (
    doc_id uuid,
    element_id uuid,
    title text,
    PRIMARY KEY (doc_id, element_id)
);

We can retrieve all elements, or update a single one, with the following
queries:

SELECT * FROM elements WHERE doc_id=131cfa55-181e-431e-7956-fe449139d613;

UPDATE elements SET title='Hello'
WHERE doc_id=131cfa55-181e-431e-7956-fe449139d613
  AND element_id=a5e41c5d-fd69-45d1-959b-2fe7a1578949;

Zoltan.


Re: Help with data modelling (from MySQL to Cassandra)

Posted by Alexander Dejanovski <al...@thelastpickle.com>.
Hi Zoltan,

you must try to avoid multi-partition queries as much as possible. Instead,
use asynchronous queries to grab several partitions concurrently.
Try to send no more than ~100 queries at the same time to avoid DDoS-ing
your cluster.
With 100k+ elements, this would leave you with roughly 1,000+ groups of
async queries to run.
Performance will really depend on your hardware, consistency level, load
balancing policy, partition fragmentation (how many updates you'll run on
each element over time) and the SLA you're expecting.
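
A minimal sketch of that pattern with the DataStax Java driver 3.x (the
session, table name, id list and the 100-permit cap are assumptions for
illustration, not from the thread):

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.Semaphore;

class ElementFetcher {
    // Fetch one row per element id, keeping at most ~100 queries in flight.
    static List<Row> fetchElements(Session session, List<UUID> elementIds)
            throws InterruptedException {
        Semaphore inFlight = new Semaphore(100);
        PreparedStatement ps =
                session.prepare("SELECT * FROM elements WHERE element_id = ?");
        List<ResultSetFuture> futures = new ArrayList<>();
        for (UUID elementId : elementIds) {
            inFlight.acquire();  // blocks once 100 queries are pending
            ResultSetFuture future = session.executeAsync(ps.bind(elementId));
            // Release the permit as soon as this query completes.
            future.addListener(inFlight::release, Runnable::run);
            futures.add(future);
        }
        List<Row> rows = new ArrayList<>();
        for (ResultSetFuture future : futures) {
            rows.add(future.getUninterruptibly().one());  // one row per element
        }
        return rows;
    }
}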

If that approach doesn't meet your SLA requirements, you can try to use
wide partitions and group elements under buckets:

CREATE TABLE elements (
    doc_id bigint,
    bucket bigint,
    element_id bigint,
    element_content text,
    PRIMARY KEY ((doc_id, bucket), element_id)
);

The bucket here could be a modulus of the element_id (or of the hash of
element_id if it is not a numerical value). This way you can spread
elements over the cluster and still access them directly if you have the
doc_id and the element_id to perform updates.
You'll run fewer queries concurrently, but they'll take more time than the
individual ones in the first scenario (one partition per element). You
should benchmark both solutions to see which one gives the best performance.
Bucket your elements so that your partitions don't grow over 100MB. Large
partitions are silent cluster killers (1GB+ partitions are a direct threat
to cluster stability)...
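
To make the bucket computation concrete, a sketch (NUM_BUCKETS is an
assumption; choose it so partitions stay under the ~100MB target):

class Buckets {
    // Illustrative bucket count; tune it to your largest documents.
    static final long NUM_BUCKETS = 16;

    // Deterministic, so an update can recompute the full partition key
    // (doc_id, bucket) from the element_id alone.
    static long bucketFor(long elementId) {
        return Math.floorMod(elementId, NUM_BUCKETS);
    }
}

Reading a whole document then means querying its NUM_BUCKETS (doc_id,
bucket) partitions, which can reuse the bounded-concurrency pattern
sketched above.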

To ensure the best performance, use prepared statements along with the
TokenAwarePolicy
<http://docs.datastax.com/en/drivers/java/3.0/com/datastax/driver/core/policies/TokenAwarePolicy.html>
to avoid unnecessary coordination.
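
For example (a sketch; the contact point and keyspace are placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.RoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

// Token-aware routing sends each bound statement straight to a replica
// of its partition, skipping the extra coordinator hop.
Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1")       // placeholder contact point
        .withLoadBalancingPolicy(new TokenAwarePolicy(new RoundRobinPolicy()))
        .build();
Session session = cluster.connect("my_keyspace");  // placeholder keyspace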

Cheers,


-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

Re: Help with data modelling (from MySQL to Cassandra)

Posted by Zoltan Lorincz <zo...@gmail.com>.
Querying by (doc_id and element_id) OR just by (element_id) is fine, but
the real question is: will it be efficient to query 100k+ primary keys in
the elements table?
e.g.

SELECT * FROM elements WHERE element_id IN (element_id1, element_id2,
element_id3, .... element_id100K+)?

The element_id is the primary key.

Thank you!


Re: Help with data modelling (from MySQL to Cassandra)

Posted by Matija Gobec <ma...@gmail.com>.
Have one table hold the document metadata (doc_id, title, description, ...)
and have another table, elements, where the partition key is doc_id and the
clustering key is element_id.
The only problem here is if you need to query and/or update an element just
by element_id, but I don't know your queries up front.
