You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@cassandra.apache.org by Nagesh <na...@gmail.com> on 2015/01/05 12:44:07 UTC

(Unknown)

Hi All,

I have designed a column family

prodgroup text, prodid int, status int, , PRIMARY KEY ((prodgroup), prodid,
status)

The data model is to cater

   - Get list of products from the product group
   - get list of products for a given range of ids
   - Get details of a specific product
   - Update status of the product acive/inactive
   - Get list of products that are active or inactive (select * from
   product where prodgroup='xyz' and prodid > 0 and status = 0)

The design works fine, except for the last query . Cassandra not allowing
to query on status unless I fix the product id. I think defining a super
column family which has the key "PRIMARY KEY((prodgroup), staus,
productid)" should work. Would like to get expert advice on other
alternatives.
-- 
Thanks,
Nageswara Rao.V

*"The LORD reigns"*

Re:

Posted by Ryan Svihla <rs...@foundev.pro>.

Something to start considering is the partition key (first part of your
primary key) drives your model more than anything. So if you're querying
for all of X your partition key should probably be X, but there are some
constraints to be mindful of.

The rest of replies inline

On Wed, Jan 7, 2015 at 1:37 AM, Nagesh <na...@gmail.com> wrote:

> Thanks Ryan, Srinivas for you answer.
>
> Finally I have decided to create three column families
>
> 1. product_date_id (yyyymm, dd, prodid) PRIMARY KEY ((yyyymm), dd, prodid)
> - Record the arrival date on updates of a product
> - Get list of products that are recently added/updated Ex: [(yyyymm, dd) >
> (2014, 06)]
>

Could just be product_date and include the entire product graph needed,
this is a tradeoff, and frequently it's optimal for performance reasons on
the read side, the downside is your usually increasing your write payload.
My thought is do a fully materialized view first and denormalize, and
include the entire product, and if you find the write traffic is too much
consider the index approach here then (which is easier after the fact to
just drop the columns).


> 1. product_status(prodid int, status int) PRIMARY KEY (prodid), INDEX on
> (status)
> - Each time I add a product just insert a record (prodid, defstatus) with
> the condition IF NOT EXISTS, to avoid status being updated, Here I couldnt
> avoid read before write to protect product status
>
> As for protecting product status that's fine, however, you could just do
what most applications do and update regardless of previous status. This
leads into different locking theories and what the right behavior for an
application is, but this is something most people never think twice about
when using MySQL or Oracle, and in the end they update status in
unprotected ways. Something to ponder.

- Update Enable/Disable prodid
> - Get list of product ids with the give status
>
>

List of product ids with a given status query will probably suck using a
2i, think scanning ALL of the nodes to get potentially as little as 2
records (if that fits within SLA however, kudos, just be aware of the
behavior).

Assuming you have large status counts and limited status items, the data
model gets trickier, as there are some rule of thumb style constraints
(varies on hardware and SLA what you can tolerate). Say you had a primary
key of (status, prodid), this would in theory very quickly return all of
the ACTIVE prodids as there may only be a few hundred, but lets say you
want to return all the archived prodids there maybe billions and this would
likely take far far too long to return in one query, not to mention
compaction of such a large partition will be fun, and it'll unbalance your
cluster.

So frequently for this particular corner I end up having to do some form of
sharding to spread status over the cluster and keep sizes of the partition
reasonable (and query in an async fashion to get all of the queries in a
reasonable time).

primary key((status, shardId), prodId)

The shardid can be any up to the reasonable size limits of your hardware
and cluster (say 50k for rule of thumb), and there are a number of
different approaches:

- it can be a random uuid but then you have to track with a separate table
what shardIds there are for that particular status (this is not uncommon)
- it can be a fixed size say 10000 and you can just increment the number by
1 (but make sure as you're updating this you're not introducing any fun
state bugs that have to different shards writing to the same number). When
you query you keep increasing the number until you stop getting responses.
This has the downside in that optimization is a bit hard to get right.
Optionally you can have a static column in the table called maxShardId that
once you've done your first query you know how many parallel queries you
have to send out.
- It can be based on some business logic or domain rule that includes some
fixed boundaries, say add a productGroupId in there, and you know from an
application level, how many productGroupIds there are. This has the
downside of not giving you absolute protection against fat partitions, on
the upside it fits your natural domain model and is easier to reason about.



> 2. product_details(prodgrp, prodid, ..... <other details of product>)
> PRIMARY KEY (prodgrp, prodid)
> - Insert product details in the prodgrp blindly to store recent updates of
> the product details
> - Get list of products in the product group
> - Get details of products for the give ids
>
> "get list of products for a given range of ids" : My queries are answered
> with the above design.
>
> PS: I am still thinking to avoid read before write on product_status. And
> would like to see if there is better way to design using supercolumn
> families or materialized views which I am yet to explore.
>
>
Materialized views are your friend, use them freely but as always being
mindful of real world constraints and goals.


> Regards,
> Nageswara Rao
>
> On Tue, Jan 6, 2015 at 10:53 PM, Ryan Svihla <rs...@foundev.pro> wrote:
>
>> Normal data modeling approach in Cassandra is a separate column family of
>> each of those queries is answerable with one partition key (that's going to
>> be the fastest).
>>
>> I'm very suspicious of
>>
>>    - get list of products for a given range of ids
>>
>> Is this being driven by another query to get a list of ids? If so that
>> should probably be modeled differently and any query that would normally
>> return a list of ids should instead be modeled to produce a full product
>> (materialized views scale very well on any database and it's the common
>> approach on Cassandra)
>>
>>
>> On Mon, Jan 5, 2015 at 6:18 AM, Srinivasa T N <se...@gmail.com> wrote:
>>
>>> Just an arrow in the dark: Doucment "CQL for Cassandra 2.x
>>> Documentation" informs that cassandra allows to query on a column when it
>>> is indexed.
>>>
>>> Regards,
>>> Seenu.
>>>
>>> On Mon, Jan 5, 2015 at 5:14 PM, Nagesh <na...@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I have designed a column family
>>>>
>>>> prodgroup text, prodid int, status int, , PRIMARY KEY ((prodgroup),
>>>> prodid, status)
>>>>
>>>> The data model is to cater
>>>>
>>>>    - Get list of products from the product group
>>>>    - get list of products for a given range of ids
>>>>    - Get details of a specific product
>>>>    - Update status of the product acive/inactive
>>>>    - Get list of products that are active or inactive (select * from
>>>>    product where prodgroup='xyz' and prodid > 0 and status = 0)
>>>>
>>>> The design works fine, except for the last query . Cassandra not
>>>> allowing to query on status unless I fix the product id. I think defining a
>>>> super column family which has the key "PRIMARY KEY((prodgroup), staus,
>>>> productid)" should work. Would like to get expert advice on other
>>>> alternatives.
>>>> --
>>>> Thanks,
>>>> Nageswara Rao.V
>>>>
>>>> *"The LORD reigns"*
>>>>
>>>
>>>
>>
>>
>> --
>>
>> Thanks,
>> Ryan Svihla
>>
>>
>
>
> --
> Thanks,
> Nageswara Rao.V
>
> *"The LORD reigns"*
>



-- 

Thanks,
Ryan Svihla

Re:

Posted by Nagesh <na...@gmail.com>.

Thanks Ryan, Srinivas for you answer.

Finally I have decided to create three column families

1. product_date_id (yyyymm, dd, prodid) PRIMARY KEY ((yyyymm), dd, prodid)
- Record the arrival date on updates of a product
- Get list of products that are recently added/updated Ex: [(yyyymm, dd) >
(2014, 06)]

1. product_status(prodid int, status int) PRIMARY KEY (prodid), INDEX on
(status)
- Each time I add a product just insert a record (prodid, defstatus) with
the condition IF NOT EXISTS, to avoid status being updated, Here I couldnt
avoid read before write to protect product status

- Update Enable/Disable prodid
- Get list of product ids with the give status

2. product_details(prodgrp, prodid, ..... <other details of product>)
PRIMARY KEY (prodgrp, prodid)
- Insert product details in the prodgrp blindly to store recent updates of
the product details
- Get list of products in the product group
- Get details of products for the give ids

"get list of products for a given range of ids" : My queries are answered
with the above design.

PS: I am still thinking to avoid read before write on product_status. And
would like to see if there is better way to design using supercolumn
families or materialized views which I am yet to explore.

Regards,
Nageswara Rao

On Tue, Jan 6, 2015 at 10:53 PM, Ryan Svihla <rs...@foundev.pro> wrote:

> Normal data modeling approach in Cassandra is a separate column family of
> each of those queries is answerable with one partition key (that's going to
> be the fastest).
>
> I'm very suspicious of
>
>    - get list of products for a given range of ids
>
> Is this being driven by another query to get a list of ids? If so that
> should probably be modeled differently and any query that would normally
> return a list of ids should instead be modeled to produce a full product
> (materialized views scale very well on any database and it's the common
> approach on Cassandra)
>
>
> On Mon, Jan 5, 2015 at 6:18 AM, Srinivasa T N <se...@gmail.com> wrote:
>
>> Just an arrow in the dark: Doucment "CQL for Cassandra 2.x Documentation"
>> informs that cassandra allows to query on a column when it is indexed.
>>
>> Regards,
>> Seenu.
>>
>> On Mon, Jan 5, 2015 at 5:14 PM, Nagesh <na...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I have designed a column family
>>>
>>> prodgroup text, prodid int, status int, , PRIMARY KEY ((prodgroup),
>>> prodid, status)
>>>
>>> The data model is to cater
>>>
>>>    - Get list of products from the product group
>>>    - get list of products for a given range of ids
>>>    - Get details of a specific product
>>>    - Update status of the product acive/inactive
>>>    - Get list of products that are active or inactive (select * from
>>>    product where prodgroup='xyz' and prodid > 0 and status = 0)
>>>
>>> The design works fine, except for the last query . Cassandra not
>>> allowing to query on status unless I fix the product id. I think defining a
>>> super column family which has the key "PRIMARY KEY((prodgroup), staus,
>>> productid)" should work. Would like to get expert advice on other
>>> alternatives.
>>> --
>>> Thanks,
>>> Nageswara Rao.V
>>>
>>> *"The LORD reigns"*
>>>
>>
>>
>
>
> --
>
> Thanks,
> Ryan Svihla
>
>


-- 
Thanks,
Nageswara Rao.V

*"The LORD reigns"*

Re:

Posted by Ryan Svihla <rs...@foundev.pro>.

Normal data modeling approach in Cassandra is a separate column family of
each of those queries is answerable with one partition key (that's going to
be the fastest).

I'm very suspicious of

   - get list of products for a given range of ids

Is this being driven by another query to get a list of ids? If so that
should probably be modeled differently and any query that would normally
return a list of ids should instead be modeled to produce a full product
(materialized views scale very well on any database and it's the common
approach on Cassandra)

On Mon, Jan 5, 2015 at 6:18 AM, Srinivasa T N <se...@gmail.com> wrote:

> Just an arrow in the dark: Doucment "CQL for Cassandra 2.x Documentation"
> informs that cassandra allows to query on a column when it is indexed.
>
> Regards,
> Seenu.
>
> On Mon, Jan 5, 2015 at 5:14 PM, Nagesh <na...@gmail.com> wrote:
>
>> Hi All,
>>
>> I have designed a column family
>>
>> prodgroup text, prodid int, status int, , PRIMARY KEY ((prodgroup),
>> prodid, status)
>>
>> The data model is to cater
>>
>>    - Get list of products from the product group
>>    - get list of products for a given range of ids
>>    - Get details of a specific product
>>    - Update status of the product acive/inactive
>>    - Get list of products that are active or inactive (select * from
>>    product where prodgroup='xyz' and prodid > 0 and status = 0)
>>
>> The design works fine, except for the last query . Cassandra not allowing
>> to query on status unless I fix the product id. I think defining a super
>> column family which has the key "PRIMARY KEY((prodgroup), staus,
>> productid)" should work. Would like to get expert advice on other
>> alternatives.
>> --
>> Thanks,
>> Nageswara Rao.V
>>
>> *"The LORD reigns"*
>>
>
>

-- 

Thanks,
Ryan Svihla

Re:

Posted by Srinivasa T N <se...@gmail.com>.

Just an arrow in the dark: Doucment "CQL for Cassandra 2.x Documentation"
informs that cassandra allows to query on a column when it is indexed.

Regards,
Seenu.

On Mon, Jan 5, 2015 at 5:14 PM, Nagesh <na...@gmail.com> wrote:

> Hi All,
>
> I have designed a column family
>
> prodgroup text, prodid int, status int, , PRIMARY KEY ((prodgroup),
> prodid, status)
>
> The data model is to cater
>
>    - Get list of products from the product group
>    - get list of products for a given range of ids
>    - Get details of a specific product
>    - Update status of the product acive/inactive
>    - Get list of products that are active or inactive (select * from
>    product where prodgroup='xyz' and prodid > 0 and status = 0)
>
> The design works fine, except for the last query . Cassandra not allowing
> to query on status unless I fix the product id. I think defining a super
> column family which has the key "PRIMARY KEY((prodgroup), staus,
> productid)" should work. Would like to get expert advice on other
> alternatives.
> --
> Thanks,
> Nageswara Rao.V
>
> *"The LORD reigns"*
>