Posted to user@cassandra.apache.org by Data Craftsman <da...@gmail.com> on 2012/04/05 20:46:35 UTC

Materialized Views or Index CF - data model question

Howdy,

Can I ask a data model question here?

We have a book table with 20 columns, 300 million rows, average row
size is 1500 bytes.

create table book(
 book_id,
 isbn,
 price,
 author,
 title,
 ...
 col_n1,
 col_n2,
 ...
 col_nm
);

Data usage:

We need to query data by each column, do pagination as below,

select * from book where isbn   < "XYZ" order by ISBN   descending limit 30;
select * from book where price  < 992   order by price  descending limit 30;
select * from book where col_n1 < 789   order by col_n1 descending limit 30;
select * from book where col_n2 < "MUJ" order by col_n2 descending limit 30;
...
select * from book where col_nm < 978 order by col_nm descending limit 30;

Write: 100 million updates a day.
Read : 16 million queries a day, about 200 queries per second; each query
returns 30 rows.

***
Materialized Views approach

{"ISBN_01",book_object1},{"ISBN_02",book_object2},...,{"ISBN_N",book_objectN}
...
We will end up with 20 timelines.
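The fan-out cost of this approach can be sketched in a few lines. This is a hypothetical illustration (the column names and `write_book` helper are mine, not from the thread): each base-row write is duplicated into one view CF per queryable column, with the whole book object stored under that column's value.

```python
# Each view "CF" is modeled as a plain dict: {column_value: book_object}.
# In the real design there would be 20 of these, one per queryable column.
VIEW_COLUMNS = ["isbn", "price", "author", "title"]

view_cfs = {col: {} for col in VIEW_COLUMNS}

def write_book(book):
    """Denormalize: store the full book object in every view CF."""
    writes = 0
    for col in VIEW_COLUMNS:
        view_cfs[col][book[col]] = dict(book)  # full copy of the row
        writes += 1
    return writes  # one write per view CF, per base-row update

book = {"isbn": "978-0", "price": 42, "author": "A", "title": "T"}
assert write_book(book) == len(VIEW_COLUMNS)
assert view_cfs["price"][42]["isbn"] == "978-0"
```

With 20 view CFs and 100 million base updates a day, this means roughly 2 billion view writes a day, which is what the write-performance question below is really asking about.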


***
Index approach - create 2nd Column Family as Index

'ISBN_01': 'book_id_a01','book_id_a02',...,'book_id_aN'
'ISBN_02': 'book_id_b01','book_id_b02',...,'book_id_bN'
...
'ISBN_0m': 'book_id_m01','book_id_m02',...,'book_id_mN'

This way, we will create 20 index column families.
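The index approach, and the `col < X ... limit 30` pagination it enables, can be sketched as follows. This is an illustrative model, not the poster's code: a sorted list stands in for a Cassandra row's sorted column names, which is what makes reverse range slices cheap.

```python
import bisect

# The index CF stores only book_ids, keyed by the indexed value
# (a wide-row slice per value).
index_cf = []  # sorted list of (indexed_value, book_id) pairs

def index_insert(value, book_id):
    bisect.insort(index_cf, (value, book_id))

def query_less_than(value, limit=30):
    """Mimic: select book_id where col < value order by col desc limit N."""
    pos = bisect.bisect_left(index_cf, (value, ""))
    return [bid for _, bid in reversed(index_cf[:pos])][:limit]

for i in range(100):
    index_insert(i, f"book_{i:03d}")

# Descending page of 3 below the cursor value 10:
assert query_less_than(10, limit=3) == ["book_009", "book_008", "book_007"]
```

The application then does a second lookup against the base book table to turn the page of book_ids into full rows.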

---

If we choose the Materialized Views approach, we have to update all
20 materialized view column families for each base row update.
Will the Cassandra write performance be acceptable?

The Redis docs recommend building an index for the query on each column,
which is your 1st strategy - create a 2nd index CF:
http://redis.io/topics/data-types-intro
(see the section [ Pushing IDs instead of the actual data in Redis lists ])

Should we instead normalize the data: create a base book table with book_id
as the primary key, and then build 20 index column families using the
wide-row column-slicing approach, with the indexed column's data value as
the column name and book_id as the column value?
This way, we only need to update the few index column families whose
column values actually changed, rather than all 20 Materialized View CFs.
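The write saving in the normalized design comes from only touching indexes whose values changed. A hypothetical sketch (the `apply_update` helper and column list are mine): remove the old index entry, add the new one, and skip every unchanged column.

```python
# One index "CF" per indexed column, modeled as {value: set(book_ids)}.
INDEXED = ["isbn", "price", "author", "title"]
index_cfs = {col: {} for col in INDEXED}

def apply_update(book_id, old_row, new_row):
    """Return the list of index CFs that actually needed a write."""
    touched = []
    for col in INDEXED:
        if old_row[col] == new_row[col]:
            continue  # this index CF does not need a write
        index_cfs[col].setdefault(old_row[col], set()).discard(book_id)
        index_cfs[col].setdefault(new_row[col], set()).add(book_id)
        touched.append(col)
    return touched

old = {"isbn": "978-0", "price": 10, "author": "A", "title": "T"}
new = dict(old, price=12)
assert apply_update("b1", old, new) == ["price"]  # one index CF, not 20
assert "b1" in index_cfs["price"][12]
```

The trade-off is a read-before-write (you need the old values to delete the stale index entries), which the materialized-view approach avoids.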

Another option would be to use Redis to store the master book data, and
Cassandra column families to maintain the secondary indexes.

What would you recommend?

Charlie (@mujiang) 一个 木匠
=======
Data Architect Developer
http://mujiang.blogspot.com


p.s.

Gist from datastax dev blog (
http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra )
"
 If the same event is tracked in multiple timelines,
 it’s okay to denormalize and store all of the event data in each of
those timelines.
 One of the main principles that Cassandra was built on is that disk
space is a very cheap resource;
 minimizing disk seeks at the cost of higher space consumption is a
good tradeoff.
 Unless the data for each event is ^very large^, I always prefer this
strategy over the index strategy.
"

Would a 1500-byte row size be considered large or small for Cassandra, in
your understanding?

Re: Materialized Views or Index CF - data model question

Posted by Radim Kolar <hs...@filez.com>.
Will 1500 bytes row size be large or small for Cassandra from your 
understanding?

Performance degradation starts at around 500 MB per row; it gets very slow
if you hit that limit.

Re: Materialized Views or Index CF - data model question

Posted by aaron morton <aa...@thelastpickle.com>.
> a) "These queries are not easily supported on standard Cassandra"
> select * from book where price  < 992   order by price descending limit 30;
> 
> This is a typical timeline (time-series) query, well supported by
> Cassandra, from my understanding.
Queries that use a secondary index (on price) must include at least one equality predicate; a bare range like price < 992 is not supported on its own.

> 
> b) "You do not need a different CF for each custom secondary index.
> Try putting the name of the index in the row key. "
> 
> I couldn't understand it. Can you help build a demo with the CF
> structure and some sample data?
You can have one CF that contains multiple secondary indexes. 

key: col_1:value_1
col_name: entity_id_1

key: col_2:value_2
col_name: entity_id_1
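The key/column example above can be sketched concretely. This is my illustrative reading of the suggestion, not code from the thread: a single index CF whose row key embeds the index name, so all 20 custom indexes share one CF.

```python
# One shared index CF: row key is "index_name:value", columns are entity ids.
index_cf = {}  # {f"{index_name}:{value}": set(entity_ids)}

def index_put(index_name, value, entity_id):
    index_cf.setdefault(f"{index_name}:{value}", set()).add(entity_id)

def index_get(index_name, value):
    return index_cf.get(f"{index_name}:{value}", set())

index_put("price", 992, "entity_id_1")
index_put("isbn", "978-0", "entity_id_1")

assert index_get("price", 992) == {"entity_id_1"}
assert len(index_cf) == 2  # two index rows, one column family
```

Because the index name is part of the row key, adding a 21st index means writing new rows, not creating a new column family.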

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

Re: Materialized Views or Index CF - data model question

Posted by Data Craftsman <da...@gmail.com>.
Hi Aaron,

Thanks for the quick answer, I'll build a prototype to benchmark each
approach next week.

Here are more questions based on your reply:

a) "These queries are not easily supported on standard Cassandra"
select * from book where price  < 992   order by price descending limit 30;

This is a typical timeline (time-series) query, well supported by
Cassandra, from my understanding.

b) "You do not need a different CF for each custom secondary index.
Try putting the name of the index in the row key. "

I couldn't understand it. Can you help build a demo with the CF
structure and some sample data?

Thanks,
Charlie | DBA developer




-- 
--
Thanks,

Charlie (@mujiang) 一个 木匠
=======
Data Architect Developer
http://mujiang.blogspot.com

Re: Materialized Views or Index CF - data model question

Posted by aaron morton <aa...@thelastpickle.com>.
> We need to query data by each column, do pagination as below,
> 
> select * from book where isbn   < "XYZ" order by ISBN   descending limit 30;
> select * from book where price  < 992   order by price  descending limit 30;
> select * from book where col_n1 < 789   order by col_n1 descending limit 30;
> select * from book where col_n2 < "MUJ" order by col_n2 descending limit 30;
> ...
> select * from book where col_nm < 978 order by col_nm descending limit 30;
These queries are not easily supported on standard Cassandra. If you need this level of query complexity, consider DataStax Enterprise, Solr, or an RDBMS. 

> If we choose Materialized Views approach, we have to update all
> 20 Materialized View column family(s), for each base row update.
> Will the Cassandra write performance be acceptable?
Yes, depending on the size of the cluster and the machine spec. 

It's often a good idea to design CFs to match the workloads. If you have some data that changes faster than the rest, consider splitting it into different CFs.

> Should we just normalize the data, create base book table with book_id
> as primary key, and then
> build 20 index column family(s), use wide row column slicing approach,
> with index column data value as column name and book_id as value?
You do not need a different CF for each custom secondary index. Try putting the name of the index in the row key. 

> What will you recommend?
Take another look at the queries you *need* to support. Then build a small proof of concept to see if Cassandra will work for you. 

Hope that helps. 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com
