Posted to user@cassandra.apache.org by Chandra Sekar KR <ch...@hotmail.com> on 2016/02/19 05:20:55 UTC

Compatibility, performance & portability of Cassandra data types (MAP, UDT & JSON) in DSE Search & Analytics

Hi,


I'm looking for help weighing the pros and cons of the MAP, UDT and JSON (stored as TEXT) data types in Cassandra, and their ease of use and impact across the other DSE products, Spark and Solr. We are migrating an OLTP database from an RDBMS to Cassandra; the source table has 200+ columns and an average volume of 25 million records/day. The access pattern is simple: in OLTP, access is always by primary key. For OLAP, there are other access patterns based on combinations of columns, for which we plan to use Spark and Solr for search and analytics (in a separate DC).


The average size of each record is ~2KB and the application workload is INSERT-only (no updates or deletes). We ran performance tests on two data models:

1) A table with 200+ columns similar to RDBMS

2) A table with 15 columns where only the critical business fields are maintained as key/value pairs and the remaining fields are stored in a single column of type TEXT as a JSON object.
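
The split described in model 2 can be sketched in Python as follows. This is only an illustration under assumptions: the thread does not name the critical business fields, so the field names and the `payload` column name below are hypothetical, and the real model keeps about 15 dedicated columns rather than four.

```python
import json

# Hypothetical names for the "critical business fields"; the thread does not list them.
CRITICAL_FIELDS = ("txn_id", "account_id", "txn_ts", "amount")


def to_json_model(record: dict) -> dict:
    """Split a wide record into dedicated columns plus one JSON text column."""
    # Critical fields keep their own columns.
    row = {name: record[name] for name in CRITICAL_FIELDS}
    # Everything else is folded into a single text value.
    rest = {k: v for k, v in record.items() if k not in CRITICAL_FIELDS}
    # sort_keys keeps the serialized text stable for identical input.
    row["payload"] = json.dumps(rest, sort_keys=True)
    return row


# A tiny stand-in for a 200+ column record.
wide = {
    "txn_id": 1,
    "account_id": "A-42",
    "txn_ts": "2016-02-19T05:20:55Z",
    "amount": 10.5,
    "channel": "web",
    "branch": "BLR",
}
row = to_json_model(wide)
print(sorted(row))
print(row["payload"])
```

Each INSERT then carries a handful of column values plus one text value instead of 200+ separate values.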


In the results, we noticed a significant advantage for the JSON model, which performed 5X better than the columnar data model. We are also evaluating the performance of other data types, MAP and UDT, as alternatives to TEXT for storing the JSON object. Sample data model structures for the columnar, JSON, MAP and UDT types are given below:


[Attached image: sample data model structures for the columnar, JSON, MAP and UDT types]


I would like to know the performance, transformation, compatibility and portability impacts, and the ease of use, of each of these data types from a Search & Analytics perspective (Spark & Solr). I'm aware that we will have to use field transformers in Solr to index JSON fields, but I'm not sure about MAP and UDT. Any help comparing these data types in Spark and Solr is highly appreciated.
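
To make the transformation cost concrete: with the JSON-as-TEXT model, any attribute used for search or analytics first has to be parsed out of the text column. The sketch below shows that step in plain Python only; it is not the Solr field transformer API or the Spark API, and the `payload` column name and attribute names are hypothetical.

```python
import json


def project(row: dict, wanted: tuple) -> dict:
    """Return the key columns plus selected attributes parsed from the JSON text column."""
    payload = json.loads(row["payload"])
    # Keep every dedicated column except the raw JSON text.
    out = {k: v for k, v in row.items() if k != "payload"}
    # Missing attributes become None so every projected row has the same shape.
    out.update({name: payload.get(name) for name in wanted})
    return out


row = {"txn_id": 1, "payload": '{"branch": "BLR", "channel": "web"}'}
flat = project(row, ("channel", "region"))
print(flat)
```

The parse runs once per row on every read path that needs those attributes, which is the overhead to weigh against the faster writes.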


Regards, KR

Re: Compatibility, performance & portability of Cassandra data types (MAP, UDT & JSON) in DSE Search & Analytics

Posted by ch...@wipro.com.

Please find below the graph plotted from the cassandra-stress test output log. While the columnar data took 36 minutes to insert 20 million records, the JSON format data was loaded in under 10 minutes. The tests were carried out on a bare-metal 4-node cluster, with 16-core CPUs and 120GB of memory (8GB heap), backed by SSDs.

[Attached image: graph plotted from the cassandra-stress test output log]
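
As a rough check on the figures above, the load times can be converted into cluster-wide insert rates. This is back-of-the-envelope arithmetic only: it assumes the rate was constant over each run and treats "under 10 minutes" as exactly 10, so the JSON numbers are lower bounds.

```python
RECORDS = 20_000_000          # records inserted in each test
COLUMNAR_MINUTES = 36         # reported load time, columnar model
JSON_MINUTES = 10             # upper bound: reported as "under 10 mins"
DAILY_RECORDS = 25_000_000    # expected production volume per day


def rate_per_second(records: int, minutes: float) -> float:
    """Average inserts per second, assuming a constant rate over the run."""
    return records / (minutes * 60)


columnar_rate = rate_per_second(RECORDS, COLUMNAR_MINUTES)
json_rate_min = rate_per_second(RECORDS, JSON_MINUTES)
speedup_min = COLUMNAR_MINUTES / JSON_MINUTES
daily_avg = DAILY_RECORDS / 86_400  # seconds in a day

print(f"columnar: ~{columnar_rate:,.0f} inserts/s")
print(f"JSON:     at least ~{json_rate_min:,.0f} inserts/s")
print(f"speedup:  at least {speedup_min:.1f}x")
print(f"production average: ~{daily_avg:,.0f} inserts/s")
```

On these numbers the speedup is at least 3.6x; a full 5x would correspond to the JSON load finishing in about 7.2 minutes.
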
Regards, Chandra Sekar KR
________________________________
From: daemeon reiydelle <da...@gmail.com>
Sent: Friday, February 19, 2016 12:57
To: user@cassandra.apache.org
Subject: Re: Compatibility, performance & portability of Cassandra data types (MAP, UDT & JSON) in DSE Search & Analytics

Given you only have 16 columns vs. over 200 ... I would expect a substantial improvement in writes, but not 5x.
Ditto reads. I would be interested to understand where that 5x comes from.

.......

Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872



Re: Compatibility, performance & portability of Cassandra data types (MAP, UDT & JSON) in DSE Search & Analytics

Posted by daemeon reiydelle <da...@gmail.com>.

Given you only have 16 columns vs. over 200 ... I would expect a
substantial improvement in writes, but not 5x.
Ditto reads. I would be interested to understand where that 5x comes from.


*.......*



Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872
