Posted to commits@cassandra.apache.org by "Piotr Kołaczkowski (JIRA)" <ji...@apache.org> on 2014/12/01 13:02:13 UTC

[jira] [Comment Edited] (CASSANDRA-7688) Add data sizing to a system table

    [ https://issues.apache.org/jira/browse/CASSANDRA-7688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229701#comment-14229701 ] 

Piotr Kołaczkowski edited comment on CASSANDRA-7688 at 12/1/14 12:01 PM:
-------------------------------------------------------------------------

It would also be nice to know the average partition size of a given table, both in bytes and in number of CQL rows. This would be useful for setting an appropriate fetch.size. Additionally, the current split generation API does not allow setting the split size in terms of data size in bytes or number of CQL rows, but only by number of partitions. The number of partitions doesn't make a good default, because partition sizes vary greatly and are extremely use-case dependent. So please, don't just copy the current describe_splits_ex functionality to the new driver, but *improve this*. 
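To illustrate the fetch.size point, here is a minimal sketch of how a client could derive a page size from average partition size, once such estimates are exposed. The function name and the estimate inputs are hypothetical; the comment is asking Cassandra to expose these numbers, not describing an existing API.

```python
# Hedged sketch: derive a fetch.size (page size in CQL rows) from table
# size estimates. Inputs are hypothetical estimates a system table could
# expose: mean partition size in bytes and in CQL rows.

def suggest_fetch_size(mean_partition_bytes, mean_partition_rows,
                       target_page_bytes=1_000_000,
                       min_rows=100, max_rows=100_000):
    """Pick a row count per page so each page is roughly target_page_bytes."""
    if mean_partition_bytes <= 0 or mean_partition_rows <= 0:
        return min_rows  # no estimate available; fall back to a small page
    bytes_per_row = mean_partition_bytes / mean_partition_rows
    rows = int(target_page_bytes / bytes_per_row)
    return max(min_rows, min(rows, max_rows))

# e.g. 10 MB partitions of 50,000 rows -> ~200 bytes/row -> 5,000-row pages
print(suggest_fetch_size(10_000_000, 50_000))
```

Without such estimates, clients are stuck guessing a fetch.size that may be far too large (huge rows) or far too small (tiny rows).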

We really don't need the driver / Cassandra to do the splitting for us. Instead we need to know:

1. estimate of total amount of data in the table in bytes
2. estimate of total number of CQL rows in the table
3. estimate of total number of partitions in the table

We're interested both in totals (whole cluster; logical sizes, i.e. without replicas) and in a breakdown by token range per node (physical sizes, including replicas).
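Given per-token-range estimates of the three quantities above, the splitting itself is trivial client-side. A minimal sketch, assuming a hypothetical record layout for the per-range estimates (none of these field names come from Cassandra):

```python
# Hedged sketch: client-side split generation by data size rather than
# partition count. Assumes per-token-range estimates (bytes, rows,
# partitions) are available; the record layout here is hypothetical.

from dataclasses import dataclass

@dataclass
class RangeEstimate:
    start: int       # token range start (exclusive)
    end: int         # token range end (inclusive)
    size_bytes: int  # estimated data size within the range
    rows: int        # estimated CQL rows within the range
    partitions: int  # estimated partitions within the range

def group_into_splits(estimates, target_split_bytes):
    """Greedily pack contiguous token ranges into splits of roughly
    target_split_bytes each; the last split may be smaller."""
    splits, current, size = [], [], 0
    for est in estimates:
        current.append(est)
        size += est.size_bytes
        if size >= target_split_bytes:
            splits.append(current)
            current, size = [], 0
    if current:
        splits.append(current)
    return splits
```

This is exactly why the raw estimates are more useful than a server-side splitting API: each consumer (Spark, Hadoop, an optimizer) can pack ranges by whatever measure it cares about.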

Note that this information is useful not just for Spark/Hadoop split generation, but also for things like the SparkSQL optimizer, so it knows how much data it will have to process.

The next step would be providing column data histograms to guide predicate selectivity.



> Add data sizing to a system table
> ---------------------------------
>
>                 Key: CASSANDRA-7688
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7688
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Jeremiah Jordan
>             Fix For: 2.1.3
>
>
> Currently you can't implement something similar to describe_splits_ex purely from a native protocol driver.  https://datastax-oss.atlassian.net/browse/JAVA-312 is open to make ownership information easily available to clients in the java-driver.  But you still need the data sizing part to get splits of a given size.  We should add the sizing information to a system table so that native clients can get to it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)