Posted to user@carbondata.apache.org by Swapnil Shinde <sw...@gmail.com> on 2017/07/19 07:23:24 UTC

carbon data performance doubts

Hello All
     I am trying carbon data for the first time and have a few questions on
improving performance -

1. What is the use of the *carbon.number.of.cores* property and how is it
different from spark's executor cores?

2. Documentation says that, by default, all non-numeric columns (except
complex types) become dimensions and numeric columns become measures. How
are dimension and measure columns handled differently? What are the pros
and cons of keeping a column as a dimension vs a measure?

3. What is the best way when we have an ID INT column which will be used
heavily for filtering/aggregations/joins but can't be a dimension by
default? Documentation says to include this kind of numeric column with
"dictionary_include" or "dictionary_exclude" in the table definition so
that the column will be considered a dimension. It is not supported to keep
non-string data types as "dictionary_exclude" (link
<https://github.com/apache/carbondata/blob/6488bc018a2ec715b31407d12290680d388a43b3/integration/spark-common/src/main/scala/org/apache/spark/sql/catalyst/CarbonDDLSqlParser.scala#L690>).
Do we then have to enable dictionary encoding for ID INT columns even when
it is not beneficial to encode them?

4. How does the MDK get generated and how can we alter it? Is there any
API to find out the MDK for a given table?


        It would be good to understand the above concepts in detail so we
can use carbon data effectively.


Thanks
Swapnil

Re: carbon data performance doubts

Posted by Kumar Vishal <ku...@gmail.com>.
Hi Swapnil,

Currently it is not supported, because numeric values have to be written
based on their data type in the carbondata file, and the same needs to be
handled in filters. It is a pending requirement in carbondata.

If you are interested, please have a look, and let the community know if
any support is required.

-Regards
Kumar Vishal


Re: carbon data performance doubts

Posted by 马云 <si...@163.com>.
Sure.



thanks
Jack




Re: carbon data performance doubts

Posted by Liang Chen <ch...@gmail.com>.
Hi simafengyun

Can you write an example to introduce how to use sort_columns and update
the documents also? Thanks.

Regards
Liang




Re: carbon data performance doubts

Posted by 马云 <si...@163.com>.
Good Suggestion!
Currently you can refer to the below code for sort_columns use cases.
https://github.com/apache/carbondata/blob/master/integration/spark-common-test/src/test/scala/org/apache/carbondata/spark/testsuite/sortcolumns/TestSortColumns.scala
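
To make that concrete, here is a minimal sketch of such a table definition
(the table and column names are hypothetical, and it assumes a
CarbonSession named carbon, as in the carbon examples):

// Hypothetical sketch: heavily filtered columns go into SORT_COLUMNS so
// carbon sorts and indexes the data on them.
carbon.sql(
  """CREATE TABLE IF NOT EXISTS t_sorted (
    |  empno INT,
    |  empname STRING,
    |  salary DOUBLE)
    |STORED BY 'carbondata'
    |TBLPROPERTIES ('SORT_COLUMNS' = 'empno,empname')""".stripMargin)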



thanks
Jack





Re: carbon data performance doubts

Posted by Swapnil Shinde <sw...@gmail.com>.
Thank you, Liang. I couldn't find the "sort_columns" property in the
documentation. It would be good to have it there.

-
Swapnil


Re: carbon data performance doubts

Posted by Liang Chen <ch...@gmail.com>.
Hi 

Some more info :
In release 1.1.1 there was a good improvement, "measure filter
optimization": the system will use the minmax index to evaluate filters on
measure columns.

So, for an INT column to filter well there are two ways: you can add the
INT column to sort_columns, or the system will automatically use the INT
column's minmax index to get good filtering.
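
As an aside, the idea behind minmax pruning can be sketched like this
(illustrative scala only, not carbondata's actual code):

// Each block keeps min/max statistics per column; a filter like col = v
// can skip any block whose [min, max] range cannot contain v.
case class BlockMinMax(min: Int, max: Int)

def canSkipBlock(stats: BlockMinMax, filterValue: Int): Boolean =
  filterValue < stats.min || filterValue > stats.max

// canSkipBlock(BlockMinMax(100, 200), 500) == true, so that block is pruned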

Regards 
Liang





Re: carbon data performance doubts

Posted by Liang Chen <ch...@apache.org>.
Hi Swapnil

Actually, the current system's behavior is: index and dictionary encoding
are decoupled; there is no relationship between them.

1. If you want some columns to filter well, just add those columns to
sort_columns (like tblproperties('sort_columns'='empno')) to build a good
MDK index for them; so for filtering, just add the INT column to the
sort_columns list.

2. If you want some columns to aggregate well for group by, just
dictionary-encode those columns. By default an INT column is not
dictionary-encoded, so there is no need to add "DICTIONARY_EXCLUDE"; if the
INT column has low cardinality and you also want good aggregation on it,
use "DICTIONARY_INCLUDE" for that column.

So, in a word: an INT column with high cardinality doesn't have a
DICTIONARY_EXCLUDE scenario :)
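
A small sketch combining both points (hypothetical table and column names;
assuming a CarbonSession named carbon):

// empno filters well via sort_columns; deptno is low cardinality and
// aggregates well via dictionary encoding.
carbon.sql(
  """CREATE TABLE emp (
    |  empno INT,
    |  deptno INT,
    |  salary DOUBLE)
    |STORED BY 'carbondata'
    |TBLPROPERTIES ('SORT_COLUMNS' = 'empno',
    |               'DICTIONARY_INCLUDE' = 'deptno')""".stripMargin)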

HTH.

Regards
Liang



Re: carbon data performance doubts

Posted by Swapnil Shinde <sw...@gmail.com>.
Thank you Jacky! The encoding property above makes sense. How would you
handle an INT column with high cardinality? As per my understanding, such a
column will be considered a measure, and the only way to make it a
dimension is to specify "dictionary_include" for it.
Any reason why a column being a dimension or measure is tied to dictionary
encoding? Does it make sense to have a column as a dimension with no
encoding, so that indexes can be used for filters?

Thanks
Swapnil


>

Re: carbon data performance doubts

Posted by Jacky Li <ja...@qq.com>.
Hi Swapnil,

Dictionary is beneficial for aggregation queries (carbon will leverage late decode optimization in the sql optimizer), so you can use it for columns on which you frequently do group by. While it can improve query performance, it also requires more memory and CPU while loading. Normally, you should consider using dictionary only on low cardinality columns.

In the current apache master branch (and all releases before 1.2), carbon data’s default encoding strategy favors query performance over loading performance. By default, all string data types are encoded as dictionary. But this sometimes creates problems; for example, if there is a high cardinality column in the table, loading may fail due to insufficient memory in the JVM. To avoid this, we have added the DICTIONARY_EXCLUDE option so that the user can disable this default behavior manually. So, the DICTIONARY_EXCLUDE property is designed for String columns only.

And, if you have a low cardinality integer column (like some ID field), you can choose to encode it as dictionary by specifying DICTIONARY_INCLUDE, so group by on this integer column will be faster.
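
For example, a sketch of that case (hypothetical names; assuming a
CarbonSession named carbon):

// Dictionary-encode a low-cardinality INT column, then group by it; late
// decode works on the dictionary values and decodes only the final result.
carbon.sql(
  """CREATE TABLE sales (
    |  city_id INT,
    |  amount DOUBLE)
    |STORED BY 'carbondata'
    |TBLPROPERTIES ('DICTIONARY_INCLUDE' = 'city_id')""".stripMargin)

carbon.sql("SELECT city_id, sum(amount) FROM sales GROUP BY city_id").show()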

All of this is the current behavior, and there has been discussion about changing it and giving more control to the user in the coming release (1.2).
The newly proposed target behavior is:
1. There will be a default encoding strategy for each data type. If the user does not specify any encoding-related property in CREATE TABLE, carbon will use the default encoding strategy for each column.
2. And there will be an ENCODING property through which the user can override the system default strategy. For example, the user can create a table by:

CREATE TABLE t1 (city_name STRING, city_id INT, population INT, area DOUBLE)
TBLPROPERTIES (‘ENCODING’ = ‘city_name: dictionary, city_id: {dictionary, RLE}, population: delta’)

This SQL means city_name is encoded using dictionary; city_id is encoded using dictionary and then RLE encoding is applied (on the numeric values); population is encoded using delta encoding; and area is encoded using the system default encoding for the double data type.

This change is still in progress (CARBONDATA-1014, https://issues.apache.org/jira/browse/CARBONDATA-1014), on the apache/encoding_override branch. Once it is done and stable it will be merged into master.

Please advise if you have any suggestions.

Regards,
Jacky




Re: carbon data performance doubts

Posted by Swapnil Shinde <sw...@gmail.com>.
Ok. Just curious - any reason not to support numeric columns with
dictionary_exclude? Wouldn't it be useful for a unique numeric column that
should be a dimension but should avoid creating a dictionary (as it may
not be beneficial)?

Thanks
Swapnil



Re: carbon data performance doubts

Posted by manishgupta88 <to...@gmail.com>.
No. DICTIONARY_EXCLUDE is supported only for String data type columns.
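
So the only supported usage looks like this sketch (hypothetical names;
assuming a CarbonSession named carbon):

// DICTIONARY_EXCLUDE keeps a high-cardinality STRING column as a
// no-dictionary dimension, avoiding the dictionary build cost.
carbon.sql(
  """CREATE TABLE users (
    |  user_name STRING,
    |  age INT)
    |STORED BY 'carbondata'
    |TBLPROPERTIES ('DICTIONARY_EXCLUDE' = 'user_name')""".stripMargin)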

Regards
Manish Gupta




Re: carbon data performance doubts

Posted by Swapnil Shinde <sw...@gmail.com>.
Thank you, Manish.
Is dictionary exclude supported for datatypes other than String?
https://github.com/apache/carbondata/blob/6488bc018a2ec715b31407d12290680d388a43b3/integration/spark-common/src/main/scala/org/apache/spark/sql/catalyst/CarbonDDLSqlParser.scala#L706

-
Swapnil


Re: carbon data performance doubts

Posted by manishgupta88 <to...@gmail.com>.
Hi Swapnil

Please find my answers inline.

1. What is the use of the *carbon.number.of.cores* property and how is it
different from spark's executor cores?

- carbon.number.of.cores is used for reading the footer and header of the
carbondata file during query execution. Spark executor cores is a spark
property, controlled by spark for parallelizing tasks. After task
distribution, each task will open the number of threads specified by
carbon.number.of.cores in parallel to read the carbondata file footer and
header; this part is managed by carbon code.
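
If it helps, the property can also be set programmatically through
CarbonProperties (the value 4 below is just an example; it can equally be
set in the carbon.properties file):

import org.apache.carbondata.core.util.CarbonProperties

// Let each task use 4 threads to read carbondata file headers and
// footers during query execution.
CarbonProperties.getInstance()
  .addProperty("carbon.number.of.cores", "4")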

2. Documentation says that, by default, all non-numeric columns (except
complex types) become dimensions and numeric columns become measures. How
are dimension and measure columns handled differently? What are the pros
and cons of keeping a column as a dimension vs a measure?

- Dimensions by default take part in sorting the complete data from left
to right, and because it is a columnar storage, each dimension is further
sorted individually. Measures, on the other hand, neither take part in
sorting the data nor are they individually sorted.
- Because dimensions are sorted, filter queries on them can get faster
results by performing a binary search.

3. What is the best way when we have an ID INT column which will be used
heavily for filtering/aggregations/joins but can't be a dimension by
default? Documentation says to include this kind of numeric column with
"dictionary_include" or "dictionary_exclude" in the table definition so
that the column will be considered a dimension. It is not supported to keep
non-string data types as "dictionary_exclude" (link
<https://github.com/apache/carbondata/blob/6488bc018a2ec715b31407d12290680d388a43b3/integration/spark-common/src/main/scala/org/apache/spark/sql/catalyst/CarbonDDLSqlParser.scala#L690>).
Do we then have to enable dictionary encoding for ID INT columns even when
it is not beneficial to encode them?

-- In the current system the best way is to include the INT column as
dictionary include if the cardinality of the column is low, or dictionary
exclude if the cardinality is high. Measure filter optimization has
already been implemented in branch 1.1
(https://github.com/apache/carbondata/commits/branch-1.1) and will be
available in a coming release (1.2 or 1.3).
For your reference you can go through PR-1124
(https://github.com/apache/carbondata/pull/1124)

4. How does the MDK get generated and how can we alter it? Is there any
API to find out the MDK for a given table?

-- Only dictionary include columns take part in generation of the MDKey.
The MDKey is generated based on the cardinality of the columns. It is one
of the data compression techniques used to reduce storage space in
carbondata storage.
Computation example:
Number of bytes for each integer value - 4
Total number of rows - 100000
Total number of bytes - 100000*4
Cardinality of the column (total number of unique values of a column) - 5
As the cardinality is only 5 and we store only the unique values for a
dictionary column, the 5 unique values require only 3 bits of storage. But
the minimum storage unit we take is a byte, so we can use 1 byte here to
store the 5 unique values. So we have reduced the space from 4 bytes to 1
byte for each primitive integer value. This is the concept of the MDKey.
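
The arithmetic can be sketched like this (illustrative scala, not
carbon's actual code):

// Bits needed for `cardinality` distinct dictionary values (0-based
// keys), rounded up to whole bytes, since a byte is the minimum
// storage unit.
def bitsNeeded(cardinality: Int): Int =
  math.ceil(math.log(cardinality.toDouble) / math.log(2.0)).toInt

def bytesNeeded(cardinality: Int): Int =
  math.max(1, math.ceil(bitsNeeded(cardinality) / 8.0).toInt)

// bitsNeeded(5) == 3 bits, bytesNeeded(5) == 1 byte:
// each 4-byte INT value is stored in 1 byte.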

- You cannot alter the MDKey after table creation. The MDKey will be
created in the order in which you specified the dictionary columns during
table creation.

- For the MDKey generation logic you can check the class
MultiDimKeyVarLengthGenerator

Regards
Manish Gupta


