Posted to dev@carbondata.apache.org by Kumar Vishal <ku...@gmail.com> on 2018/06/04 15:10:26 UTC

[Discussion] Carbon Local Dictionary Support

Hi Community,

Currently CarbonData supports either a global dictionary or No-Dictionary
(plain text stored in LV format) for storing dimension column data.

*Bottlenecks with Global Dictionary*

   1. The dictionary file is mutable, so a global dictionary cannot be
      supported in storage environments that do not support append.
   2. It is difficult for the user to decide whether a column should be a
      dictionary column when the table has a large number of columns.
   3. Global dictionary generation generally slows down the load process.

*Bottlenecks with No-Dictionary*

   1. Storage size is high.
   2. Queries on No-Dictionary columns are slower because more data is read
      and processed.
   3. Filtering is slower on No-Dictionary columns because the number of
      comparisons is high.
   4. Memory footprint is high.

The above bottlenecks can be solved by *generating a local dictionary for low
cardinality columns at each blocklet level*, which will bring the following
benefits:

   1. Dictionary generation can be supported on different storage
      environments, irrespective of the file operations (such as append)
      they support.
   2. It removes the extra read/write IO on the dictionary files that are
      generated for a global dictionary.
   3. The user no longer has to identify the dictionary columns when a table
      has a large number of columns.
   4. It gives better compression on dimension columns with low cardinality.
   5. Filter queries on No-Dictionary columns with a local dictionary will be
      faster, because the filter is applied on encoded data.
   6. It reduces the store size and memory footprint, because only the unique
      values are stored in the local dictionary and the column data is
      stored in encoded form.
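
The storage benefit listed above can be illustrated with a small sketch. This is plain Python, not CarbonData code: the function names, the sample column and the 2-byte length prefix used for the plain (LV-style) size are assumptions made for illustration. The 4-byte code width follows the figure quoted elsewhere in this thread.

```python
def encode_with_local_dictionary(values):
    """Dictionary-encode one column chunk and return (dictionary, codes)."""
    dictionary = []   # position in this list is the integer code
    index = {}        # value -> code, used only while encoding
    codes = []
    for value in values:
        code = index.get(value)
        if code is None:
            code = len(dictionary)
            index[value] = code
            dictionary.append(value)
        codes.append(code)
    return dictionary, codes


def plain_size(values, length_prefix_bytes=2):
    """Bytes needed to store each value as length + UTF-8 bytes (assumed layout)."""
    return sum(length_prefix_bytes + len(v.encode("utf-8")) for v in values)


def encoded_size(dictionary, codes, code_bytes=4):
    """Bytes for the unique values stored once plus one fixed-width code per row."""
    return plain_size(dictionary) + code_bytes * len(codes)


# Hypothetical low-cardinality column: 6000 rows, 3 distinct values.
column = ["india", "china", "india", "usa", "china", "india"] * 1000
dictionary, codes = encode_with_local_dictionary(column)
decoded = [dictionary[c] for c in codes]

print(dictionary)                       # unique values, in first-seen order
print(plain_size(column))               # size without a dictionary
print(encoded_size(dictionary, codes))  # size with a local dictionary
```

In this toy case the encoded form is smaller because the three distinct strings are stored once and every row holds only a fixed-width integer. With short strings or high cardinality the saving shrinks or disappears, which is why the proposal targets low cardinality columns.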

Please share your comments. Suggestions from the community are most welcome.
Please let me know if anything needs clarification.

-Regards
Kumar Vishal

Re: [Discussion] Carbon Local Dictionary Support

Posted by xm_zzc <44...@qq.com>.
Hi:
  +1.
  This is an exciting feature; I hope to have it in version 1.5.



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [Discussion] Carbon Local Dictionary Support

Posted by akashrn5 <ak...@gmail.com>.
1. If the user gives an invalid value, the default threshold (1000 unique
values) will be used. What is the reasoning behind the default value of 1000?
*1000 is an arbitrary value that we mentioned in the design doc.
CARBON_LOCALDICT_THRESHOLD is exposed so that users can set the threshold
based on their use case.*

2. No option is mentioned for the user to alter the table once the
ENABLE_LOCAL_DICT and CARBON_LOCALDICT_THRESHOLD values are set. This would
also help with compatibility if we want to generate a local dictionary for
tables created in previous versions.
*For an old table, new loads will generate a local dictionary, because local
dictionary generation is enabled by default. An alter command for setting the
CARBON_LOCALDICT_THRESHOLD and ENABLE_LOCAL_DICT properties will be provided
for older tables, and this will be updated in the design doc. Thank you for
pointing this out.*

3. Validation should be provided if the user sets ENABLE_LOCAL_DICT to
false and tries to set a CARBON_LOCALDICT_THRESHOLD value.
*The threshold value will not be considered if ENABLE_LOCAL_DICT is false.*

4. The impact of alter table add/drop/change type of column is not mentioned.
*There is no impact, which is why it is not captured in the Impact Analysis
section of the design doc.*

5. Would complex types also be considered for local dictionary?
*It will be handled for the primitive no-dictionary String columns of
complex types.*

6. "For any column if dictionary values crosses the threshold
(carbon_localdict_threshold), then it will drop dictionary for that column."
I could not understand "drop dictionary for that column".
*The local dictionary will not be considered for the respective column.*

7. For better testability, information about the generation and updating of
the local dictionary can be logged.
*Logs will be added for each level of local dictionary generation.*
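
The answers to points 1 and 3 above describe how the two table properties interact. A minimal sketch of that behaviour follows. It is illustrative Python, not CarbonData code: the function name is hypothetical, and treating non-integer or non-positive values as "invalid" and any value other than "false" as enabled are assumptions, since the thread does not define them.

```python
DEFAULT_THRESHOLD = 1000  # default number of unique values stated in the thread


def resolve_local_dict_config(table_properties):
    """Return the effective local dictionary settings for a table.

    Assumed rules for this sketch:
    - local dictionary is enabled unless ENABLE_LOCAL_DICT is "false";
    - the threshold is ignored when local dictionary is disabled;
    - a missing, non-integer or non-positive threshold falls back to the default.
    """
    enabled_raw = str(table_properties.get("ENABLE_LOCAL_DICT", "true"))
    enabled = enabled_raw.strip().lower() != "false"
    if not enabled:
        return {"enabled": False, "threshold": None}

    raw = table_properties.get("CARBON_LOCALDICT_THRESHOLD")
    try:
        threshold = int(raw)
        if threshold <= 0:
            raise ValueError("threshold must be positive")
    except (TypeError, ValueError):
        threshold = DEFAULT_THRESHOLD
    return {"enabled": True, "threshold": threshold}


print(resolve_local_dict_config({"CARBON_LOCALDICT_THRESHOLD": "abc"}))
print(resolve_local_dict_config({"ENABLE_LOCAL_DICT": "false",
                                 "CARBON_LOCALDICT_THRESHOLD": "500"}))
```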





Re: [Discussion] Carbon Local Dictionary Support

Posted by xuchuanyin <xu...@hust.edu.cn>.
Hi, kumarvishal:
  As the local dictionary feature will be released in 1.4.1, is there any
difference between the implementation and the previous design document?
  I'm trying to understand the implementation of local dictionary. If there
is any difference, please update the document in JIRA.
  Thanks.





Re: [Discussion] Carbon Local Dictionary Support

Posted by chetdb <ch...@gmail.com>.
Dear Vishal,

Please find my queries and comments on the design doc.

1. If the user gives an invalid value, the default threshold (1000 unique
values) will be used. What is the reasoning behind the default value of 1000?
2. No option is mentioned for the user to alter the table once the
ENABLE_LOCAL_DICT and CARBON_LOCALDICT_THRESHOLD values are set. This would
also help with compatibility if we want to generate a local dictionary for
tables created in previous Carbon versions.
3. Validation should be provided if the user sets ENABLE_LOCAL_DICT to
false and tries to set a CARBON_LOCALDICT_THRESHOLD value.
4. The impact of alter table add/drop/change type of column is not mentioned.
5. Would complex types also be considered for local dictionary?
6. "For any column if dictionary values crosses the threshold
(carbon_localdict_threshold), then it will drop dictionary for that column."
I could not understand "drop dictionary for that column".
7. For better testability, information about the generation and updating of
the local dictionary can be logged.

Regards

Chetan





Re: [Discussion] Carbon Local Dictionary Support

Posted by xm_zzc <44...@qq.com>.
Hi kumarvishal09:
  Will this feature be supported on streaming tables too?




Re: [Discussion] Carbon Local Dictionary Support

Posted by akashrn5 <ak...@gmail.com>.
Hi Bhavya,

Local dictionary generation is at task level. If the threshold is breached
during an ongoing load, the local dictionary will not be generated for the
corresponding column in that load. There is no dependency on previous loads;
a new local dictionary is generated for each load.
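
The per-load behaviour described above can be sketched as follows. This is illustrative Python, not CarbonData code: the function names and the returned structure are hypothetical, and reading "breached" as "more distinct values than the threshold" is an assumption.

```python
def build_local_dictionary(values, threshold):
    """Return (dictionary, codes), or None once distinct values exceed threshold."""
    index = {}   # value -> code, in first-seen order
    codes = []
    for value in values:
        code = index.get(value)
        if code is None:
            if len(index) >= threshold:
                # Adding this value would exceed the threshold, so this load
                # gets no local dictionary for the column.
                return None
            code = len(index)
            index[value] = code
        codes.append(code)
    return list(index), codes


def run_load(column_values, threshold):
    """Encode one load on its own; nothing is carried over from earlier loads."""
    result = build_local_dictionary(column_values, threshold)
    if result is None:
        return {"encoding": "no_dictionary", "data": list(column_values)}
    dictionary, codes = result
    return {"encoding": "local_dictionary", "dictionary": dictionary, "data": codes}


load1 = run_load(["a", "b", "a"], threshold=3)       # stays under the threshold
load2 = run_load(["a", "b", "c", "d"], threshold=3)  # breaches the threshold
load3 = run_load(["x", "x"], threshold=3)            # later load, fresh dictionary

for load in (load1, load2, load3):
    print(load["encoding"], load["data"])
```

The point of the sketch is that `load2` falling back to plain storage changes nothing for `load1`, and `load3` starts again with its own dictionary.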

Regards,
Akash r Nilugal




Re: [Discussion] Carbon Local Dictionary Support

Posted by Bhavya Aggarwal <bh...@knoldus.com>.
Hi Vishal,

Thanks for sharing the design. I have one question about deciding whether
to generate the dictionary. If the cardinality is below the threshold in the
first few loads, we will create a local dictionary. But if the threshold is
breached in subsequent loads, what will happen to the data of the previous
loads?

Regards
Bhavya

On Thu, Jun 7, 2018 at 5:28 PM, xuchuanyin <xu...@hust.edu.cn> wrote:

> About query filtering
>
> 1. “during filter, actual filter values will be generated using column
> local dictionary values...then filter will be applied on the dictionary
> encode data”
> ---
> If the filter is not 'equal' but 'like' or 'greater than', can it also run
> on the encoded data?
>
> 2. "As dictionary data will be always of 4 bytes "
> ---
> Why is it 4 bytes?
>
>
>

Re: [Discussion] Carbon Local Dictionary Support

Posted by akashrn5 <ak...@gmail.com>.
Hi xuchuanyin,

Please find my comments inline.

About query filtering

1. “during filter, actual filter values will be generated using column local
dictionary values...then filter will be applied on the dictionary encode
data”
---
If the filter is not 'equal' but 'like' or 'greater than', can it also run on
the encoded data?

*Range filters will be handled in the same way as they are for global
dictionary columns.*

2. "As dictionary data will be always of 4 bytes "
---
Why is it 4 bytes?

*A dictionary value is simply the integer assigned to a dictionary key, so
it is 4 bytes.*
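
One general way to run such filters on dictionary-encoded data is sketched below. It is an illustration of the technique only, not a description of how CarbonData handles global or local dictionary columns: the function names and sample data are hypothetical. The predicate is evaluated once per distinct dictionary value, and the row scan then compares integers only. The last lines show that a dictionary code packed as a 32-bit integer occupies 4 bytes.

```python
import struct


def matching_codes(dictionary, predicate):
    """Evaluate the predicate once per distinct value; return the matching codes."""
    return {code for code, value in enumerate(dictionary) if predicate(value)}


def filter_rows(codes, wanted):
    """Scan the encoded column and keep row ids whose code is in `wanted`."""
    return [row for row, code in enumerate(codes) if code in wanted]


# Hypothetical local dictionary and encoded column for one blocklet.
dictionary = ["apple", "banana", "cherry"]
codes = [0, 2, 1, 0, 2]

equal_rows = filter_rows(codes, matching_codes(dictionary, lambda v: v == "cherry"))
range_rows = filter_rows(codes, matching_codes(dictionary, lambda v: v > "apple"))
like_rows = filter_rows(codes, matching_codes(dictionary, lambda v: v.startswith("b")))

print(equal_rows, range_rows, like_rows)

# A dictionary code stored as a signed 32-bit integer takes 4 bytes.
packed = struct.pack(">i", codes[1])
print(len(packed))
```

The string comparisons happen three times (once per dictionary entry) instead of once per row, which is where the filtering speed-up on low cardinality columns comes from.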






Re: [Discussion] Carbon Local Dictionary Support

Posted by xuchuanyin <xu...@hust.edu.cn>.
About query filtering

1. “during filter, actual filter values will be generated using column local
dictionary values...then filter will be applied on the dictionary encode
data”
---
If the filter is not 'equal' but 'like' or 'greater than', can it also run on
the encoded data?

2. "As dictionary data will be always of 4 bytes "
---
Why is it 4 bytes?




Re: [Discussion] Carbon Local Dictionary Support

Posted by manish gupta <to...@gmail.com>.
Hi Vishal,

Thanks for uploading the design document. The document is good and gives a
detailed picture of the requirement.

I have a few questions and suggestions. Kindly consider them if applicable.

1. Will the local dictionary be read once and kept in offheap/onheap
memory, or will it be read for every query?

2. Will the columnCardinality integer array, in the block footer or in any
other metadata, now contain the actual cardinality for no-dictionary columns?
If not, we could store it, as it is a statistic that can help in deciding
pushdown for like queries on no-dictionary columns.

3. Apart from the default threshold, we could also define a maximum
threshold for the local dictionary (let's say 1 lakh, i.e. 100,000). If the
user configures a value greater than the maximum allowed threshold, we can
use the maximum and continue.

Regards
Manish Gupta

On Wed, Jun 6, 2018 at 6:54 PM, Kumar Vishal <ku...@gmail.com>
wrote:

> Hi Xuchuanyin,
>
> Please find the JIRA link for local dictionary support.
>
> https://issues.apache.org/jira/browse/CARBONDATA-2584
>
> -Regards
> Kumar Vishal

Re: [Discussion] Carbon Local Dictionary Support

Posted by Kumar Vishal <ku...@gmail.com>.
Hi Xuchuanyin,

Please find the JIRA link for local dictionary support.

https://issues.apache.org/jira/browse/CARBONDATA-2584

-Regards
Kumar Vishal

On Wed, Jun 6, 2018 at 6:25 PM, xuchuanyin <xu...@hust.edu.cn> wrote:

> Hi, Kumar:
>   Can you raise a Jira and provide the document as an attachment? I cannot
> open the links because they are blocked.

Re: [Discussion] Carbon Local Dictionary Support

Posted by xuchuanyin <xu...@hust.edu.cn>.
Hi, Kumar:
  Can you raise a Jira and provide the document as an attachment? I cannot open the links because they are blocked.

Re: [Discussion] Carbon Local Dictionary Support

Posted by Kumar Vishal <ku...@gmail.com>.
Hi All,

Please ignore the above link.

Please comment here:
https://docs.google.com/document/d/1y0dJSWOr0ZTPpbNOOUfVfU5SoANL5B1F0l7jhl8BgUs/edit?usp=sharing

-Regards
Kumar Vishal

On Wed, Jun 6, 2018 at 3:06 PM, Kumar Vishal <ku...@gmail.com>
wrote:

> Hi All,
>
> Due to some problem, the above link is not working. Please find the
> updated link.
>
> https://drive.google.com/file/d/10LqtQlrE4jeotmleoMLJ8F91rK2TrN2h/view?usp=sharing
>
> -Regards
> Kumar Vishal

Re: [Discussion] Carbon Local Dictionary Support

Posted by Kumar Vishal <ku...@gmail.com>.
Hi All,

Due to some problem, the above link is not working. Please find the updated link.

https://drive.google.com/file/d/10LqtQlrE4jeotmleoMLJ8F91rK2TrN2h/view?usp=sharing

-Regards
Kumar Vishal

On Wed, Jun 6, 2018 at 2:40 PM, Kumar Vishal <ku...@gmail.com>
wrote:

> Hi All,
>
> Please find the link for design doc.
>
> https://drive.google.com/file/d/1eqfIms2tMi3b63nMbKfGRZYmo7TMyE1_/view?usp=sharing
>
> -Regards
> Kumar Vishal

Re: [Discussion] Carbon Local Dictionary Support

Posted by Kumar Vishal <ku...@gmail.com>.
Hi All,

Please find the link for design doc.

https://drive.google.com/file/d/1eqfIms2tMi3b63nMbKfGRZYmo7TMyE1_/view?usp=sharing

-Regards
Kumar Vishal

On Wed, Jun 6, 2018 at 2:25 PM, Kumar Vishal <ku...@gmail.com>
wrote:

> Hi Community,
>
> Please find the attached Local Dictionary support design document. Please
> let me know if the design document needs any further clarification.
> Any further inputs or improvements are most welcome.
>
>
>
> -Regards
> Kumar Vishal

Re: [Discussion] Carbon Local Dictionary Support

Posted by Kumar Vishal <ku...@gmail.com>.
Hi Community,

Please find the attached Local Dictionary support design document. Please
let me know if the design document needs any further clarification.
Any further inputs or improvements are most welcome.



-Regards
Kumar Vishal

On Tue, Jun 5, 2018 at 6:14 PM, Jacky Li <ja...@qq.com> wrote:

> +1
> Good feature to add in CarbonData
>
> Regards,
> Jacky

Re: [Discussion] Carbon Local Dictionary Support

Posted by Jacky Li <ja...@qq.com>.
+1 
Good feature to add in CarbonData

Regards,
Jacky






Re: [Discussion] Carbon Local Dictionary Support

Posted by Ravindra Pesala <ra...@gmail.com>.
Hi Vishal,

+1

Thank you for starting a discussion on it. It will be a very helpful
feature that improves query performance and reduces the memory footprint.
Please add the design document for the same.

Regards,
Ravindra.


Re: [Discussion] Carbon Local Dictionary Support

Posted by manish gupta <to...@gmail.com>.
+1

It is a good feature to have. Once the design document is uploaded we will
get a better idea of how it will be implemented.

Regards
Manish Gupta

On Tue, Jun 5, 2018 at 11:18 AM, Kumar Vishal <ku...@gmail.com>
wrote:

> Hi Xuchuanyin,
>
> I am working on the design document and have already captured all the
> points you mentioned. I will share it once it is finished.
>
> -Regards
> Kumar Vishal

Re: [Discussion] Carbon Local Dictionary Support

Posted by Kumar Vishal <ku...@gmail.com>.
Hi Xuchuanyin,

I am working on the design document and have already captured all the
points you mentioned. I will share it once it is finished.

-Regards
Kumar Vishal

On Tue, Jun 5, 2018 at 9:22 AM, xuchuanyin <xu...@hust.edu.cn> wrote:

> Hi, Kumar:
>   A local dictionary will be a nice feature, and other formats such as
> Parquet all support it.
>
>   My concern is how you will implement this feature:
>
>   1. What is the scope of `local`? Page level (for all rows in the page),
> blocklet level (for all pages in the blocklet), or block level (for all
> blocklets in the block)?
>
>   2. Where will you store the local dictionary?
>
>   3. How do you decide whether to enable the local dictionary for a column?
>
>   4. Have you considered falling back to plain encoding if the local
> dictionary encoding consumes more space?
>
>   5. Will you keep working on the V3 format or start a new V4 (or V3.1)
> version?
>
>   Also, I am concerned about data loading performance. Please pay
> attention to it while implementing this feature.
>

Re: [Discussion] Carbon Local Dictionary Support

Posted by xuchuanyin <xu...@hust.edu.cn>.
Hi, Kumar:
  A local dictionary will be a nice feature, and other formats such as
Parquet all support it.

  My concern is how you will implement this feature:

  1. What is the scope of `local`? Page level (for all rows in the page),
blocklet level (for all pages in the blocklet), or block level (for all
blocklets in the block)?

  2. Where will you store the local dictionary?

  3. How do you decide whether to enable the local dictionary for a column?

  4. Have you considered falling back to plain encoding if the local
dictionary encoding consumes more space?

  5. Will you keep working on the V3 format or start a new V4 (or V3.1)
version?

  Also, I am concerned about data loading performance. Please pay
attention to it while implementing this feature.


