
Posted to issues@carbondata.apache.org by "kumar vishal (JIRA)" <ji...@apache.org> on 2018/07/26 09:55:00 UTC

[jira] [Resolved] (CARBONDATA-2584) CarbonData Local Dictionary Support

     [ https://issues.apache.org/jira/browse/CARBONDATA-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

kumar vishal resolved CARBONDATA-2584.
--------------------------------------
    Resolution: Fixed
      Assignee: kumar vishal

> CarbonData Local Dictionary Support
> -----------------------------------
>
>                 Key: CARBONDATA-2584
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-2584
>             Project: CarbonData
>          Issue Type: New Feature
>            Reporter: kumar vishal
>            Assignee: kumar vishal
>            Priority: Major
>         Attachments: CarbonData Local Dictionary Support Design Doc.docx
>
>
> Currently, CarbonData supports two ways of storing dimension column data: global dictionary or No-Dictionary (plain text stored in LV, i.e. length-value, format).
> *Bottlenecks with Global Dictionary*
> It is difficult for the user to decide whether a column should be a dictionary column when the table has many columns.
> Global dictionary generation generally slows down the load process.
> Multiple IO operations are performed during load even when the dictionary already exists.
> During a query, multiple IO operations are needed to read both the dictionary files and the carbondata files.
> *Bottlenecks with No-Dictionary*
> Storage size is high because the data is stored in LV format.
> Queries on No-Dictionary columns are slower because more data is read and processed.
> Filtering is slower on No-Dictionary columns because the number of comparisons is high.
> Memory footprint is high.
> *The above bottlenecks can be solved by generating a dictionary for low-cardinality columns at the blocklet level, which provides the following benefits:*
> It avoids the extra read/write IO operations on the dictionary files that a global dictionary generates.
> It removes the need for the user to identify dictionary columns when a table has many columns.
> It achieves better compression on low-cardinality dimension columns.
> Filter queries and full-scan queries on No-Dictionary columns with a local dictionary will be faster, because filtering is done on encoded data.
> It reduces store size and memory footprint, because only unique values are stored as part of the local dictionary and the corresponding data is stored in encoded form.
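
To make the idea in the description concrete, here is a minimal Python sketch of per-blocklet dictionary encoding. It is an illustration only, not CarbonData's implementation: the function names, the `threshold` parameter, the 2-byte length prefix assumed for LV storage, and the 4-byte code width are all assumptions chosen for the example.

```python
from typing import List, Optional, Tuple


def encode_blocklet(values: List[str],
                    threshold: int = 10000) -> Optional[Tuple[List[str], List[int]]]:
    """Build a dictionary for one blocklet and encode its values as integer codes.

    Returns None when the number of distinct values exceeds `threshold`
    (a hypothetical cut-off), signalling a fallback to plain storage.
    """
    dictionary: List[str] = []   # code -> value
    index = {}                   # value -> code
    codes: List[int] = []
    for v in values:
        code = index.get(v)
        if code is None:
            if len(dictionary) >= threshold:
                return None      # cardinality too high for a local dictionary
            code = len(dictionary)
            index[v] = code
            dictionary.append(v)
        codes.append(code)
    return dictionary, codes


def filter_equals(dictionary: List[str], codes: List[int], literal: str) -> List[int]:
    """Return row ids equal to `literal` by comparing integer codes, not strings."""
    if literal not in dictionary:
        return []                # value absent from this blocklet: skip it entirely
    target = dictionary.index(literal)
    return [row for row, c in enumerate(codes) if c == target]


def lv_size(values: List[str]) -> int:
    """Bytes for length-value storage, assuming a 2-byte length prefix per value."""
    return sum(2 + len(v.encode("utf-8")) for v in values)


def encoded_size(dictionary: List[str], codes: List[int], code_width: int = 4) -> int:
    """Bytes for the dictionary (in LV form) plus fixed-width codes (assumed width)."""
    return lv_size(dictionary) + code_width * len(codes)


column = ["Bangalore", "Shenzhen", "Bangalore", "Bangalore", "Shenzhen", "Bangalore"]
dictionary, codes = encode_blocklet(column)
matches = filter_equals(dictionary, codes, "Shenzhen")
plain_bytes = lv_size(column)
local_dict_bytes = encoded_size(dictionary, codes)
```

With this sample column the dictionary holds two entries, the filter touches only integer codes, and the encoded form is smaller than the LV form. The saving grows with the number of rows per distinct value, and a real encoder could also pack the codes into fewer bits.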



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)