Posted to issues@carbondata.apache.org by "Jihong MA (JIRA)" <ji...@apache.org> on 2016/12/16 01:38:58 UTC
[jira] [Updated] (CARBONDATA-431) Improve compression ratio for numeric datatype
[ https://issues.apache.org/jira/browse/CARBONDATA-431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jihong MA updated CARBONDATA-431:
---------------------------------
Description:
Carbon has a better compression ratio than Parquet and ORC for the String type, but the worst of the three for numeric data types. Identify the issues with Carbon's current numeric datatype compression so that it can achieve a better compression ratio.
DataType                  | Text | Parquet | ORC | Carbon
decimal                   | 16G  | 11G     | 6G  | 13G
int                       | 5G   | 1G      | 1G  | 3G
String (high cardinality) | 24G  | 22G     | 11G | 3G (no dictionary)
String (low cardinality)  | 30G  | 4G      | 4G  | 1G (dictionary encoding)
                          |      |         |     | 1G (dictionary encoding without inverted index)
                          |      |         |     | 3G (no dictionary encoding)
was:
Across data types, Carbon's string type has a better compression ratio, but for numeric types ORC has the best compression. We should analyze numeric datatype compression in Carbon to get a better compression ratio.
DataType                  | Text | Parquet | ORC | Carbon
decimal                   | 16G  | 11G     | 6G  | 13G
int                       | 5G   | 1G      | 1G  | 3G
String (high cardinality) | 24G  | 22G     | 11G | 3G (no dictionary)
String (low cardinality)  | 30G  | 4G      | 4G  | 1G (dictionary encoding)
                          |      |         |     | 1G (dictionary encoding without inverted index)
                          |      |         |     | 3G (no dictionary encoding)
Summary: Improve compression ratio for numeric datatype (was: Analysis compression for numeric datatype compared with Parquet/ORC)
> Improve compression ratio for numeric datatype
> -----------------------------------------------
>
> Key: CARBONDATA-431
> URL: https://issues.apache.org/jira/browse/CARBONDATA-431
> Project: CarbonData
> Issue Type: Sub-task
> Reporter: suo tong
> Assignee: Ashok Kumar
> Fix For: 1.0.0-incubating
>
> Time Spent: 2h 50m
> Remaining Estimate: 0h
>
> Carbon has a better compression ratio than Parquet and ORC for the String type, but the worst of the three for numeric data types. Identify the issues with Carbon's current numeric datatype compression so that it can achieve a better compression ratio.
> DataType                  | Text | Parquet | ORC | Carbon
> decimal                   | 16G  | 11G     | 6G  | 13G
> int                       | 5G   | 1G      | 1G  | 3G
> String (high cardinality) | 24G  | 22G     | 11G | 3G (no dictionary)
> String (low cardinality)  | 30G  | 4G      | 4G  | 1G (dictionary encoding)
>                           |      |         |     | 1G (dictionary encoding without inverted index)
>                           |      |         |     | 3G (no dictionary encoding)
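
To make the numeric gap in the table above concrete, here is a rough sketch that turns the reported sizes into compression ratios. It assumes "compression ratio" means the Text size divided by the size in each format, which the issue does not define explicitly. The dictionary and function names are illustrative only and are not part of CarbonData.

```python
# Sizes in GB, copied from the numeric rows of the table in this issue.
sizes_gb = {
    "decimal": {"Text": 16, "Parquet": 11, "ORC": 6, "Carbon": 13},
    "int": {"Text": 5, "Parquet": 1, "ORC": 1, "Carbon": 3},
}


def compression_ratios(row):
    """Return Text size / format size for each format (higher is better).

    Assumption: the ratio is measured against the plain-text size.
    """
    text_size = row["Text"]
    return {fmt: text_size / size for fmt, size in row.items() if fmt != "Text"}


# Ratios per data type, e.g. ratios["int"]["Carbon"] is 5 / 3.
ratios = {dtype: compression_ratios(row) for dtype, row in sizes_gb.items()}

for dtype, per_format in ratios.items():
    rounded = {fmt: round(value, 2) for fmt, value in per_format.items()}
    print(dtype, rounded)
```

Under this definition, Carbon has the lowest ratio of the three formats for both decimal and int, which matches the description's statement that Carbon compresses numeric types worst.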