Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/01/12 14:43:40 UTC
[jira] [Resolved] (SPARK-1521) Take character set size into account when compressing in-memory string columns
[ https://issues.apache.org/jira/browse/SPARK-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-1521.
------------------------------
Resolution: Won't Fix
I assume this is obsolete, or else already implemented in some sense by Tungsten.
> Take character set size into account when compressing in-memory string columns
> ------------------------------------------------------------------------------
>
> Key: SPARK-1521
> URL: https://issues.apache.org/jira/browse/SPARK-1521
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.1.0
> Reporter: Cheng Lian
> Labels: compression
>
> Quoted from [a blog post|https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/] from Facebook:
> bq. Strings dominate the largest tables in our warehouse and make up about 80% of the columns across the warehouse, so optimizing compression for string columns was important. By using a threshold on observed number of distinct column values per stripe, we modified the ORCFile writer to apply dictionary encoding to a stripe only when beneficial. Additionally, we sample the column values and take the character set of the column into account, since a small character set can be leveraged by codecs like Zlib for good compression and dictionary encoding then becomes unnecessary or sometimes even detrimental if applied.
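
The heuristic described in the quoted passage could be sketched roughly as follows. This is only an illustration of the idea, not the ORCFile writer change from the blog post and not anything in Spark: the function name choose_string_encoding, both thresholds, and the sample size are assumptions made up for the example.

```python
import random


def choose_string_encoding(values, max_distinct_ratio=0.8,
                           min_charset_size=16, sample_size=1000, seed=0):
    """Pick "dictionary" or "plain" for one batch of string values.

    Hypothetical helper: the thresholds are illustrative defaults,
    not values taken from ORC, Spark, or the quoted blog post.
    """
    if not values:
        return "plain"

    # Step 1: too many distinct values means the dictionary would be
    # nearly as large as the data, so dictionary encoding does not pay off.
    distinct_ratio = len(set(values)) / len(values)
    if distinct_ratio > max_distinct_ratio:
        return "plain"

    # Step 2: sample the values and measure the character set size.
    rng = random.Random(seed)
    if len(values) <= sample_size:
        sample = values
    else:
        sample = rng.sample(values, sample_size)
    charset = set()
    for value in sample:
        charset.update(value)

    # A small character set is assumed to compress well with a
    # general-purpose codec such as Zlib, so skip the dictionary.
    if len(charset) < min_charset_size:
        return "plain"
    return "dictionary"


if __name__ == "__main__":
    dna = ["ACGT", "TTGA", "GGCA"] * 100
    text = ["The quick brown fox", "jumps over the lazy dog"] * 100
    print(choose_string_encoding(dna))   # plain (only 4 distinct characters)
    print(choose_string_encoding(text))  # dictionary
```

A real implementation would presumably make this decision per stripe (or per cached column batch) and tune the thresholds against measured compression ratios rather than fixed constants.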