
Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/01/12 14:43:40 UTC

[jira] [Resolved] (SPARK-1521) Take character set size into account when compressing in-memory string columns

     [ https://issues.apache.org/jira/browse/SPARK-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-1521.
------------------------------
    Resolution: Won't Fix

I assume this is obsolete, or else already implemented in some sense by Tungsten.

> Take character set size into account when compressing in-memory string columns
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-1521
>                 URL: https://issues.apache.org/jira/browse/SPARK-1521
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.1.0
>            Reporter: Cheng Lian
>              Labels: compression
>
> Quoted from a Facebook blog post (https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/):
> "Strings dominate the largest tables in our warehouse and make up about 80% of the columns across the warehouse, so optimizing compression for string columns was important. By using a threshold on observed number of distinct column values per stripe, we modified the ORCFile writer to apply dictionary encoding to a stripe only when beneficial. Additionally, we sample the column values and take the character set of the column into account, since a small character set can be leveraged by codecs like Zlib for good compression and dictionary encoding then becomes unnecessary or sometimes even detrimental if applied."
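
The heuristic described in the quote can be sketched roughly as follows. This is only an illustration of the idea, not the ORCFile writer's code and not anything Spark implements: the function name `should_dictionary_encode`, the parameter names, and all threshold values are hypothetical and chosen for the example.

```python
import random


def should_dictionary_encode(values, distinct_ratio_threshold=0.8,
                             charset_threshold=16, sample_size=1000, seed=0):
    """Decide whether dictionary encoding looks worthwhile for a string column.

    Hypothetical sketch: the thresholds are illustrative assumptions,
    not values taken from ORC, Facebook's writer, or Spark.
    """
    if not values:
        return False

    # Step 1: threshold on the observed number of distinct values.
    # If almost every value is unique, a dictionary only adds overhead.
    distinct_ratio = len(set(values)) / len(values)
    if distinct_ratio > distinct_ratio_threshold:
        return False

    # Step 2: sample the column and collect the characters it uses.
    rng = random.Random(seed)
    if len(values) <= sample_size:
        sample = values
    else:
        sample = rng.sample(values, sample_size)
    charset = set()
    for value in sample:
        charset.update(value)

    # A small character set is assumed to compress well with a
    # general-purpose codec such as Zlib, so skip the dictionary.
    return len(charset) > charset_threshold


if __name__ == "__main__":
    repeated_text = ["The quick brown fox jumps over the lazy dog"] * 100
    binary_like = ["0101", "1100", "0011"] * 100
    print(should_dictionary_encode(repeated_text))
    print(should_dictionary_encode(binary_like))
```

In this sketch the distinct-value check runs first because it is the cheaper signal; the character-set check only decides between columns that already repeat enough for a dictionary to be plausible.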


