Posted to user@orc.apache.org by Telco Phone <te...@yahoo.com> on 2017/03/25 02:10:11 UTC

Compress compare

I have been comparing my custom ORC-converted files with Hive-formatted tables (written using Hive).
My question is: what does "Compression size" mean here, and why does it differ between the two files?

HIVE table:
File Version: 0.12 with HIVE_8732
Rows: 194032
Compression: ZLIB
Compression size: 262144

My Orc File:
File Version: 0.12 with ORC_101
Rows: 229376
Compression: ZLIB
Compression size: 4096

I use this for writing my file:
OrcFile.writerOptions(conf)
    .stripeSize(100000)
    .bufferSize(10000)
    .compress(org.apache.orc.CompressionKind.ZLIB)
    .version(OrcFile.Version.V_0_12)
    .setSchema(orcSchema);

Re: Compress compare

Posted by Prasanth Jayachandran <j....@gmail.com>.
Compression size is the size of the in-memory buffer that holds the columnar data for each stream of each column. When a buffer fills up, the writer compresses it using the compression codec. The sizes differ because in older versions these buffers were held in memory, which caused high memory pressure and often led to OOM errors. In recent versions, the writer adjusts the buffer size based on the number of columns, the stripe size, and the available memory to avoid OOMs.

Thanks
Prasanth
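
To make the buffer adjustment described above concrete, here is a rough, standalone sketch of how a writer might shrink the requested buffer size based on stripe size and column count. It does not call the ORC library. The class and method names are hypothetical, and the sizing rule (about 20 buffers per column, rounded up to a power of two between 4 KB and 256 KB) is an assumption about the heuristic that may not match every ORC version. The column count of 10 is also made up, since the thread does not state it, and the 64 MB stripe and 256 KB buffer used for the Hive case are assumed defaults.

```java
// Illustrative sketch only: names are hypothetical, and the "/ 20" rule and the
// 4 KB..256 KB ladder are assumptions, not the ORC library's API.
class BufferSizeSketch {

    // Round an estimate up to the next power-of-two size between 4 KB and 256 KB.
    static int closestBufferSize(long estimate) {
        int size = 4 * 1024;
        while (size < estimate && size < 256 * 1024) {
            size *= 2;
        }
        return size;
    }

    // Assume up to 2 large streams per column and about 10 buffers per stream,
    // so each buffer gets roughly stripeSize / (20 * numColumns) bytes.
    // The result never exceeds the size the caller requested.
    static int estimateBufferSize(long stripeSize, int numColumns, int requested) {
        long estimate = stripeSize / (20L * numColumns);
        return Math.min(closestBufferSize(estimate), requested);
    }

    public static void main(String[] args) {
        int columns = 10; // hypothetical; the thread does not give the column count

        // Settings from the question: stripeSize(100000), bufferSize(10000).
        System.out.println(estimateBufferSize(100_000L, columns, 10_000)); // prints 4096

        // Assumed Hive-style settings: 64 MB stripe, 256 KB buffer.
        System.out.println(estimateBufferSize(64L * 1024 * 1024, columns, 256 * 1024)); // prints 262144
    }
}
```

Under these assumptions, a stripe size as small as 100000 bytes pushes the estimate down to the 4 KB floor regardless of the requested 10000, which would be consistent with the "Compression size: 4096" reported for the custom file.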



