Posted to user@hive.apache.org by qihua wu <wu...@gmail.com> on 2013/11/11 15:22:10 UTC

RLE in hive ORC

In Vertica, if I have a sorted column and the same value repeats 1M times,
it uses very little storage, because it stores only (value, 1M). But in ORC,
it looks like the max run length is less than 200 (not very sure, but on the
order of hundreds). Why restrict the max run length?

Re: RLE in hive ORC

Posted by Prasanth Jayachandran <pj...@hortonworks.com>.
As Owen noted, the max run for version 0.11 is 130. The minimum run for RLE to be used is 3, and the 7-bit field stores the run length minus 3, so the largest value it can hold (127) corresponds to a run of 130.
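The relationship between the 7-bit field and the run length can be sketched as below. This is only an illustration of the arithmetic described above, not code taken from Hive or ORC; the names MIN_RUN, MAX_CONTROL, control_byte and run_length are made up for this example.

```python
# Illustrative sketch only; the names here are hypothetical, not ORC APIs.
MIN_RUN = 3                        # shortest run that is stored as a run
MAX_CONTROL = 0x7F                 # largest value that fits in 7 bits (127)
MAX_RUN = MAX_CONTROL + MIN_RUN    # 127 + 3 = 130

def control_byte(run: int) -> int:
    """Map a run length (3..130) to the 7-bit value stored for it."""
    if not MIN_RUN <= run <= MAX_RUN:
        raise ValueError(f"run length must be in [{MIN_RUN}, {MAX_RUN}]")
    return run - MIN_RUN

def run_length(control: int) -> int:
    """Map a stored 7-bit value (0..127) back to the run length."""
    if not 0 <= control <= MAX_CONTROL:
        raise ValueError("control value must fit in 7 bits")
    return control + MIN_RUN

print(hex(control_byte(130)))  # 0x7f
print(run_length(0))           # 3
```

Because runs shorter than 3 are never stored as runs, shifting by 3 lets the same 7 bits reach 130 instead of 127.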

Thanks
Prasanth Jayachandran




Re: RLE in hive ORC

Posted by Owen O'Malley <om...@apache.org>.
Hi,
  The RLE in ORC is a tradeoff (as is all compression) between tight
representations for commonly occurring patterns and longer representations
for rarely occurring patterns. The question at hand is how to use the bits
available to reduce the average size of the column. In Hive 0.12, ORC
gained a second version of the RLE, so I'll split out the two versions:

ORC RLEv1 (max run = 130):

   1 million integer 0: 7692 copies of 7f 00 00 followed by 25 00 00 =
23,079 bytes

ORC RLEv2 (max run = 511):

  1 million integer 0: 1956 copies of c1 ff 00 00 followed by c1 e4 00 00 =
7,828 bytes
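The two byte counts above can be reproduced with a few lines of arithmetic. This is a rough sketch that uses only the figures quoted in this message (max run and bytes per run for each version); encoded_size is a hypothetical helper, and it assumes the final, shorter run costs the same number of bytes as a full one, as in the examples above.

```python
def encoded_size(num_values: int, max_run: int, bytes_per_run: int) -> int:
    """Bytes needed to store num_values identical values as back-to-back runs."""
    runs = -(-num_values // max_run)  # ceiling division: full runs plus one partial run
    return runs * bytes_per_run

ONE_MILLION = 1_000_000

# RLEv1 as described above: max run 130, 3 bytes per run
rle_v1_bytes = encoded_size(ONE_MILLION, max_run=130, bytes_per_run=3)

# RLEv2 as described above: max run 511, 4 bytes per run
rle_v2_bytes = encoded_size(ONE_MILLION, max_run=511, bytes_per_run=4)

print(rle_v1_bytes)  # 23079
print(rle_v2_bytes)  # 7828
```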

With generic compression (ZLIB, Snappy) on top of this, it shrinks even
further.

So back to the original question, it was a tradeoff between complexity and
the size of common cases. The length of the run in both cases has a fixed
number of bits, and if we had used 32 bits for the repetition length, the
more typical case of 5 to 10 repetitions would have been far worse.
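To make the tradeoff concrete, here is a small comparison of a one-byte run length against a hypothetical format with a 32-bit run length. Everything in it is an assumption made for illustration: column_size is not an ORC function, the 32-bit layout is not a real ORC encoding, and each run is assumed to carry two further bytes (delta and value) as in the 00 00 of the examples above.

```python
def column_size(run_lengths, length_bytes: int, max_run: int) -> int:
    """Total bytes when every stored run costs length_bytes + 2 (delta + value)."""
    total = 0
    for n in run_lengths:
        pieces = -(-n // max_run)          # a long run is split into max_run pieces
        total += pieces * (length_bytes + 2)
    return total

short_runs = [8] * 125_000     # 1M values in typical short runs of 8
one_long_run = [1_000_000]     # 1M values in a single run

# One-byte length field, max run 130 (the RLEv1 figures from this thread)
narrow_short = column_size(short_runs, length_bytes=1, max_run=130)
narrow_long = column_size(one_long_run, length_bytes=1, max_run=130)

# Hypothetical 32-bit length field
wide_short = column_size(short_runs, length_bytes=4, max_run=2**32 - 1)
wide_long = column_size(one_long_run, length_bytes=4, max_run=2**32 - 1)

print(narrow_short, wide_short)  # 375000 750000
print(narrow_long, wide_long)    # 23079 6
```

Under these assumptions the wide length field wins by a large factor on the single 1M run, but doubles the size of the column made of short runs, which is the more typical case.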



Re: RLE in hive ORC

Posted by Prasanth Jayachandran <pj...@hortonworks.com>.
Runs of 1M are not the common case. I am not sure how Vertica stores the run lengths; it seems like variable-length integers are used.
ORC does not use variable-length integers for storing the run length. A variable-length integer has the advantage of storing much longer runs, but for repeated shorter runs it wastes lots of bytes. ORC uses fixed-width fields to store the run length (7 bits in the older version and 9 bits in the newer version), so it is good for shorter runs.

There are two versions of RLE in ORC. The old version (0.11) packs the run length into the lower 7 bits of a byte, so the largest stored value is 127; because the minimum run of 3 is subtracted before storing, that corresponds to a max run of 130. The new version (0.12) uses 9 bits to store the run length, giving a max run of 511. The new version also uses a different encoding for short runs (<10), which saves a byte.

Thanks
Prasanth Jayachandran


