You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@orc.apache.org by "Teng, dejun" <te...@buckeyemail.osu.edu> on 2017/01/09 03:46:32 UTC
答复: about rle encoding method selection

Hi Prasanth,

We had a group meeting today, and surprisingly I found we have a strong connection! I just noticed you graduated from OSU, and I’m currently a PHD student working with professor Xiaodong Zhang, and he said you attended his class and you are a smart and intelligent boy!
Thank you for your response, ORC is a great data format, but I still have some concerns about it. Despite the directly stored data like the double, float and data part of the string, the most important part is the RLE, which is also the most time consuming part while decoding.
One of my biggest concern is, why we involve the short repeat encoding? With this encoding, some column is separated into too much short runs, and the decoding becomes unnecessarily slow, especially after I vectorized all decoding functions, decoding the header of each run becomes the bottleneck. On the other hand, it doesn’t make the compression ratio bigger for many cases. I tried to disable this encoding method and re-encoding the lineitem table, which is the biggest table of TPC-H benchmark, I find most of the columns are even smaller without the short repeat encoding.

Another concern I have is, why we don’t have a patched base for the direct encoding? For example, if we have an integer list, 678999 678998 677777 678888, we can store the minimum value of this list which is 677777, and store only the differences of all values. I think this is a common case. I also added this feature in my test and the sizes of some columns are significantly smaller.

I believe the committee members must have their own concerns about those two cases,  do you have any idea why ORC choose to use the short repeat encoding and not has a patch base for the direct encoding?

Thanks,
Dejun


发件人: Prasanth Jayachandran [mailto:j.prasanth.j@gmail.com]
发送时间: 2017年1月3日 13:56
收件人: user@orc.apache.org; user@orc.apache.org
主题: Re: about rle encoding method selection

Hi Dejun

Currently the encoding mode is decided on the buffered data (512 values) combined. Only short repeat (delta 0 and run length <10 IIRC) is encoded immediately. For all other encodings all 512 values are analyzed together to make a decision. For the sequence that you had mentioned, 1 2 3 4 5 6 7 3 sequence is not a monotonic sequence and hence direct encoding is chosen. The encoding headers have 9 bits to encode the runs and hence 512 literals are analyzed together to minimize the header to data overhead. Patched delta is chosen only when there is sudden spike in bit requirements between 95th and 100th percentile values. Choosing this way results in smaller overhead for encoding the header information.

Thanks
Prasanth

_____________________________
From: Teng, dejun <te...@buckeyemail.osu.edu>>
Sent: Wednesday, December 28, 2016 1:55 AM
Subject: about rle encoding method selection
To: <us...@orc.apache.org>>

Hi,

I’m currently working on a project which needs me to dig deep into the RLE encoding of ORC. ORC’s version 2 RLE supports 4 types of encoding methods, direct, repeat, delta and patched as defined in the document. My concern is, is there some standard for the selection of encoding methods?

For example, for a list of integers: 1 2 3 4 5 6 7 3, it could be encoded with direct encoding for all 8 integers, or delta encoding for first 7 integers and direct encoding for the last one. And the latter has a better compression ratio.
I’ve read the source code of the class in RunLengthIntegerWriterV2.java, it seems like that the writer always track if any repeated values were written recently, and writeValues() the values before the repeated values in proper encoding and then start another run with the newly written repeated values. And for the integer list given above, it would be encoded with direct encoding for all 8 integers. However, for an integer list, 1 2 3 4 5 6 7 3 3 3, first 7 integers will be encoded with delta encoding and last three with short repeat encoding.

I got puzzled with the selection of encoding methods. Is there a fixed policy for it? Like a standard finite state machine to describe the whole process. And what is the principle of the selection? If it is to compress the data as small as possible, the released code is not optimized which is proved by the given case above. For my project, I need an official definition  of the encoding policy of ORC’s RLE2 but I cannot find any hint on the official website. It will be appreciated if you can give me some official response for this issue. Do we have any standard encoding policy? Or any policies are fine as long as they use the defined four encoding methods.


Thanks,
Dejun