Posted to user@orc.apache.org by Wu Gang <us...@gmail.com> on 2017/03/17 05:17:02 UTC

Fwd: ORC Decimal type

Hi

I'm Gang Wu, an inactive Spark contributor, and I'm curious about the
design of the decimal type in ORC. From the documentation and the Java
code, I believe it works as follows; please correct me if I'm wrong:

1. A decimal value uses at most 127 bits for the magnitude plus 1 bit for
the sign, i.e. at most 128 bits in total;
2. Although the precision and scale of a decimal column are stored in the
file footer, the scale of each element is still written to the SECONDARY
stream using signed integer RLE. The scale written there is independent of
the scale stored in the file footer: it can be the same or completely
different (see the sketch below).
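
To make the two streams concrete, here is a minimal Java sketch (my own
illustration, not the actual ORC writer code) of how a single decimal
value splits into the two pieces described above:

    import java.math.BigDecimal;
    import java.math.BigInteger;

    // Minimal illustration: a decimal value splits into an unscaled
    // integer (written to the DATA stream as a signed varint) and a
    // per-value scale (written to the SECONDARY stream with signed RLE).
    public class DecimalSplitSketch {
        public static void main(String[] args) {
            BigDecimal value = new BigDecimal("123.4500");

            BigInteger unscaled = value.unscaledValue(); // 1234500 -> DATA
            int scale = value.scale();                   // 4       -> SECONDARY

            System.out.println("DATA (unscaled): " + unscaled);
            System.out.println("SECONDARY (scale): " + scale);
        }
    }

Note that the scale recorded is the value's own (4 here), regardless of
the scale declared for the column.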

If all of the above is correct, why not do something like the PRESENT
stream? We could make the SECONDARY stream of decimal columns optional: if
every scale in a column equals the scale in the footer, the stream can
simply be omitted.
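
A writer-side check could look like this sketch (the helper and its name
are hypothetical, not an existing ORC API):

    import java.math.BigDecimal;
    import java.util.List;

    // Hypothetical check: suppress the SECONDARY stream when every
    // value's scale equals the footer scale, analogous to dropping the
    // PRESENT stream for a column that contains no nulls.
    public class SecondarySuppression {
        static boolean canSuppressSecondary(List<BigDecimal> column,
                                            int footerScale) {
            for (BigDecimal v : column) {
                if (v.scale() != footerScale) {
                    return false; // divergent scale: SECONDARY is needed
                }
            }
            return true; // all scales match: omit the stream entirely
        }
    }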

Also, I think we could save more space by writing delta scales in the
SECONDARY stream, i.e. writing (actualScale - scaleInFileFooter) instead
of actualScale. But this may break backward compatibility.
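
For concreteness, a tiny worked example of the delta idea (numbers are
illustrative only):

    import java.math.BigDecimal;

    // Illustrative only: a delta against the footer scale turns the
    // common case (matching scales) into a run of zeros, which signed
    // integer RLE compresses extremely well.
    public class DeltaScaleSketch {
        public static void main(String[] args) {
            int footerScale = 10;
            BigDecimal v = new BigDecimal("123.4500000000"); // scale 10
            int delta = v.scale() - footerScale; // 0 written instead of 10
            System.out.println("delta written to SECONDARY: " + delta);
        }
    }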

Any reply is welcome! Thanks!

Best,
Gang

Re: ORC Decimal type

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> If all of the above is correct, why not do something like the PRESENT stream? We could make the SECONDARY stream of decimal columns optional: if every scale in a column equals the scale in the footer, the stream can simply be omitted.

I think that is a valid case for suppressing that stream, since most people using Decimal in SQL form will specify a consistent decimal size across all values.

> Also, I think we could save more space by writing delta scales in the SECONDARY stream, i.e. writing (actualScale - scaleInFileFooter) instead of actualScale.

Integer columns should compress well without manually layering a delta encoding on top.

I was under the impression that the current scale folds neatly into the integer encoding for RLE.

In my experiment, 1.9M values became 102 bytes, which is not nothing - but is very small.

    Column 1: count: 1920800 hasNull: false min: 1 max: 1920800 sum: 1844737280400
    Column 2: count: 1920800 hasNull: false min: 1 max: 1920800 sum: 1844737280400
…
    Stream: column 1 section DATA start: 7210 length 3178854
    Stream: column 1 section SECONDARY start: 3186064 length 102
    Stream: column 2 section DATA start: 3186166 length 5056
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2

This is a Decimal(28,10), bigint table. The part that matters is the DATA size there, considering I inserted 1.9M sequential integers into it.

The same data as bigint is only 5056 bytes.

The SECONDARY stream seems to compress pretty tight, however; as you mention, it is completely unnecessary and can be suppressed when the scale is all the same.
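
A back-of-envelope estimate (RLEv2 details assumed from the spec, not
measured) of why that stream ends up so small:

    // Assumption: a constant run encodes as a fixed-delta run of up to
    // 512 values in roughly 4 bytes (2-byte header + base varint + zero
    // delta), so 1,920,800 identical scales need about 3,752 runs,
    // ~15 KB raw, and that highly repetitive pattern then lets the
    // codec squeeze it down to the ~102 bytes observed above.
    public class SecondarySizeEstimate {
        public static void main(String[] args) {
            long values = 1_920_800L;
            long runs = (values + 511) / 512; // ceil(values / 512)
            System.out.println(runs + " runs, ~" + runs * 4
                               + " bytes before codec");
        }
    }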

A better decimal encoding is badly needed; to get the best out of this, the breaking change should also tackle DATA. Experiments and ideas would be appreciated.

Cheers,
Gopal



Re: ORC Decimal type

Posted by Owen O'Malley <om...@apache.org>.
Gang,
   When decimal was first introduced in Hive, it had unbounded precision,
so ORC's encoding had to support that. You should look at the discussion on
https://issues.apache.org/jira/browse/ORC-161 , but you are absolutely
right that we should create a new encoding for decimal that doesn't encode
the scale. We should also use RLE for the values.
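
For illustration, a sketch of what such an encoding could look like
(hypothetical, not an existing ORC encoding): when the precision fits in
64 bits, store only the unscaled long through the existing integer RLE and
take the scale from the footer, dropping SECONDARY entirely.

    import java.math.BigDecimal;
    import java.math.RoundingMode;

    // Hypothetical encoding sketch: for precision <= 18 the unscaled
    // value fits in a long, so it can go through the existing integer
    // RLE, and the fixed footer scale reconstructs the decimal exactly.
    public class Decimal64Sketch {
        static long encode(BigDecimal v, int footerScale) {
            // UNNECESSARY throws rather than silently rounding.
            return v.setScale(footerScale, RoundingMode.UNNECESSARY)
                    .unscaledValue().longValueExact();
        }

        static BigDecimal decode(long unscaled, int footerScale) {
            return BigDecimal.valueOf(unscaled, footerScale);
        }

        public static void main(String[] args) {
            long u = encode(new BigDecimal("123.45"), 10); // 1234500000000
            System.out.println(decode(u, 10));             // 123.4500000000
        }
    }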

.. Owen
