Posted to dev@parquet.apache.org by Henry Robinson <he...@apache.org> on 2016/11/16 22:41:37 UTC

Decimal binary encoding

Hi -

I'm adding binary encoding support for decimal to Impala, and have one
question about some wording in the spec:

"binary: precision is not limited, but is required. The minimum number of
bytes to store the unscaled value should be used"

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

When the spec says 'the minimum number of bytes', which of the following
does that mean:

* the minimum number of bytes to store a particular unscaled value must be
used (so for '8' it's one byte, for '550' it's two bytes and so on), and
the encoded length is value dependent.

or

* the minimum number of bytes for the given precision must be used (so all
values in a given column should have the same byte length).

If it's the latter, the implementation is much easier because
FIXED_LEN_BYTE_ARRAY becomes a special case of BINARY, but the former
offers more opportunity for compact representations on a high precision
column that in practice has low precision values.
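To make the two readings concrete, here is a minimal sketch of the byte counts each one implies. It assumes the unscaled value is stored as a signed two's-complement integer, as the spec describes for decimals; the function names are made up for illustration and are not from Impala or any Parquet library.

```python
def min_bytes_for_value(unscaled: int) -> int:
    """Smallest two's-complement width, in bytes, that can hold `unscaled`."""
    # For a negative value, ~unscaled == -unscaled - 1, which has the same
    # number of magnitude bits that the two's-complement form needs.
    magnitude_bits = (unscaled if unscaled >= 0 else ~unscaled).bit_length()
    # One extra bit for the sign, then round up to whole bytes.
    return (magnitude_bits + 1 + 7) // 8


def min_bytes_for_precision(precision: int) -> int:
    """Smallest width that can hold every unscaled value with `precision` digits."""
    # The largest magnitude at this precision is 10**precision - 1.
    return min_bytes_for_value(10 ** precision - 1)


# First reading: the length depends on each value.
for value in (8, 550):
    print(value, min_bytes_for_value(value))

# Second reading: the length depends only on the column's precision.
print(38, min_bytes_for_precision(38))
```

Under the first reading '8' takes one byte and '550' takes two; under the second, every value in a precision-38 column takes the same width regardless of its magnitude.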

Thanks,
Henry

Re: Decimal binary encoding

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

The intent was for binary to store the minimum number of bytes for each
unscaled value. Fixed should be used if you want to store all values with
the same number of bytes because that avoids writing a length for each byte
array. Binary works well for the case you described, where you have a large
precision, but enough small values to offset the cost of storing the length.
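As a rough illustration of that trade-off, the sketch below encodes each unscaled value in its minimal big-endian two's-complement form and compares the total size against a fixed-width layout. It assumes plain encoding, where each binary value carries a 4-byte length prefix, and it ignores dictionary encoding and compression. The helper names and the sample values are hypothetical.

```python
LENGTH_PREFIX_BYTES = 4  # assumption: plain-encoded binary values carry a 4-byte length


def encode_unscaled(unscaled: int) -> bytes:
    """Minimal big-endian two's-complement encoding of an unscaled decimal value."""
    magnitude_bits = (unscaled if unscaled >= 0 else ~unscaled).bit_length()
    width = (magnitude_bits + 1 + 7) // 8  # sign bit, rounded up to whole bytes
    return unscaled.to_bytes(width, byteorder="big", signed=True)


def binary_column_size(values) -> int:
    """Total bytes when each value is stored as length prefix + minimal bytes."""
    return sum(LENGTH_PREFIX_BYTES + len(encode_unscaled(v)) for v in values)


def fixed_column_size(values, width: int) -> int:
    """Total bytes when every value is stored with the same width and no prefix."""
    return len(values) * width


# Hypothetical precision-38 column (16-byte fixed width) holding mostly small values.
values = [8, 550, -1, 10 ** 37]
print("binary:", binary_column_size(values))
print("fixed: ", fixed_column_size(values, 16))
```

With mostly small values the per-value length prefix is more than paid back; with values that mostly fill the full width, the fixed layout comes out smaller because the prefix is pure overhead.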

rb

-- 
Ryan Blue
Software Engineer
Netflix