Posted to dev@arrow.apache.org by Phillip Cloud <cp...@gmail.com> on 2017/09/18 17:20:21 UTC

Decimal Format

Hi all,

I’d like to propose the following changes to the in-memory Decimal128
format and solicit feedback.

   1. When converting to and from an array of bytes, the input bytes are
   *assumed* to be in big-endian order and the output bytes are *guaranteed*
   to be in big-endian order. Additionally, the number of bytes is not assumed
   to be 16 anymore, and the length must be passed in to the Decimal128
   constructor.
   2. The byte width of the underlying FixedSizeBinary builder is the
   minimum number of bytes necessary to represent a Decimal128 of a given
   precision. For example, if my decimal type has precision 21, the byte
   width of my FixedSizeBinary builder will be 9 (see the sketch after
   this list).
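
To make the two rules concrete, here is a minimal C++ sketch of both: the
minimum byte width for a given precision, and decoding a variable-length
big-endian two's-complement buffer. The names MinimumBytesForPrecision and
FromBigEndianBytes are illustrative only, not existing Arrow APIs, and the
sketch assumes a compiler with the __int128 extension (GCC/Clang):

    #include <cmath>
    #include <cstddef>
    #include <cstdint>

    // Illustrative helper (not an existing Arrow API): minimum number of
    // two's-complement bytes needed for any value of the given decimal
    // precision -- enough bits for 10^precision - 1 plus one sign bit.
    int32_t MinimumBytesForPrecision(int32_t precision) {
      double bits = precision * std::log2(10.0) + 1.0;
      return static_cast<int32_t>(std::ceil(bits / 8.0));
    }

    // Illustrative helper: interpret `length` (1..16) big-endian bytes as
    // a signed two's-complement integer, sign-extending into 128 bits.
    __int128 FromBigEndianBytes(const uint8_t* bytes, size_t length) {
      // Start from all ones if the most significant byte has its sign bit
      // set, so the high bits end up sign-extended after the shifts.
      unsigned __int128 value =
          (bytes[0] & 0x80) ? ~static_cast<unsigned __int128>(0) : 0;
      for (size_t i = 0; i < length; ++i) {
        value = (value << 8) | bytes[i];
      }
      return static_cast<__int128>(value);
    }

With this rule, MinimumBytesForPrecision(21) returns 9 and
MinimumBytesForPrecision(38) returns 16, matching the precision-21 example
above.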

Both of these changes are motivated by the Parquet file format's decimal
specification and by the Impala Parquet reader/writer.

Pros:

   1. Parquet writes from Arrow arrays store only the minimum number of
   bytes on disk, saving space, and no additional logic is required in
   parquet-cpp.
   2. parquet-cpp would be able to easily read Parquet files written by
   Impala and to write Impala-compatible Parquet files.

Cons:

   1. Additional work is required in the Java implementation, since
   decimals there are currently always 16 bytes.
   2. Integration tests are broken because decimal byte widths are no
   longer reliably 16 bytes.
   3. Possibly worse in-memory performance because of array-element byte
   widths that are not powers of two.

Possible alternatives:

   - Implement the logic only in parquet-cpp
      - I originally went down this road, but it was significantly more
      complicated to do this in parquet-cpp than it was to change the
      Arrow decimal format. I’d be willing to push harder on this if
      changing the Arrow decimal format in the proposed way is not a
      viable option.

I have a WIP PR up that implements this on the Arrow side
(https://github.com/apache/arrow/pull/1108), but I’m happy to can it (or
not) based on this discussion.

Thanks,
Phillip

Re: Decimal Format

Posted by Wes McKinney <we...@gmail.com>.
hi Phillip,

Thanks for bringing up these questions.

I have a few questions about this:

1. What format does parquet-mr (e.g. Hive, Spark) write for various
precisions? Is it always 16 bytes? My understanding is that parquet-mr
uses BYTE_ARRAY rather than FIXED_LEN_BYTE_ARRAY for decimals.

2. Can Impala read the decimals written by parquet-mr?

3. Can parquet-mr / Spark read the byte-packed decimals written by Impala?

4. Is writing decimals using the same encoding format as parquet-mr an option?

A parsimonious solution from a computational point of view is to make
the in-memory representation of decimals always 16 bytes, so that any
analytical operator kernels we write for Decimal128 / 16-byte decimals
can operate directly on the memory without any bit-twiddling
conversions. It would be potentially problematic (performance-wise) to
have to convert to Decimal128, do some math, then convert back to the
"packed" representation for the given precision.

Based on this issue alone (evaluating in-memory analytics on decimals),
my inclination would be to handle the encoding of decimals to Parquet
format strictly in parquet-cpp, and let the in-memory format be
consistently 16 bytes (with the concession that some of the bytes will
be all 0's for smaller precisions). It would be useful to know the
answers to the above questions to help with understanding all angles
of the problem, though.
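
To illustrate what handling this strictly in parquet-cpp could look like,
here is a rough sketch of the boundary conversion: keep the in-memory
value as a full 16-byte Decimal128 and emit only the trailing big-endian
bytes when encoding to FIXED_LEN_BYTE_ARRAY. PackForParquet is a
hypothetical name, not an existing parquet-cpp function, and the sketch
again assumes the GCC/Clang __int128 extension:

    #include <array>
    #include <cstdint>
    #include <vector>

    // Hypothetical sketch: pack a 16-byte in-memory decimal down to the
    // minimum Parquet byte width only at write time.
    std::vector<uint8_t> PackForParquet(__int128 value, int32_t byte_width) {
      std::array<uint8_t, 16> be;
      unsigned __int128 u = static_cast<unsigned __int128>(value);
      // Write all 16 bytes big-endian, most significant byte first.
      for (int i = 15; i >= 0; --i) {
        be[i] = static_cast<uint8_t>(u & 0xFF);
        u >>= 8;
      }
      // For values that fit in the target precision, the leading bytes are
      // just sign extension, so the trailing `byte_width` bytes suffice.
      return std::vector<uint8_t>(be.end() - byte_width, be.end());
    }

Reading would do the inverse (sign-extend back out to 16 bytes), so any
compute kernels would only ever see the full-width representation.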

- Wes

