Posted to dev@arrow.apache.org by Chao Sun <su...@apache.org> on 2022/04/21 22:28:46 UTC

Re: [Discuss][Format] Add 32-bit and 64-bit Decimals

Any update on this proposal? I think this will be a useful addition
too. I can potentially help with the Rust-side implementation.

Chao

On Tue, Mar 8, 2022 at 1:00 PM Jorge Cardoso Leitão
<jo...@gmail.com> wrote:
>
> Agreed.
>
> Also, I would like to revise my previous comment about the small risk.
> While prototyping this I did hit some bumps. They primarily came from
> two sources:
>
> * I was unable to find arrow/json files among the arrow-testing generated
> files with a non-default decimal bitwidth (I think we only have the
> on-the-fly generated file in archery)
> * the FFI interface defaults the decimal bitwidth to 128
> (`d:{precision},{scale}`, with an optional trailing bitwidth field; see
> the sketch below), and implementations may not support the 256-bit case
> (e.g. Rust has no native i256). For these cases, this could be the first
> non-default decimal implementation.
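>
> For concreteness, here is a minimal sketch in plain Rust of how an
> implementation might parse the decimal format string, defaulting to 128
> bits when the third field is absent. It is illustrative only (the
> function name is made up, and this is not the actual arrow-rs code):
>
> fn parse_decimal_format(fmt: &str) -> Option<(u8, i8, u16)> {
>     // Expected forms: "d:19,10" (defaults to decimal128) or "d:19,10,256".
>     let rest = fmt.strip_prefix("d:")?;
>     let mut parts = rest.split(',');
>     let precision: u8 = parts.next()?.parse().ok()?;
>     let scale: i8 = parts.next()?.parse().ok()?;
>     // A missing third field means the default bitwidth of 128.
>     let bit_width: u16 = match parts.next() {
>         Some(s) => s.parse().ok()?,
>         None => 128,
>     };
>     Some((precision, scale, bit_width))
> }
>
> fn main() {
>     assert_eq!(parse_decimal_format("d:19,10"), Some((19, 10, 128)));
>     assert_eq!(parse_decimal_format("d:19,10,256"), Some((19, 10, 256)));
> }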
>
> So, maybe we follow the standard procedure?
>
> Best,
> Jorge
>
>
>
> On Tue, Mar 8, 2022 at 9:22 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
> > >
> > > I’d also like to chime in in favor of 32- and 64-bit decimals because
> > > it’ll help achieve better performance on TPC-H (and maybe other
> > > benchmarks). The decimal columns need only 12 digits of precision, for
> > > which a 64-bit decimal is sufficient. It’s currently wasteful to use a
> > > 128-bit decimal. You can technically use a float too, but I expect 64-bit
> > > decimal to be faster.
> >
> >
> > We should be careful here.  If this assumes loading from Parquet or
> > other file formats currently in the library, arbitrarily changing the
> > type to load the minimum data length possible could break users; this
> > should probably be a configuration option.  This also reminds me that I
> > think there is some technical debt with decimals and Parquet [1].
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-12022
> >
> > On Tue, Mar 8, 2022 at 11:05 AM Sasha Krassovsky <
> > krassovskysasha@gmail.com>
> > wrote:
> >
> > > I’d also like to chime in in favor of 32- and 64-bit decimals because
> > > it’ll help achieve better performance on TPC-H (and maybe other
> > > benchmarks). The decimal columns need only 12 digits of precision, for
> > > which a 64-bit decimal is sufficient. It’s currently wasteful to use a
> > > 128-bit decimal. You can technically use a float too, but I expect 64-bit
> > > decimal to be faster.
> > >
> > > Sasha Krassovsky
> > >
> > > > On 8 March 2022, at 09:01, Micah Kornfield <em...@gmail.com>
> > > > wrote:
> > > >
> > > >
> > > >>
> > > >>
> > > >> Do we want to keep the historical "C++ and Java" requirement or
> > > >> do we want to make it a more flexible "two independent official
> > > >> implementations", which could be for example C++ and Rust, Rust and
> > > >> Java, etc.
> > > >
> > > >
> > > > I think flexibility here is a good idea; I'd like to hear other
> > > > opinions.
> > > >
> > > > For this particular case, if there aren't volunteers to help out in
> > > > another implementation, I'm willing to help with Java (I don't have
> > > > the bandwidth to do both C++ and Java).
> > > >
> > > > Cheers,
> > > > -Micah
> > > >
> > > >> On Tue, Mar 8, 2022 at 8:23 AM Antoine Pitrou <an...@python.org>
> > > >> wrote:
> > > >>
> > > >>
> > > >> On 07/03/2022 at 20:26, Micah Kornfield wrote:
> > > >>>>
> > > >>>> Relaxing from {128,256} to {32,64,128,256} seems a low risk
> > > >>>> from an integration perspective, as implementations already need
> > > >>>> to read the bitwidth to select the appropriate physical
> > > >>>> representation (if they support it).
> > > >>>
> > > >>> I think there are two reasons for having implementations first:
> > > >>> 1.  Lower risk of bugs in the implementation/spec.
> > > >>> 2.  A mechanism to ensure that there is some bootstrapped coverage
> > > >>> in commonly used reference implementations.
> > > >>
> > > >> That sounds reasonable.
> > > >>
> > > >> Another question that came to my mind is: traditionally, we've
> > > >> mandated implementations in the two reference Arrow implementations
> > > >> (C++ and Java).  However, our implementation landscape is now much
> > > >> richer than it used to be (for example, there is tremendous
> > > >> activity on the Rust side).  Do we want to keep the historical
> > > >> "C++ and Java" requirement, or do we want to make it a more
> > > >> flexible "two independent official implementations", which could be
> > > >> for example C++ and Rust, Rust and Java, etc.?
> > > >>
> > > >> (by "independent" I mean that one should not be based on the other;
> > > >> for example it should not be "C++ and Python" :-))
> > > >>
> > > >> Regards
> > > >>
> > > >> Antoine.
> > > >>
> > > >>
> > > >>>
> > > >>> I agree 1 is fairly low-risk.
> > > >>>
> > > >>> On Mon, Mar 7, 2022 at 11:11 AM Jorge Cardoso Leitão <
> > > >>> jorgecarleitao@gmail.com> wrote:
> > > >>>
> > > >>>> +1 on adding 32- and 64-bit decimals.
> > > >>>>
> > > >>>> +0 to release it without integration tests - both IPC and the
> > > >>>> C data interface use a variable bit width to declare the
> > > >>>> appropriate size for decimal types. Relaxing from {128,256} to
> > > >>>> {32,64,128,256} seems a low risk from an integration perspective,
> > > >>>> as implementations already need to read the bitwidth to select
> > > >>>> the appropriate physical representation (if they support it).
> > > >>>>
> > > >>>> Best,
> > > >>>> Jorge
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> On Mon, Mar 7, 2022, 11:41 Antoine Pitrou <an...@python.org>
> > > >>>> wrote:
> > > >>>>
> > > >>>>>
> > > >>>>> On 03/03/2022 at 18:05, Micah Kornfield wrote:
> > > >>>>>> I think it makes sense to add these.  Typically when adding
> > > >>>>>> new types, we've waited on the official vote until there are
> > > >>>>>> two reference implementations demonstrating compatibility.
> > > >>>>>
> > > >>>>> You are right, I had forgotten about that.  Though in this
> > > >>>>> case, it might be argued we are just relaxing the constraints
> > > >>>>> on an existing type.
> > > >>>>>
> > > >>>>> What do others think?
> > > >>>>>
> > > >>>>> Regards
> > > >>>>>
> > > >>>>> Antoine.
> > > >>>>>
> > > >>>>>
> > > >>>>>>
> > > >>>>>> On Thu, Mar 3, 2022 at 6:55 AM Antoine Pitrou
> > > >>>>>> <antoine@python.org> wrote:
> > > >>>>>>
> > > >>>>>>>
> > > >>>>>>> Hello,
> > > >>>>>>>
> > > >>>>>>> Currently, the Arrow format specification restricts the
> > > >>>>>>> bitwidth of decimal numbers to either 128 or 256 bits.
> > > >>>>>>>
> > > >>>>>>> However, there is interest in allowing other bitwidths, at
> > > >>>>>>> least 32 and 64 bits for this proposal. A 64-bit (respectively
> > > >>>>>>> 32-bit) decimal datatype would allow for precisions of up to
> > > >>>>>>> 18 digits (respectively 9 digits), which are sufficient for
> > > >>>>>>> some applications which are mainly looking for exact
> > > >>>>>>> computations rather than sheer precision. Obviously, smaller
> > > >>>>>>> datatypes are cheaper to store in memory and cheaper to run
> > > >>>>>>> computations on.
> > > >>>>>>>
> > > >>>>>>> For example, the Spark documentation mentions that some
> > > >>>>>>> decimal types may fit in a Java int (32 bits) or long
> > > >>>>>>> (64 bits):
> > > >>>>>>>
> > > >>>>>>> https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DecimalType.html
> > > >>>>>>>
> > > >>>>>>> ... and a draft PR had even been filed for initial support
> > > >>>>>>> in the C++ implementation
> > > >>>>>>> (https://github.com/apache/arrow/pull/8578).
> > > >>>>>>>
> > > >>>>>>> I am therefore proposing that we relax the wording in the
> > > >>>>>>> Arrow format specification to also allow 32- and 64-bit
> > > >>>>>>> decimal types.
> > > >>>>>>>
> > > >>>>>>> This is a preliminary discussion to gather opinions and
> > > >>>>>>> potential counter-arguments against this proposal. If no
> > > >>>>>>> strong counter-argument emerges, we will probably run a vote
> > > >>>>>>> in a week or two.
> > > >>>>>>>
> > > >>>>>>> Best regards
> > > >>>>>>>
> > > >>>>>>> Antoine.
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>
> > >
> >

Re: [Discuss][Format] Add 32-bit and 64-bit Decimals

Posted by Wes McKinney <we...@gmail.com>.
I think there are a couple of embedded/entangled questions here:

* Should Arrow be usable to *transport* narrow decimals, for the (now
very abundant) use cases where Arrow is being used as an internal wire
protocol or client/server interface?

* Should *compute engines* that are Arrow-native or Arrow-compatible
provide guarantees about when and how decimals will be widened, or about
whether operations on narrow decimal inputs that can technically (from a
pedantic mathematical standpoint) yield narrow output will actually do so?

I think supporting the serialization-free transport case is pretty
important, since otherwise systems with narrow decimals have to pre-widen
them before sending them over Arrow. ClickHouse, for example, has
Decimal32 through Decimal256 [1]. Result sets returned from it would have
to be serialized to decimal128, or else defined as extension types, which
could have compatibility issues.

On the latter question, I think that no query engine for Arrow should
be compelled to offer pedantically consistent support for narrow
decimals; if a query engine received decimal32 or decimal64, it could
define an implicit cast to decimal128 and implement all kernels and
algorithms for decimal128. I note the comment from the ClickHouse link
that "Because modern CPUs do not support 128-bit integers natively,
operations on Decimal128 are emulated. Because of this Decimal128
works significantly slower than Decimal32/Decimal64." Not affording
query engines for Arrow the option to optimize some frequently used
calculations on narrow decimals (even if the implementation is
burdensome) seems unfortunate.

[1]: https://clickhouse.com/docs/en/sql-reference/data-types/decimal/
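
To illustrate the implicit-cast option, here is a rough sketch in plain
Rust (the function name is hypothetical and this is not any Arrow
library's API) of promoting decimal32 storage to decimal128 on ingest;
precision and scale metadata would carry over unchanged:

// Hypothetical sketch: widen decimal32 storage to decimal128 on ingest
// so that all kernels only need a decimal128 implementation. Sign
// extension preserves the scaled integer values exactly.
fn widen_decimal32_to_decimal128(values: &[i32]) -> Vec<i128> {
    values.iter().map(|&v| v as i128).collect()
}

fn main() {
    // 123.45 at scale 2 is stored as the scaled integer 12345.
    let narrow: Vec<i32> = vec![12345, -67890];
    assert_eq!(widen_decimal32_to_decimal128(&narrow),
               vec![12345i128, -67890i128]);
}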

On Sat, Apr 23, 2022 at 9:15 PM Jacques Nadeau <ja...@apache.org> wrote:
>
> I'm generally -0.01 against narrow decimals. My experience in practice
> has been that widening happens so quickly that they are little used and
> add unnecessary complexity. For reference, the original Arrow code
> actually implemented Decimal9 [1] and Decimal18 [2], but we removed both
> because of this experience of complexity. (Good to note that we had
> worked with them for several years, starting before the model was in the
> Arrow project, before we came to this conclusion.)
>
> One of the other commenters here spoke of the benefit to things like
> TPC-H. I doubt this would be meaningful, as I believe most (if not all)
> decimal operations in TPC-H would typically immediately widen to
> DECIMAL38.
>
> Another possible approach here might be to add DECIMAL18 to the spec
> and gauge its usage (and how much value it really adds) before adding
> DECIMAL9.
>
> It's easy to add types to the spec, hard to remove them.
>
> [1]
> https://github.com/apache/arrow/blob/fa5f0299f046c46e1b2f671e5e3b4f1956522711/java/vector/src/main/codegen/data/ValueVectorTypes.tdd#L66
> [2]
> https://github.com/apache/arrow/blob/fa5f0299f046c46e1b2f671e5e3b4f1956522711/java/vector/src/main/codegen/data/ValueVectorTypes.tdd#L81
>
>
>
> >

Re: [Discuss][Format] Add 32-bit and 64-bit Decimals

Posted by Sasha Krassovsky <kr...@gmail.com>.
Regarding TPC-H and widening: we can (and currently do, for the one query
we have implemented) cast the decimal back down to the correct precision
after each multiplication, so I don't think this is an issue. On the
other hand, there are definitely things we can do to dynamically detect
whether decimals are narrow and, if so, update only the lower half
without much overhead, so it wouldn't be a huge loss if we didn't add them.
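
Concretely, the cast-back-down step after a multiply might look like the
following plain-Rust sketch (illustrative only: the name is made up, it
truncates instead of rounding, skips overflow checks, and assumes the
output scale does not exceed the sum of the input scales):

// Multiply two decimal64 values, computing the product in 128 bits and
// rescaling back down to the desired output scale.
fn mul_decimal64(a: i64, b: i64, scale_a: u32, scale_b: u32, out_scale: u32) -> i64 {
    let wide = (a as i128) * (b as i128); // product has scale scale_a + scale_b
    let excess = scale_a + scale_b - out_scale;
    (wide / 10i128.pow(excess)) as i64
}

fn main() {
    // 1.50 * 2.25 = 3.375, truncated to 3.37 at scale 2.
    assert_eq!(mul_decimal64(150, 225, 2, 2, 2), 337);
}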

Sasha

> On Apr 23, 2022, at 7:14 PM, Jacques Nadeau <ja...@apache.org> wrote:
> 
> I'm generally -0.01 against narrow decimals. My experience in practice
> has been that widening happens so quickly that they are little used and
> add unnecessary complexity. For reference, the original Arrow code
> actually implemented Decimal9 [1] and Decimal18 [2], but we removed both
> because of this experience of complexity. (Good to note that we had
> worked with them for several years, starting before the model was in the
> Arrow project, before we came to this conclusion.)
> 
> One of the other commenters here spoke of the benefit to things like
> TPC-H. I doubt this would be meaningful, as I believe most (if not all)
> decimal operations in TPC-H would typically immediately widen to
> DECIMAL38.
> 
> Another possible approach here might be to add DECIMAL18 to the spec
> and gauge its usage (and how much value it really adds) before adding
> DECIMAL9.
> 
> It's easy to add types to the spec, hard to remove them.
> 
> [1]
> https://github.com/apache/arrow/blob/fa5f0299f046c46e1b2f671e5e3b4f1956522711/java/vector/src/main/codegen/data/ValueVectorTypes.tdd#L66
> [2]
> https://github.com/apache/arrow/blob/fa5f0299f046c46e1b2f671e5e3b4f1956522711/java/vector/src/main/codegen/data/ValueVectorTypes.tdd#L81
> 
> 
> 
>> 


Re: [Discuss][Format] Add 32-bit and 64-bit Decimals

Posted by Jacques Nadeau <ja...@apache.org>.
I'm generally -0.01 against narrow decimals. My experience in practice has
been that widening happens so quickly that they are little used and add
unnecessary complexity. For reference, the original Arrow code actually
implemented Decimal9 [1] and Decimal18 [2], but we removed both because of
this experience of complexity. (Good to note that we had worked with them
for several years, starting before the model was in the Arrow project,
before we came to this conclusion.)

One of the other commenters here spoke of the benefit to things like
TPC-H. I doubt this would be meaningful, as I believe most (if not all)
decimal operations in TPC-H would typically immediately widen to
DECIMAL38.

Another possible approach here might be to add DECIMAL18 to the spec
and gauge its usage (and how much value it really adds) before adding
DECIMAL9.

It's easy to add types to the spec, hard to remove them.

[1]
https://github.com/apache/arrow/blob/fa5f0299f046c46e1b2f671e5e3b4f1956522711/java/vector/src/main/codegen/data/ValueVectorTypes.tdd#L66
[2]
https://github.com/apache/arrow/blob/fa5f0299f046c46e1b2f671e5e3b4f1956522711/java/vector/src/main/codegen/data/ValueVectorTypes.tdd#L81
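
(A quick aside on the old names: Decimal9 and Decimal18 correspond to the
largest decimal precisions whose scaled integer values fit in 32-bit and
64-bit storage, as this small Rust sanity check shows. It is illustrative
only, not Arrow code.)

fn main() {
    // 10^9 - 1, the largest 9-digit value, fits in an i32.
    assert!(999_999_999i64 <= i32::MAX as i64);
    // 10^18 - 1, the largest 18-digit value, fits in an i64.
    assert!(999_999_999_999_999_999i128 <= i64::MAX as i128);
}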



>