Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2019/06/10 21:18:10 UTC

[DISCUSS] 32- and 64-bit decimal types

On the 1.0.0 protocol discussion, one item that we've skirted for some
time is other decimal sizes:

https://issues.apache.org/jira/browse/ARROW-2009

I understand this is a loaded subject since a deliberate decision was
made to remove types from the initial Java implementation of Arrow
that was forked from Apache Drill. However, it's a friction point that
has come up in a number of scenarios, as many database and storage
systems have 32- and 64-bit variants for low-precision decimal data.
As an example, Apache Kudu [1] has all three types, and the Parquet
columnar format allows not only 32/64-bit storage but also fixed-size
binary (with size a function of precision) and variable-length binary
encodings [2].

One of the arguments against using these types in a computational
setting is that many mathematical operations will necessarily trigger
an up-promotion to a larger type. It's hard for us to predict how
people will use the Arrow format, though, and the current situation
forces an up-promotion regardless of how the format is being used,
even for simple data transport.
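
For reference, the 32/64/128-bit variants mentioned above follow
directly from how many decimal digits each signed integer width can
hold. A small sketch (the function name is illustrative; the digit
cutoffs match the Parquet convention):

```python
def smallest_decimal_bit_width(precision: int) -> int:
    """Smallest standard integer width (in bits) that can hold every
    value of a decimal with `precision` digits. Cutoffs follow the
    Parquet convention: int32 holds up to 9 digits, int64 up to 18,
    and a 128-bit integer up to 38."""
    if precision <= 0:
        raise ValueError("precision must be positive")
    if precision <= 9:
        return 32
    if precision <= 18:
        return 64
    if precision <= 38:
        return 128
    raise ValueError("precision > 38 does not fit in 128 bits")

# A transport-only case: a DECIMAL(7, 2) column needs just 32 bits
# per value, but today must be widened to 128 bits to cross Arrow.
assert smallest_decimal_bit_width(7) == 32
```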

In anticipation of long-term needs, I would suggest a possible solution:

* Adding a bitWidth field to the Decimal table in Schema.fbs [3], with a
default value of 128
* Constraining bit widths to 32, 64, and 128 bits for the time being
* Permitting storage of smaller-precision decimals in larger storage, as
we have now
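
Concretely, the first bullet would amount to something like the
following change to the Decimal table (a sketch only; the comments and
exact wording are illustrative, not the actual patch):

```
table Decimal {
  /// Total number of decimal digits
  precision: int;
  /// Number of digits after the decimal point
  scale: int;
  /// Proposed: number of bits per value (32, 64, or 128). The
  /// default of 128 means schemas written before this field
  /// existed keep their current meaning.
  bitWidth: int = 128;
}
```

Because FlatBuffers returns a field's default when the field is absent,
old serialized schemas would decode with bitWidth = 128, unchanged.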

If this isn't deemed desirable by the community, decimal extension
types could be employed for serialization-free transport for smaller
decimals, but I view this as suboptimal.

Interested in the thoughts of others.

thanks
Wes

[1]: https://github.com/apache/kudu/blob/master/src/kudu/common/common.proto#L55
[2]: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
[3]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L121

Re: [DISCUSS] 32- and 64-bit decimal types

Posted by Antoine Pitrou <an...@python.org>.
On 10/06/2019 23:24, Wes McKinney wrote:
> 
> BTW, even if we do not allow 32/64 bit decimals in the format, we
> should consider adding a bitWidth field with static value 128 as a
> matter of future-proofing the metadata. This change would make it so
> that old readers are unable to see the bitWidth field, so the addition
> would not be possible without bumping the protocol version.

That sounds reasonable to me.

Regards

Antoine.

Re: [DISCUSS] 32- and 64-bit decimal types

Posted by Wes McKinney <we...@gmail.com>.
BTW, even if we do not allow 32/64-bit decimals in the format, we
should consider adding a bitWidth field with a static value of 128 as
a matter of future-proofing the metadata. Old readers would not be
able to see the bitWidth field, so the addition would not be possible
without bumping the protocol version.

Re: [DISCUSS] 32- and 64-bit decimal types

Posted by Jacques Nadeau <ja...@apache.org>.
I'm probably one of the people who have vocally been against this :D

On the ARROW-2009 ticket that you referenced: it has been open for 18
months and has two watchers and no comments. I suggest we wait until
there is a groundswell around this before changing anything.

We can always find an optimization that is being missed by a
particular design decision, and I agree that optimizations are missed
here. On the flip side, how broad is the benefit of this optimization?
I'd guess that many other optimizations would be more beneficial to
focus on...


Re: [DISCUSS] 32- and 64-bit decimal types

Posted by Wes McKinney <we...@gmail.com>.
That's certainly an option, too.

On Tue, Jul 2, 2019 at 9:40 PM Micah Kornfield <em...@gmail.com> wrote:
>
> Hi Wes,
> Just a question, I'm ok going either way on this but why not a new variable
> width decimal type and deprecating the old one instead of breaking forward
> compatibility?
>
> Thanks,
> Micah

Re: [DISCUSS] 32- and 64-bit decimal types

Posted by Micah Kornfield <em...@gmail.com>.
Hi Wes,
Just a question (I'm OK going either way on this), but why not add a
new variable-width decimal type and deprecate the old one, instead of
breaking forward compatibility?

Thanks,
Micah

Re: [DISCUSS] 32- and 64-bit decimal types

Posted by Wes McKinney <we...@gmail.com>.
Note that if we do make this change as described, it will probably
need to be accompanied by a bump in the MetadataVersion (for
forward-compatibility reasons; otherwise old clients won't be able to
distinguish one decimal type from another). But that seems prudent
regardless, to force an upgrade to the stable 1.x.x series of releases.
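
To make the forward-compatibility hazard concrete, here is a
hypothetical sketch (the function names and dict layout are
illustrative, not Arrow's actual metadata API) of how an old client
would silently mis-size a 32-bit decimal column:

```python
def old_reader_value_bytes(decimal_type: dict) -> int:
    # A client built before bitWidth existed hard-codes 128-bit values.
    return 128 // 8

def new_reader_value_bytes(decimal_type: dict) -> int:
    # A newer client honors bitWidth, defaulting to 128 for old data.
    return decimal_type.get("bitWidth", 128) // 8

col = {"precision": 7, "scale": 2, "bitWidth": 32}  # 4-byte values

assert new_reader_value_bytes(col) == 4
assert old_reader_value_bytes(col) == 16  # mis-sized: silent corruption
```

Bumping the MetadataVersion would turn this silent mis-read into an
explicit version error on old clients.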

Are there any other opinions about this? I can bring a vote about it
and we can decide when to actually commit a patch based on the rest of
the 1.0.0 timeline.

Re: [DISCUSS] 32- and 64-bit decimal types

Posted by Ravindra Pindikura <ra...@dremio.com>.
On Tue, Jun 11, 2019 at 2:48 AM Wes McKinney <we...@gmail.com> wrote:

> * Adding bitWidth field to Decimal table in Schema.fbs [3] with
> default value of 128

+1

-- 
Thanks and regards,
Ravindra.