Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2021/10/08 04:50:34 UTC

Re: [C++] Decimal arithmetic edge cases

Do we error out regardless of scale, or does rounding occur with scale > 0?

Arguably these last three cases should be:
> "1" (decimal128(38, 0)) + "1" (decimal128(38, 0)) = "2" (decimal256(39,
> 0)) (promote to decimal256)
> "1" (decimal256(76, 0)) + "1" (decimal256(76, 0)) = "2" (decimal256(76,
> 0)) (saturate at max precision)
> "99...9 (76 digits)" (decimal256(76, 0)) + "1" (decimal256(76, 0)) = error
> (overflow)


This seems more or less what I expect.

On Thu, Sep 30, 2021 at 7:03 AM Keith Kraus <ke...@gmail.com> wrote:

> For another point of reference, here's Microsoft's docs for SQL Server on
> resulting precision and scale for different operators, including its
> overflow rules:
>
> https://docs.microsoft.com/en-us/sql/t-sql/data-types/precision-scale-and-length-transact-sql?view=sql-server-ver15
>
> -Keith
>
> On Thu, Sep 30, 2021 at 9:42 AM David Li <li...@apache.org> wrote:
>
> > Hello all,
> >
> > While looking at decimal arithmetic kernels in ARROW-13130, the question
> > of what to do about overflow came up.
> >
> > Currently, our rules are based on Redshift [1], except we raise an error
> > if we exceed the maximum precision (Redshift's docs imply it saturates
> > instead). Hence, we can always add/subtract/etc. without checking for
> > overflow, but we can't do things like add two decimal256(76, 0) since
> > there's no more precision available.
> >
> > If we were to support this last case, what would people expect the
> > unchecked arithmetic kernels to do on overflow? For integers, we wrap
> > around, but this doesn't really make sense for decimals; we could also
> > return nulls, or just always raise an error (this seems the most
> > reasonable to me). Any thoughts?
> >
> > For reference, for an unchecked add, currently we have:
> > "1" (decimal256(75, 0)) + "1" (decimal256(75, 0)) = "2" (decimal256(76, 0))
> > "1" (decimal128(38, 0)) + "1" (decimal128(38, 0)) = error (not enough precision)
> > "1" (decimal256(76, 0)) + "1" (decimal256(76, 0)) = error (not enough precision)
> > "99...9 (76 digits)" (decimal256(76, 0)) + "1" (decimal256(76, 0)) = error (not enough precision)
> >
> > Arguably these last three cases should be:
> > "1" (decimal128(38, 0)) + "1" (decimal128(38, 0)) = "2" (decimal256(39, 0)) (promote to decimal256)
> > "1" (decimal256(76, 0)) + "1" (decimal256(76, 0)) = "2" (decimal256(76, 0)) (saturate at max precision)
> > "99...9 (76 digits)" (decimal256(76, 0)) + "1" (decimal256(76, 0)) = error (overflow)
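
To make those three outcomes concrete, here is a rough pure-Python sketch
(this is not Arrow's API; types are modeled only as precision/bit-width
pairs, scale is ignored since every example above uses scale 0, and it
assumes the simple rule that an add needs at most max(p1, p2) + 1 digits):

# Rough sketch of the proposed behavior for an unchecked decimal add.
MAX_P128 = 38  # max precision of decimal128
MAX_P256 = 76  # max precision of decimal256

def add_result_type(p1, p2):
    p = max(p1, p2) + 1               # result needs at most one extra digit
    if p <= MAX_P128:
        return p, 128                 # still fits in decimal128
    return min(p, MAX_P256), 256      # promote to decimal256, saturating at 76

def unchecked_add(a, b, p1, p2):
    p, width = add_result_type(p1, p2)
    result = a + b
    if abs(result) >= 10 ** p:        # value genuinely does not fit: overflow
        raise OverflowError("decimal value overflows precision %d" % p)
    return result, (p, width)

print(unchecked_add(1, 1, 38, 38))        # (2, (39, 256)): promoted to decimal256
print(unchecked_add(1, 1, 76, 76))        # (2, (76, 256)): saturated at max precision
# unchecked_add(10 ** 76 - 1, 1, 76, 76)  # OverflowError: a real overflow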
> >
> > On a related note, you could also argue that we shouldn't increase the
> > precision like this, though DBs other than Redshift also do this. Playing
> > with DuckDB a bit, though, I found it doesn't match Redshift:
> > addition/subtraction increase precision by 1 like Redshift does, but
> > division results in a float, and multiplication only adds the input
> > precisions together, while Redshift adds 1 to the sum of the precisions.
> > (That is, decimal128(3, 0) * decimal128(3, 0) is decimal128(7, 0) in
> > Redshift/Arrow but decimal128(6, 0) in DuckDB.)
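
A minimal sketch of the two multiplication precision rules being compared
(my reading of the behavior described above, not an authoritative formula):

# Result precision of a decimal multiply under the two rules mentioned above.
def mult_precision_redshift(p1, p2):
    return p1 + p2 + 1   # Redshift (and current Arrow): sum plus one

def mult_precision_duckdb(p1, p2):
    return p1 + p2       # DuckDB: just the sum

print(mult_precision_redshift(3, 3))  # 7 -> decimal128(7, 0) in Redshift/Arrow
print(mult_precision_duckdb(3, 3))    # 6 -> decimal128(6, 0) in DuckDB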
> >
> > [1]:
> > https://docs.aws.amazon.com/redshift/latest/dg/r_numeric_computations201.html
> >
> > -David
>

Re: [C++] Decimal arithmetic edge cases

Posted by Micah Kornfield <em...@gmail.com>.
Wait and see seems reasonable.  As a data point, BigQuery's corresponding
data types set precision to the max precision for the given type.  I seem to
recall from prior DBs that p=38, scale=9 was also a common default for
columns when one didn't really know what to expect (this might have just
been my experience).

On Mon, Oct 11, 2021 at 1:21 PM David Li <li...@apache.org> wrote:

> Thanks all for chiming in. Perhaps we can take a wait-and-see approach for
> now; I agree we'll eventually have to support multiple behaviors for
> different use cases. I'll synthesize the discussion so far into JIRA.
>
> -David
>
> On Mon, Oct 11, 2021, at 16:12, Weston Pace wrote:
> > I can see valid arguments for both cases.
> >
> > Traditionally, "front end" users (e.g. Iceberg, Flink, SQL Server,
> > Postgres) parameterize decimal only by precision and scale and the
> > underlying storage is an implementation detail (this is not true for
> > integers, e.g. tinyint and smallint).  From this perspective it would
> > be very strange that
> >
> > MULT(DECIMAL<5, 5>, DECIMAL<5, 5>) => DECIMAL<11, 10>
> > MULT(DECIMAL<40, 10>, DECIMAL<10, 10>) => DECIMAL<51, 20>
> > MULT(DECIMAL<20, 10>, DECIMAL<20, 10>) => ERROR
> >
> > On the other hand, Arrow DOES parameterize decimal by precision,
> > scale, and storage.  So then Antoine's point holds, which is that we
> > don't upcast integers automatically.
> >
> > At the end of the day, I don't think we have to get it right.  Arrow
> > is a back end library and front end libraries can always perform the
> > cast if automatic upcast is the behavior they desire.  If they want
> > the plan rejected at plan-time they can do that too.  Those are the
> > only two easy options available to Arrow anyways.
> >
> > Long term, I would think we'd probably end up being asked to support
> > the whole gamut of "saturate", "runtime error", and "go to null" for
> > each of the data types, controlled by an option, but I'd defer that
> > work until someone actually asked for it.
> >
> > If I had to choose an option I'd vote for automatic upcasting since I
> > think users will be thinking of decimals in terms of precision and
> > scale only.
> >
> > On Sun, Oct 10, 2021 at 10:39 PM Antoine Pitrou <an...@python.org> wrote:
> > >
> > > On Fri, 08 Oct 2021 08:47:33 -0400
> > > "David Li" <li...@apache.org> wrote:
> > > > At max precision, we currently error out regardless of scale.
> > > >
> > > > It looks like SQL Server not only saturates at max precision, but
> > > > may also reduce scale. I think we can provide that as an option if
> > > > there's demand for it. But we can at least promote decimal128 to
> > > > decimal256 and perform overflow checks.
> > >
> > > I am not sure if promotion to decimal256 is something we want to do
> > > automatically (we don't promote int32 to int64 automatically, for
> > > example).
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> >
>

Re: [C++] Decimal arithmetic edge cases

Posted by David Li <li...@apache.org>.
Thanks all for chiming in. Perhaps we can take a wait-and-see approach for now; I agree we'll eventually have to support multiple behaviors for different use cases. I'll synthesize the discussion so far into JIRA.

-David

On Mon, Oct 11, 2021, at 16:12, Weston Pace wrote:
> I can see valid arguments for both cases.
> 
> Traditionally, "front end" users (e.g. Iceberg, Flink, SQL Server,
> Postgres) parameterize decimal only by precision and scale and the
> underlying storage is an implementation detail (this is not true for
> integers, e.g. tinyint and smallint).  From this perspective it would
> be very strange that
> 
> MULT(DECIMAL<5, 5>, DECIMAL<5, 5>) => DECIMAL<11, 10>
> MULT(DECIMAL<40, 10>, DECIMAL<10, 10>) => DECIMAL<51, 20>
> MULT(DECIMAL<20, 10>, DECIMAL<20, 10>) => ERROR
> 
> On the other hand, Arrow DOES parameterize decimal by precision,
> scale, and storage.  So then Antoine's point holds, which is that we
> don't upcast integers automatically.
> 
> At the end of the day, I don't think we have to get it right.  Arrow
> is a back end library and front end libraries can always perform the
> cast if automatic upcast is the behavior they desire.  If they want
> the plan rejected at plan-time they can do that too.  Those are the
> only two easy options available to Arrow anyways.
> 
> Long term, I would think we'd probably end up being asked to support
> the whole gamut of "saturate", "runtime error", and "go to null" for
> each of the data types, controlled by an option, but I'd defer that
> work until someone actually asked for it.
> 
> If I had to choose an option I'd vote for automatic upcasting since I
> think users will be thinking of decimals in terms of precision and
> scale only.
> 
> On Sun, Oct 10, 2021 at 10:39 PM Antoine Pitrou <an...@python.org> wrote:
> >
> > On Fri, 08 Oct 2021 08:47:33 -0400
> > "David Li" <li...@apache.org> wrote:
> > > At max precision, we currently error out regardless of scale.
> > >
> > > It looks like SQL Server not only saturates at max precision, but may also reduce scale. I think we can provide that as an option if there's demand for it. But we can at least promote decimal128 to decimal256 and perform overflow checks.
> >
> > I am not sure if promotion to decimal256 is something we want to do
> > automatically (we don't promote int32 to int64 automatically, for
> > example).
> >
> > Regards
> >
> > Antoine.
> >
> >
> 

Re: [C++] Decimal arithmetic edge cases

Posted by Weston Pace <we...@gmail.com>.
I can see valid arguments for both cases.

Traditionally, "front end" users (e.g. Iceberg, Flink, SQL Server,
Postgres) parameterize decimal only by precision and scale and the
underlying storage is an implementation detail (this is not true for
integers, e.g. tinyint and smallint).  From this perspective it would
be very strange that

MULT(DECIMAL<5, 5>, DECIMAL<5, 5>) => DECIMAL<11, 10>
MULT(DECIMAL<40, 10>, DECIMAL<10, 10>) => DECIMAL<51, 20>
MULT(DECIMAL<20, 10>, DECIMAL<20, 10>) => ERROR
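
For reference, here is the rule implied by those result types, with the
storage width kept from the inputs (a plain-Python sketch, not the actual
Arrow type-resolution code), which shows why only the third case errors:

# Sketch of the multiply rule implied by the examples above, keeping the
# storage width of the inputs (no automatic upcast).
def mult_type(p1, s1, width1, p2, s2, width2):
    p, s = p1 + p2 + 1, s1 + s2           # precision/scale from the examples
    width = max(width1, width2)           # storage stays whatever came in
    max_p = 38 if width == 128 else 76
    if p > max_p:
        raise ValueError("precision %d does not fit in decimal%d" % (p, width))
    return p, s, width

print(mult_type(5, 5, 128, 5, 5, 128))      # (11, 10, 128)
print(mult_type(40, 10, 256, 10, 10, 256))  # (51, 20, 256)
# mult_type(20, 10, 128, 20, 10, 128)       # error: 41 digits need decimal256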

On the other hand, Arrow DOES parameterize decimal by precision,
scale, and storage.  So then Antoine's point holds, which is that we
don't upcast integers automatically.

At the end of the day, I don't think we have to get it right.  Arrow
is a back end library and front end libraries can always perform the
cast if automatic upcast is the behavior they desire.  If they want
the plan rejected at plan-time they can do that too.  Those are the
only two easy options available to Arrow anyways.

Long term, I would think we'd probably end up being asked to support
the whole gamut of "saturate", "runtime error", and "go to null" for
each of the data types, controlled by an option, but I'd defer that
work until someone actually asked for it.
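
A hypothetical sketch of such an option (nothing like this exists in Arrow
today, and "saturate" is read here as clamping the value, which is only one
possible interpretation):

# Hypothetical option covering the three behaviors above.
from enum import Enum

class DecimalOverflow(Enum):
    SATURATE = "saturate"
    ERROR = "runtime error"
    NULL = "go to null"

def resolve_overflow(value, precision, behavior):
    limit = 10 ** precision
    if abs(value) < limit:
        return value                  # fits, nothing to do
    if behavior is DecimalOverflow.SATURATE:
        return limit - 1 if value > 0 else -(limit - 1)
    if behavior is DecimalOverflow.ERROR:
        raise OverflowError("decimal overflow")
    return None                       # DecimalOverflow.NULL

print(resolve_overflow(10 ** 76, 76, DecimalOverflow.SATURATE))  # 99...9 (76 nines)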

If I had to choose an option I'd vote for automatic upcasting since I
think users will be thinking of decimals in terms of precision and
scale only.

On Sun, Oct 10, 2021 at 10:39 PM Antoine Pitrou <an...@python.org> wrote:
>
> On Fri, 08 Oct 2021 08:47:33 -0400
> "David Li" <li...@apache.org> wrote:
> > At max precision, we currently error out regardless of scale.
> >
> > It looks like SQL Server not only saturates at max precision, but may also reduce scale. I think we can provide that as an option if there's demand for it. But we can at least promote decimal128 to decimal256 and perform overflow checks.
>
> I am not sure if promotion to decimal256 is something we want to do
> automatically (we don't promote int32 to int64 automatically, for
> example).
>
> Regards
>
> Antoine.
>
>

Re: [C++] Decimal arithmetic edge cases

Posted by Antoine Pitrou <an...@python.org>.
On Fri, 08 Oct 2021 08:47:33 -0400
"David Li" <li...@apache.org> wrote:
> At max precision, we currently error out regardless of scale.
> 
> It looks like SQL Server not only saturates at max precision, but may also reduce scale. I think we can provide that as an option if there's demand for it. But we can at least promote decimal128 to decimal256 and perform overflow checks.

I am not sure if promotion to decimal256 is something we want to do
automatically (we don't promote int32 to int64 automatically, for
example).

Regards

Antoine.



Re: [C++] Decimal arithmetic edge cases

Posted by David Li <li...@apache.org>.
At max precision, we currently error out regardless of scale.

It looks like SQL Server not only saturates at max precision, but may also reduce scale. I think we can provide that as an option if there's demand for it. But we can at least promote decimal128 to decimal256 and perform overflow checks.
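
A rough sketch of the "saturate precision, reduce scale" idea (the actual
SQL Server formula in the doc Keith linked is more involved; this only shows
the general shape, with the rounding of values themselves left out):

# Sketch of "saturate precision, reduce scale" type resolution.
def saturate_type(p, s, max_p=38):
    if p <= max_p:
        return p, s
    lost = p - max_p                 # digits that do not fit
    return max_p, max(s - lost, 0)   # give up fractional digits first

# An add of two decimal128(38, 2) values wants (39, 2); saturating gives
# (38, 1), so results are rounded to one fractional digit instead of erroring.
print(saturate_type(39, 2))  # (38, 1)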

Thanks,
David

On Fri, Oct 8, 2021, at 00:50, Micah Kornfield wrote:
> Do we error out regardless of scale, or does rounding occur with scale > 0?
> 
> Arguably these last three cases should be:
> > "1" (decimal128(38, 0)) + "1" (decimal128(38, 0)) = "2" (decimal256(39,
> > 0)) (promote to decimal256)
> > "1" (decimal256(76, 0)) + "1" (decimal256(76, 0)) = "2" (decimal256(76,
> > 0)) (saturate at max precision)
> > "99...9 (76 digits)" (decimal256(76, 0)) + "1" (decimal256(76, 0)) = error
> > (overflow)
> 
> 
> This seems more or less what I expect.
> 
> On Thu, Sep 30, 2021 at 7:03 AM Keith Kraus <ke...@gmail.com> wrote:
> 
> > For another point of reference, here's Microsoft's docs for SQL Server on
> > resulting precision and scale for different operators, including its
> > overflow rules:
> >
> > https://docs.microsoft.com/en-us/sql/t-sql/data-types/precision-scale-and-length-transact-sql?view=sql-server-ver15
> >
> > -Keith
> >
> > On Thu, Sep 30, 2021 at 9:42 AM David Li <li...@apache.org> wrote:
> >
> > > Hello all,
> > >
> > > While looking at decimal arithmetic kernels in ARROW-13130, the question
> > > of what to do about overflow came up.
> > >
> > > Currently, our rules are based on Redshift [1], except we raise an error
> > > if we exceed the maximum precision (Redshift's docs imply it saturates
> > > instead). Hence, we can always add/subtract/etc. without checking for
> > > overflow, but we can't do things like add two decimal256(76, 0) since
> > > there's no more precision available.
> > >
> > > If we were to support this last case, what would people expect the
> > > unchecked arithmetic kernels to do on overflow? For integers, we wrap
> > > around, but this doesn't really make sense for decimals; we could also
> > > return nulls, or just always raise an error (this seems the most
> > > reasonable to me). Any thoughts?
> > >
> > > For reference, for an unchecked add, currently we have:
> > > "1" (decimal256(75, 0)) + "1" (decimal256(75, 0)) = "2" (decimal256(76, 0))
> > > "1" (decimal128(38, 0)) + "1" (decimal128(38, 0)) = error (not enough precision)
> > > "1" (decimal256(76, 0)) + "1" (decimal256(76, 0)) = error (not enough precision)
> > > "99...9 (76 digits)" (decimal256(76, 0)) + "1" (decimal256(76, 0)) = error (not enough precision)
> > >
> > > Arguably these last three cases should be:
> > > "1" (decimal128(38, 0)) + "1" (decimal128(38, 0)) = "2" (decimal256(39, 0)) (promote to decimal256)
> > > "1" (decimal256(76, 0)) + "1" (decimal256(76, 0)) = "2" (decimal256(76, 0)) (saturate at max precision)
> > > "99...9 (76 digits)" (decimal256(76, 0)) + "1" (decimal256(76, 0)) = error (overflow)
> > >
> > > On a related note, you could also argue that we shouldn't increase the
> > > precision like this, though DBs other than Redshift also do this. Playing
> > > with DuckDB a bit, though, I found it doesn't match Redshift:
> > > addition/subtraction increase precision by 1 like Redshift does, but
> > > division results in a float, and multiplication only adds the input
> > > precisions together, while Redshift adds 1 to the sum of the precisions.
> > > (That is, decimal128(3, 0) * decimal128(3, 0) is decimal128(7, 0) in
> > > Redshift/Arrow but decimal128(6, 0) in DuckDB.)
> > >
> > > [1]:
> > > https://docs.aws.amazon.com/redshift/latest/dg/r_numeric_computations201.html
> > >
> > > -David
> >
>