You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@arrow.apache.org by Jacek Pliszka <ja...@gmail.com> on 2020/02/12 11:36:16 UTC

Decimal casting or scaling

Hi!

I am interested in having cast from Decimal to Int in pyarrow.

I have couple ideas but I am a newbie so I might be wrong:

Do I understand correctly that the problem lies in the fact that
CastFunctor knows nothing about decimal scale?

Were there any ideas how to handle this properly?

My ideas are not that great but maybe one of them would be OK:

1. We can pass 'numeric_scale_shift' or 'decimal_scale_shift' in CastOptions.
Then while casting, then numbers would be scaled properly.

2. Pass byte group selector in CastOptions i.e. when casting from
N*M bytes to N bytes we can pick any of the M groups

3. Do not modify cast but add scale/multiply compute kernel so we can
apply it explicitly prior
to casting

What do you think?  I like 1 most 2 least.

Would any of this solutions be accepted into the code?

I do not want to work on something that would be rejected immediately...

Thanks for any input provided,

Jacek

Re: [ARROW-3329] Re: Decimal casting or scaling

Posted by Wes McKinney <we...@gmail.com>.

On Wed, Feb 12, 2020 at 2:37 PM Jacek Pliszka <ja...@gmail.com> wrote:
>
> Actually these options still make some sense - but not as much as before.
>
> The use case: unit conversion
>
> Data about prices exported from sql in Decimal(38,10) which uses 128
> bit but the numbers are actually prices which expressed in cents fit
> perfectly in uint32
>
> Having scaling would reduce bandwidth/disk usage by factor of 4.

You'd need to implement a separate function for this since you're
changing the semantics of the cast. I don't think it makes sense to
convert from 123.45 (decimal) to 12345 (uint32) in Cast

> What would be the best approach to such use case?
>
> Would decimal_scale CastOption be OK or should it rather be compute
> 'multiply' kernel ?
>
> BR,
>
> Jacek
>
>
> śr., 12 lut 2020 o 19:32 Jacek Pliszka <ja...@gmail.com> napisał(a):
> >
> > OK, then what I proposed does not make sense and I can just copy the
> > solution you pointed out.
> >
> > Thank you,
> >
> > Jacek
> >
> > śr., 12 lut 2020 o 19:27 Wes McKinney <we...@gmail.com> napisał(a):
> > >
> > > On Wed, Feb 12, 2020 at 12:09 PM Jacek Pliszka <ja...@gmail.com> wrote:
> > > >
> > > > Hi!
> > > >
> > > > ARROW-3329 - we can discuss there.
> > > >
> > > > > It seems like it makes sense to implement both lossless safe casts
> > > > > (when all zeros after the decimal point) and lossy casts (fractional
> > > > > part discarded) from decimal to integer, do I have that right?
> > > >
> > > > Yes, though if I understood your examples are the same case - in both
> > > > cases fractional part is discarded - just it is all 0s in the first
> > > > case.
> > > >
> > > > The key question is whether CastFunctor in cast.cc has access to scale
> > > > of the decimal? If yes how?
> > >
> > > Yes, it's in the type of the input array. Here's a kernel
> > > implementation that uses the TimestampType metadata of the input
> > >
> > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/cast.cc#L521
> > >
> > > >
> > > > If not - these are the options I've came up with:
> > > >
> > > > Let's assume Decimal128Type value is  n
> > > >
> > > > Then I expect that base call
> > > > .cast('int64') will return  overflow for n beyond int64 values, value otherwise
> > > >
> > > > Option 1:
> > > >
> > > > .cast('int64', decimal_scale=s)  would calculate  n/10**s and return
> > > > overflow if it is beyond int64, value otherwise
> > > >
> > > > Option 2:
> > > >
> > > > .cast('int64', bytes_group=0) would return n & 0x00000000FFFFFFFF
> > > > .cast('int64', bytes_group=1) would return (n >> 64) & 0x00000000FFFFFFFF
> > > > .cast('int64') would have default value bytes_group=0
> > > >
> > > > Option 3:
> > > >
> > > > cast has no CastOptions but we add  multiply compute kernel and have
> > > > something like this instead:
> > > >
> > > > .compute('multiply', 10**-s).cast('int64')
> > > >
> > > > BR,
> > > >
> > > > Jacek

Re: [ARROW-3329] Re: Decimal casting or scaling

Posted by Jacek Pliszka <ja...@gmail.com>.

Actually these options still make some sense - but not as much as before.

The use case: unit conversion

Data about prices exported from sql in Decimal(38,10) which uses 128
bit but the numbers are actually prices which expressed in cents fit
perfectly in uint32

Having scaling would reduce bandwidth/disk usage by factor of 4.

What would be the best approach to such use case?

Would decimal_scale CastOption be OK or should it rather be compute
'multiply' kernel ?

BR,

Jacek


śr., 12 lut 2020 o 19:32 Jacek Pliszka <ja...@gmail.com> napisał(a):
>
> OK, then what I proposed does not make sense and I can just copy the
> solution you pointed out.
>
> Thank you,
>
> Jacek
>
> śr., 12 lut 2020 o 19:27 Wes McKinney <we...@gmail.com> napisał(a):
> >
> > On Wed, Feb 12, 2020 at 12:09 PM Jacek Pliszka <ja...@gmail.com> wrote:
> > >
> > > Hi!
> > >
> > > ARROW-3329 - we can discuss there.
> > >
> > > > It seems like it makes sense to implement both lossless safe casts
> > > > (when all zeros after the decimal point) and lossy casts (fractional
> > > > part discarded) from decimal to integer, do I have that right?
> > >
> > > Yes, though if I understood your examples are the same case - in both
> > > cases fractional part is discarded - just it is all 0s in the first
> > > case.
> > >
> > > The key question is whether CastFunctor in cast.cc has access to scale
> > > of the decimal? If yes how?
> >
> > Yes, it's in the type of the input array. Here's a kernel
> > implementation that uses the TimestampType metadata of the input
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/cast.cc#L521
> >
> > >
> > > If not - these are the options I've came up with:
> > >
> > > Let's assume Decimal128Type value is  n
> > >
> > > Then I expect that base call
> > > .cast('int64') will return  overflow for n beyond int64 values, value otherwise
> > >
> > > Option 1:
> > >
> > > .cast('int64', decimal_scale=s)  would calculate  n/10**s and return
> > > overflow if it is beyond int64, value otherwise
> > >
> > > Option 2:
> > >
> > > .cast('int64', bytes_group=0) would return n & 0x00000000FFFFFFFF
> > > .cast('int64', bytes_group=1) would return (n >> 64) & 0x00000000FFFFFFFF
> > > .cast('int64') would have default value bytes_group=0
> > >
> > > Option 3:
> > >
> > > cast has no CastOptions but we add  multiply compute kernel and have
> > > something like this instead:
> > >
> > > .compute('multiply', 10**-s).cast('int64')
> > >
> > > BR,
> > >
> > > Jacek

Re: [ARROW-3329] Re: Decimal casting or scaling

Posted by Jacek Pliszka <ja...@gmail.com>.

OK, then what I proposed does not make sense and I can just copy the
solution you pointed out.

Thank you,

Jacek

śr., 12 lut 2020 o 19:27 Wes McKinney <we...@gmail.com> napisał(a):
>
> On Wed, Feb 12, 2020 at 12:09 PM Jacek Pliszka <ja...@gmail.com> wrote:
> >
> > Hi!
> >
> > ARROW-3329 - we can discuss there.
> >
> > > It seems like it makes sense to implement both lossless safe casts
> > > (when all zeros after the decimal point) and lossy casts (fractional
> > > part discarded) from decimal to integer, do I have that right?
> >
> > Yes, though if I understood your examples are the same case - in both
> > cases fractional part is discarded - just it is all 0s in the first
> > case.
> >
> > The key question is whether CastFunctor in cast.cc has access to scale
> > of the decimal? If yes how?
>
> Yes, it's in the type of the input array. Here's a kernel
> implementation that uses the TimestampType metadata of the input
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/cast.cc#L521
>
> >
> > If not - these are the options I've came up with:
> >
> > Let's assume Decimal128Type value is  n
> >
> > Then I expect that base call
> > .cast('int64') will return  overflow for n beyond int64 values, value otherwise
> >
> > Option 1:
> >
> > .cast('int64', decimal_scale=s)  would calculate  n/10**s and return
> > overflow if it is beyond int64, value otherwise
> >
> > Option 2:
> >
> > .cast('int64', bytes_group=0) would return n & 0x00000000FFFFFFFF
> > .cast('int64', bytes_group=1) would return (n >> 64) & 0x00000000FFFFFFFF
> > .cast('int64') would have default value bytes_group=0
> >
> > Option 3:
> >
> > cast has no CastOptions but we add  multiply compute kernel and have
> > something like this instead:
> >
> > .compute('multiply', 10**-s).cast('int64')
> >
> > BR,
> >
> > Jacek

Re: [ARROW-3329] Re: Decimal casting or scaling

Posted by Wes McKinney <we...@gmail.com>.

On Wed, Feb 12, 2020 at 12:09 PM Jacek Pliszka <ja...@gmail.com> wrote:
>
> Hi!
>
> ARROW-3329 - we can discuss there.
>
> > It seems like it makes sense to implement both lossless safe casts
> > (when all zeros after the decimal point) and lossy casts (fractional
> > part discarded) from decimal to integer, do I have that right?
>
> Yes, though if I understood your examples are the same case - in both
> cases fractional part is discarded - just it is all 0s in the first
> case.
>
> The key question is whether CastFunctor in cast.cc has access to scale
> of the decimal? If yes how?

Yes, it's in the type of the input array. Here's a kernel
implementation that uses the TimestampType metadata of the input

https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/cast.cc#L521

>
> If not - these are the options I've came up with:
>
> Let's assume Decimal128Type value is  n
>
> Then I expect that base call
> .cast('int64') will return  overflow for n beyond int64 values, value otherwise
>
> Option 1:
>
> .cast('int64', decimal_scale=s)  would calculate  n/10**s and return
> overflow if it is beyond int64, value otherwise
>
> Option 2:
>
> .cast('int64', bytes_group=0) would return n & 0x00000000FFFFFFFF
> .cast('int64', bytes_group=1) would return (n >> 64) & 0x00000000FFFFFFFF
> .cast('int64') would have default value bytes_group=0
>
> Option 3:
>
> cast has no CastOptions but we add  multiply compute kernel and have
> something like this instead:
>
> .compute('multiply', 10**-s).cast('int64')
>
> BR,
>
> Jacek

[ARROW-3329] Re: Decimal casting or scaling

Posted by Jacek Pliszka <ja...@gmail.com>.

Hi!

ARROW-3329 - we can discuss there.

> It seems like it makes sense to implement both lossless safe casts
> (when all zeros after the decimal point) and lossy casts (fractional
> part discarded) from decimal to integer, do I have that right?

Yes, though if I understood your examples are the same case - in both
cases fractional part is discarded - just it is all 0s in the first
case.

The key question is whether CastFunctor in cast.cc has access to scale
of the decimal? If yes how?

If not - these are the options I've came up with:

Let's assume Decimal128Type value is  n

Then I expect that base call
.cast('int64') will return  overflow for n beyond int64 values, value otherwise

Option 1:

.cast('int64', decimal_scale=s)  would calculate  n/10**s and return
overflow if it is beyond int64, value otherwise

Option 2:

.cast('int64', bytes_group=0) would return n & 0x00000000FFFFFFFF
.cast('int64', bytes_group=1) would return (n >> 64) & 0x00000000FFFFFFFF
.cast('int64') would have default value bytes_group=0

Option 3:

cast has no CastOptions but we add  multiply compute kernel and have
something like this instead:

.compute('multiply', 10**-s).cast('int64')

BR,

Jacek

Re: Decimal casting or scaling

Posted by Wes McKinney <we...@gmail.com>.

hi Jacek,

What is the JIRA issue for this change? In the interest of organizing
the discussion (may make sense to move some of this to that issue)

There are no casts implemented DecimalType at all in [1], either to
decimal or from decimal to anything else.

It seems like it makes sense to implement both lossless safe casts
(when all zeros after the decimal point) and lossy casts (fractional
part discarded) from decimal to integer, do I have that right? I don't
understand your other questions very well so perhaps you can provide
some illustration about what values are to be expected when calling
".cast('int64')" on a Decimal128Type array.

Thanks
Wes

[1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/cast.cc

On Wed, Feb 12, 2020 at 5:36 AM Jacek Pliszka <ja...@gmail.com> wrote:
>
> Hi!
>
> I am interested in having cast from Decimal to Int in pyarrow.
>
> I have couple ideas but I am a newbie so I might be wrong:
>
> Do I understand correctly that the problem lies in the fact that
> CastFunctor knows nothing about decimal scale?
>
> Were there any ideas how to handle this properly?
>
> My ideas are not that great but maybe one of them would be OK:
>
> 1. We can pass 'numeric_scale_shift' or 'decimal_scale_shift' in CastOptions.
> Then while casting, then numbers would be scaled properly.
>
> 2. Pass byte group selector in CastOptions i.e. when casting from
> N*M bytes to N bytes we can pick any of the M groups
>
> 3. Do not modify cast but add scale/multiply compute kernel so we can
> apply it explicitly prior
> to casting
>
> What do you think?  I like 1 most 2 least.
>
> Would any of this solutions be accepted into the code?
>
> I do not want to work on something that would be rejected immediately...
>
> Thanks for any input provided,
>
> Jacek