Posted to dev@arrow.apache.org by Ian Joiner <ia...@gmail.com> on 2023/05/23 17:13:09 UTC

New datatype: Huge integers & decimals

Hi,

We need really large integers (128, 256 and 512 bits) as well
as decimals (up to at least decimal1024), because they are actually used in
the crypto / web3 space.

See https://docs.rs/primitive-types/latest/primitive_types/ for an example
of what needs to be supported.

If accepted, we can implement the types for C++/Python and Rust.

Thanks,
Ian

Re: New datatype: Huge integers & decimals

Posted by Spencer Nelson <sw...@uw.edu>.

A further advantage of third-party extension types is that they give you a
way to experiment without as much concern for compatibility.

I think writing an extension type if possible, and promoting it to an
official type (extension or otherwise) only if necessary, is a good general
approach.

On Tue, May 23, 2023 at 2:48 PM Will Jones <wi...@gmail.com> wrote:

> Hello Arrow devs,
>
> I actually have a use case where we'd like to support a new number type in
> Arrow, but instead of larger numbers, smaller ones. :) For machine learning
> use cases, we at Lance would like to support bfloat16 [1]. These are 16-bit
> floating point numbers that trade significant digits to exponent, so they
> have the same range as float 32 but less precision than float 16. They are
> natively supported on newer AI-focused silicon [1]
>
> I'm just starting to look at this, so not yet sure what the pros and cons
> are of implementing it as an extension type versus a native Arrow type. My
> initial ideas:
>
> Pros of an extension type:
> * It can be moved through Arrow-native systems that don't implement it, as
> long as they preserve extension type information.
>
> Pros of a native type:
> * We have established patterns for writing compute kernels for natively
> supported types.
>
> If we were to implement these as extension types, I think bfloat16 and the
> number types Ian Joiner mentions would be best implemented as extension
> types based on fixed-size binary. We have a native float16 type already,
> but I think making bfloat16 an extension type based on that it could get
> accidentally manipulated as a float16, which IIUC would be invalid.
>
> If anyone has any advice from our work thus far on extension types, I'd
> welcome your input.
>
> Best,
>
> Will Jones
>
> [1]
> https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus
> [2] https://en.wikipedia.org/wiki/Bfloat16_floating-point_format
>
> On Tue, May 23, 2023 at 10:49 AM Antoine Pitrou <an...@python.org>
> wrote:
>
> >
> > Your question seems unspecific, but we now have the possibility of
> > standardizing canonical extension types (which are, of course, optional
> > to implement and support):
> >
> >
> > https://arrow.apache.org/docs/format/CanonicalExtensions.html
> >
> >
> > On 23/05/2023 at 19:45, Ian Joiner wrote:
> > > That’s a possibility. Do we consider officially support them?
> > >
> > >
> > > On Tuesday, May 23, 2023, Antoine Pitrou <an...@python.org> wrote:
> > >
> > >>
> > >> I'm not sure what you're actually proposing here. A new extension type
> > >> perhaps?
> > >>
> > >>
> > >> On 23/05/2023 at 19:13, Ian Joiner wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> We need to have really large integers (with 128, 256 and 512 bits) as well
> > >>> as decimals (up to at least decimal1024) because they do actually exist in
> > >>> crypto / web3 space.
> > >>>
> > >>> See https://docs.rs/primitive-types/latest/primitive_types/ for an
> > >>> example
> > >>> of what needs to be supported.
> > >>>
> > >>> If accepted we can implement the types for C++/Python and Rust.
> > >>>
> > >>> Thanks,
> > >>> Ian
> > >>>
> > >>>
> > >
> >
>

Re: New datatype: Huge integers & decimals

Posted by Antoine Pitrou <an...@python.org>.

Hi Will,

I'll also note that, while float16 is a first-class datatype, I'm not 
sure any Arrow implementation is currently able to do anything other than 
transport it.

You're right that we'd probably want extension number types to be based 
on fixed-size-binary. A complication is endianness, though. Currently, 
we have logic (for example in Arrow C++) to optionally byte-swap number 
data at the edge (when receiving non-native-endian data). How would it 
work with extension types based on fixed-size-binary? There is a risk 
that implementations recognizing the bfloat16 extension type would 
byte-swap, but others would not, leading to corrupt data streams.

The bfloat16 extension type would then have to be parametrized with its 
endianness, or mandate a fixed endianness (probably little endian).
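
To make the byte-swapping risk concrete, here is a minimal Python sketch. It is not Arrow code: the helper name bf16_decode is made up for illustration, and it simply treats a bfloat16 value as the top 16 bits of a float32. It shows what happens when a two-byte bfloat16 value written in little-endian order is read by a consumer that assumes the opposite byte order.

```python
import struct

def bf16_decode(raw: bytes, byteorder: str) -> float:
    """Decode 2 bytes as bfloat16, assuming the given byte order.

    Hypothetical helper: a bfloat16 is widened to float32 by placing its
    16 bits in the upper half of a 32-bit word.
    """
    bits = int.from_bytes(raw, byteorder)
    return struct.unpack(">f", (bits << 16).to_bytes(4, "big"))[0]

# 0x3F80 is the bfloat16 bit pattern for 1.0; store it little-endian.
wire = (0x3F80).to_bytes(2, "little")

correct = bf16_decode(wire, "little")  # reader agrees with the writer
misread = bf16_decode(wire, "big")     # reader assumes the other order

print(correct, misread)
```

With a mandated byte order (for example little-endian), every reader would call the decoder the same way regardless of its native endianness, so the second case could not arise silently.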

For bigints, I think the situation is simpler. Little-endian is, I 
think, a much more convenient representation for bigints (at the cost of 
some potential runtime byte-shuffling on big-endian systems).
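
As a rough illustration of why little-endian storage is convenient for bigints, the sketch below stores a 256-bit unsigned integer in 32 bytes (as a fixed-size-binary value might) and adds two such values limb by limb. The names u256_to_le_bytes and add_u256_le are hypothetical, and wrapping on overflow is an assumption made here, not something the thread specifies.

```python
import struct

WIDTH = 32                 # bytes in a 256-bit value
MASK64 = (1 << 64) - 1     # one 64-bit limb

def u256_to_le_bytes(n: int) -> bytes:
    """Serialize an unsigned 256-bit integer as 32 little-endian bytes."""
    return n.to_bytes(WIDTH, "little")

def add_u256_le(a: bytes, b: bytes) -> bytes:
    """Add two little-endian 256-bit values, wrapping on overflow.

    In little-endian layout, limb i sits at byte offset 8*i, so the carry
    propagates in the same direction as increasing memory addresses.
    """
    limbs_a = struct.unpack("<4Q", a)
    limbs_b = struct.unpack("<4Q", b)
    out, carry = [], 0
    for x, y in zip(limbs_a, limbs_b):
        total = x + y + carry
        out.append(total & MASK64)
        carry = total >> 64
    return struct.pack("<4Q", *out)

x = (1 << 200) + MASK64
y = 1
result = int.from_bytes(
    add_u256_le(u256_to_le_bytes(x), u256_to_le_bytes(y)), "little"
)
```

On a big-endian machine each limb would need a byte swap on load and store, which is the runtime byte-shuffling cost mentioned above.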

Regards

Antoine.


On 23/05/2023 at 23:47, Will Jones wrote:
> 
> I'm just starting to look at this, so not yet sure what the pros and cons
> are of implementing it as an extension type versus a native Arrow type. My
> initial ideas:
> 
> Pros of an extension type:
> * It can be moved through Arrow-native systems that don't implement it, as
> long as they preserve extension type information.
> 
> Pros of a native type:
> * We have established patterns for writing compute kernels for natively
> supported types.
> 
> If we were to implement these as extension types, I think bfloat16 and the
> number types Ian Joiner mentions would be best implemented as extension
> types based on fixed-size binary. We have a native float16 type already,
> but I think making bfloat16 an extension type based on that it could get
> accidentally manipulated as a float16, which IIUC would be invalid.
> 
> If anyone has any advice from our work thus far on extension types, I'd
> welcome your input.
> 
> Best,
> 
> Will Jones
> 
> [1]
> https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus
> [2] https://en.wikipedia.org/wiki/Bfloat16_floating-point_format
> 
> On Tue, May 23, 2023 at 10:49 AM Antoine Pitrou <an...@python.org> wrote:
> 
>>
>> Your question seems unspecific, but we now have the possibility of
>> standardizing canonical extension types (which are, of course, optional
>> to implement and support):
>>
>> https://arrow.apache.org/docs/format/CanonicalExtensions.html
>>
>>
>> On 23/05/2023 at 19:45, Ian Joiner wrote:
>>> That’s a possibility. Do we consider officially support them?
>>>
>>>
>>> On Tuesday, May 23, 2023, Antoine Pitrou <an...@python.org> wrote:
>>>
>>>>
>>>> I'm not sure what you're actually proposing here. A new extension type
>>>> perhaps?
>>>>
>>>>
>>>> On 23/05/2023 at 19:13, Ian Joiner wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We need to have really large integers (with 128, 256 and 512 bits) as well
>>>>> as decimals (up to at least decimal1024) because they do actually exist in
>>>>> crypto / web3 space.
>>>>>
>>>>> See https://docs.rs/primitive-types/latest/primitive_types/ for an
>>>>> example
>>>>> of what needs to be supported.
>>>>>
>>>>> If accepted we can implement the types for C++/Python and Rust.
>>>>>
>>>>> Thanks,
>>>>> Ian
>>>>>
>>>>>
>>>
>>
>

Re: New datatype: Huge integers & decimals

Posted by Will Jones <wi...@gmail.com>.

Hello Arrow devs,

I actually have a use case where we'd like to support a new number type in
Arrow, but for smaller numbers rather than larger ones. :) For machine learning
use cases, we at Lance would like to support bfloat16 [1]. These are 16-bit
floating point numbers that trade significand bits for exponent bits, so they
have the same range as float32 but less precision than float16. They are
natively supported on newer AI-focused silicon [2].
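
The range/precision trade-off can be seen in a small Python sketch. It assumes bfloat16 is formed by keeping the top 16 bits of a float32; truncation is used here for simplicity, whereas real hardware typically rounds to nearest. The helper names to_bfloat16 and to_float16 are illustrative only.

```python
import struct

def to_bfloat16(x: float) -> float:
    """Round-trip x through bfloat16 by truncating a float32 to its top 16 bits."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", bits32 & 0xFFFF0000))
    return y

def to_float16(x: float):
    """Round-trip x through IEEE float16; return None if x is out of range."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return None

# Precision: float16 keeps 10 fraction bits, bfloat16 only 7.
small_bf16 = to_bfloat16(1.001)
small_f16 = to_float16(1.001)

# Range: bfloat16 shares float32's 8 exponent bits, float16 has only 5.
large_bf16 = to_bfloat16(1.0e38)
large_f16 = to_float16(1.0e38)

print(small_bf16, small_f16, large_bf16, large_f16)
```

Here 1.001 survives in float16 with a small error but collapses to 1.0 in bfloat16, while 1.0e38 is representable (approximately) in bfloat16 but out of range for float16.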

I'm just starting to look at this, so not yet sure what the pros and cons
are of implementing it as an extension type versus a native Arrow type. My
initial ideas:

Pros of an extension type:
* It can be moved through Arrow-native systems that don't implement it, as
long as they preserve extension type information.

Pros of a native type:
* We have established patterns for writing compute kernels for natively
supported types.

If we were to implement these as extension types, I think bfloat16 and the
number types Ian Joiner mentions would be best implemented as extension
types based on fixed-size binary. We have a native float16 type already,
but if bfloat16 were an extension type based on that, I think it could get
accidentally manipulated as a float16, which IIUC would be invalid.
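
A short Python sketch of why float16 storage would be hazardous: the same two bytes decode to different numbers under the two formats. The function names are hypothetical, and bfloat16 is modelled as the upper half of a little-endian float32.

```python
import struct

def decode_bfloat16(raw: bytes) -> float:
    """Decode 2 little-endian bytes as bfloat16 (the top half of a float32)."""
    return struct.unpack("<f", b"\x00\x00" + raw)[0]

def decode_float16(raw: bytes) -> float:
    """Decode 2 little-endian bytes as IEEE 754 half precision."""
    return struct.unpack("<e", raw)[0]

raw = bytes.fromhex("803f")      # bit pattern 0x3F80, stored little-endian

as_bf16 = decode_bfloat16(raw)   # 1.0
as_f16 = decode_float16(raw)     # 1.875

print(as_bf16, as_f16)
```

With opaque fixed-size binary storage, a system that does not know the extension type can only move the bytes around, so this kind of silent reinterpretation is not possible.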

If anyone has any advice from our work thus far on extension types, I'd
welcome your input.

Best,

Will Jones

[1]
https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus
[2] https://en.wikipedia.org/wiki/Bfloat16_floating-point_format

On Tue, May 23, 2023 at 10:49 AM Antoine Pitrou <an...@python.org> wrote:

>
> Your question seems unspecific, but we now have the possibility of
> standardizing canonical extension types (which are, of course, optional
> to implement and support):
>
> https://arrow.apache.org/docs/format/CanonicalExtensions.html
>
>
> On 23/05/2023 at 19:45, Ian Joiner wrote:
> > That’s a possibility. Do we consider officially support them?
> >
> >
> > On Tuesday, May 23, 2023, Antoine Pitrou <an...@python.org> wrote:
> >
> >>
> >> I'm not sure what you're actually proposing here. A new extension type
> >> perhaps?
> >>
> >>
> >> On 23/05/2023 at 19:13, Ian Joiner wrote:
> >>
> >>> Hi,
> >>>
> >>> We need to have really large integers (with 128, 256 and 512 bits) as well
> >>> as decimals (up to at least decimal1024) because they do actually exist in
> >>> crypto / web3 space.
> >>>
> >>> See https://docs.rs/primitive-types/latest/primitive_types/ for an
> >>> example
> >>> of what needs to be supported.
> >>>
> >>> If accepted we can implement the types for C++/Python and Rust.
> >>>
> >>> Thanks,
> >>> Ian
> >>>
> >>>
> >
>

Re: New datatype: Huge integers & decimals

Posted by Antoine Pitrou <an...@python.org>.

Your question seems unspecific, but we now have the possibility of 
standardizing canonical extension types (which are, of course, optional 
to implement and support):

https://arrow.apache.org/docs/format/CanonicalExtensions.html


On 23/05/2023 at 19:45, Ian Joiner wrote:
> That’s a possibility. Do we consider officially support them?
> 
> 
> On Tuesday, May 23, 2023, Antoine Pitrou <an...@python.org> wrote:
> 
>>
>> I'm not sure what you're actually proposing here. A new extension type
>> perhaps?
>>
>>
>> On 23/05/2023 at 19:13, Ian Joiner wrote:
>>
>>> Hi,
>>>
>>> We need to have really large integers (with 128, 256 and 512 bits) as well
>>> as decimals (up to at least decimal1024) because they do actually exist in
>>> crypto / web3 space.
>>>
>>> See https://docs.rs/primitive-types/latest/primitive_types/ for an
>>> example
>>> of what needs to be supported.
>>>
>>> If accepted we can implement the types for C++/Python and Rust.
>>>
>>> Thanks,
>>> Ian
>>>
>>>
>

Re: New datatype: Huge integers & decimals

Posted by Ian Joiner <ia...@gmail.com>.

That’s a possibility. Would we consider officially supporting them?


On Tuesday, May 23, 2023, Antoine Pitrou <an...@python.org> wrote:

>
> I'm not sure what you're actually proposing here. A new extension type
> perhaps?
>
>
> On 23/05/2023 at 19:13, Ian Joiner wrote:
>
>> Hi,
>>
>> We need to have really large integers (with 128, 256 and 512 bits) as well
>> as decimals (up to at least decimal1024) because they do actually exist in
>> crypto / web3 space.
>>
>> See https://docs.rs/primitive-types/latest/primitive_types/ for an
>> example
>> of what needs to be supported.
>>
>> If accepted we can implement the types for C++/Python and Rust.
>>
>> Thanks,
>> Ian
>>
>>

Re: New datatype: Huge integers & decimals

Posted by Antoine Pitrou <an...@python.org>.

I'm not sure what you're actually proposing here. A new extension type 
perhaps?


On 23/05/2023 at 19:13, Ian Joiner wrote:
> Hi,
> 
> We need to have really large integers (with 128, 256 and 512 bits) as well
> as decimals (up to at least decimal1024) because they do actually exist in
> crypto / web3 space.
> 
> See https://docs.rs/primitive-types/latest/primitive_types/ for an example
> of what needs to be supported.
> 
> If accepted we can implement the types for C++/Python and Rust.
> 
> Thanks,
> Ian
>

Re: New datatype: Huge integers & decimals

Posted by Felipe Oliveira Carvalho <fe...@gmail.com>.

Have you considered using fixed-length binary values for these?

Crypto algorithms might logically be defined in terms of mathematical
operations on integers, but their efficient implementation tends to feature
inlined operations at the machine word level instead of generic add, div,
mod, mul operations at the big-integer level. Having these as logical
integers in Arrow would be a lot of work and they wouldn’t be adequate
building blocks for efficient crypto algorithms.

For instance, compare the pseudocode for Poly1305

      clamp(r): r &= 0x0ffffffc0ffffffc0ffffffc0fffffff
      poly1305_mac(msg, key):
         r = le_bytes_to_num(key[0..15])
         clamp(r)
         s = le_bytes_to_num(key[16..31])
         a = 0  /* a is the accumulator */
         p = (1<<130)-5
         for i=1 upto ceil(msg length in bytes / 16)
            n = le_bytes_to_num(msg[((i-1)*16)..(i*16)] | [0x01])
            a += n
            a = (r * a) % p
            end
         a += s
         return num_to_16_le_bytes(a)
         end

with an implementation using only 128-bit and 64-bit math operations

https://github.com/hacl-star/hacl-star/blob/main/dist/gcc-compatible/Hacl_Poly1305_128.c
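
For reference, the pseudocode above translates almost line for line into Python, whose built-in integers are arbitrary precision. This is only a sketch of the "logical big-integer" formulation: it is not constant-time and not suitable for production use, which is part of the point being made about efficient implementations. The sample key and message are the ones used in the Poly1305 example of RFC 8439, section 2.5.2.

```python
P = (1 << 130) - 5
CLAMP = 0x0ffffffc0ffffffc0ffffffc0fffffff

def poly1305_mac(msg: bytes, key: bytes) -> bytes:
    """Compute a Poly1305 tag using plain big-integer arithmetic."""
    if len(key) != 32:
        raise ValueError("key must be 32 bytes")
    r = int.from_bytes(key[:16], "little") & CLAMP   # clamp(r)
    s = int.from_bytes(key[16:], "little")
    a = 0                                            # accumulator
    for i in range(0, len(msg), 16):
        # Append 0x01 as the high byte of each (possibly short) block.
        n = int.from_bytes(msg[i:i + 16] + b"\x01", "little")
        a = (r * (a + n)) % P
    a += s
    # Keep only the low 128 bits, serialized little-endian.
    return (a & ((1 << 128) - 1)).to_bytes(16, "little")

key = bytes.fromhex(
    "85d6be7857556d337f4452fe42d506a8"
    "0103808afb0db2fd4abff6af4149f51b"
)
msg = b"Cryptographic Forum Research Group"
tag = poly1305_mac(msg, key)
print(tag.hex())
```

Every step here is a generic multiply or modulo on a 130-bit-plus value, whereas the linked C implementation splits the work into fixed-width limb operations.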

What crypto algorithms are you trying to implement using Arrow data?

—
Felipe

On Tue, 23 May 2023 at 14:13 Ian Joiner <ia...@gmail.com> wrote:

> Hi,
>
> We need to have really large integers (with 128, 256 and 512 bits) as well
> as decimals (up to at least decimal1024) because they do actually exist in
> crypto / web3 space.
>
> See https://docs.rs/primitive-types/latest/primitive_types/ for an example
> of what needs to be supported.
>
> If accepted we can implement the types for C++/Python and Rust.
>
> Thanks,
> Ian
>