Posted to user@arrow.apache.org by Jonathan Yu <jo...@gmail.com> on 2020/10/12 21:20:57 UTC

Creating and populating Arrow table directly?

Hello there,

I'm recording an a priori known number of entries per column, and I want to
create a Table using these entries. I'm currently using numpy.empty to
pre-allocate empty arrays, then creating a Table from that via the
pyarrow.table(data={}) constructor.

It seems a bit silly to create a bunch of NumPy arrays, only to convert
them to Arrow arrays to serialize. Is there any benefit to
creating/populating pyarrow.array() objects directly, and if so, how do I
do that? Otherwise, is the recommendation to first create a DataFrame in
pandas (or a number of NumPy arrays as I'm doing currently), then convert
to a Table?

I think I want to have a way to create a fixed-size Table consisting of a
number of columns, then set the values for each column one by one (similar
to iloc/iat in pandas). Is this a sensible thing to try to do?
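
For concreteness, here's roughly what I'm doing today (the column names and
dtypes below are made up for illustration):

    import numpy as np
    import pyarrow as pa

    n_rows = 1_000  # number of entries, known ahead of time

    # Pre-allocate one NumPy array per column.
    prices = np.empty(n_rows, dtype=np.float64)
    counts = np.empty(n_rows, dtype=np.int64)

    # ... fill them in element by element as values arrive ...
    for i in range(n_rows):
        prices[i] = i * 0.5
        counts[i] = i

    # Convert everything to an Arrow Table in one step at the end.
    table = pa.table({"price": prices, "count": counts})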

Best,

Jonathan

Re: Creating and populating Arrow table directly?

Posted by "Uwe L. Korn" <ma...@uwekorn.com>.
Hello,

You actually can use NumPy arrays to construct an Arrow array without the need to copy any data. The important aspect here is to treat these NumPy arrays simply as plain memory allocations. You use them to construct the separate memory buffers (i.e. the validity bitmap and data buffers) and then pass those to pyarrow.Array.from_buffers. One example of this can be seen here: https://github.com/xhochy/fletcher/blob/467054f0954c5c5796c6a5ced10891ce0e02e064/fletcher/algorithms/bool.py#L397-L407 (the context is a library where I construct a lot of pyarrow.Array instances from Python using Numba). Your underlying memory can be mutable as long as you keep it laid out in the Arrow memory format and treat it as immutable once you have passed your data structure to a third-party library/function.
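
For a simple int64 column, that looks roughly like the following (a minimal
sketch; passing None as the validity buffer means the column has no nulls):

    import numpy as np
    import pyarrow as pa

    n = 100

    # Plain NumPy allocation that will serve as the Arrow data buffer.
    data = np.empty(n, dtype=np.int64)
    data[:] = np.arange(n)  # mutate freely before handing it to Arrow

    # Wrap the NumPy memory zero-copy and build the Array directly.
    # The buffers of a primitive type are [validity bitmap, data].
    buffers = [None, pa.py_buffer(data)]
    arr = pa.Array.from_buffers(pa.int64(), n, buffers, null_count=0)

If you need nulls, you would additionally build the validity bitmap (one bit
per value) in its own NumPy allocation and wrap it the same way.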

Best
Uwe

> On 13 Oct 2020, at 22:00, Jonathan Yu <jo...@gmail.com> wrote:
> 
> Hey Jacob!
> 
> Thanks so much for your response and explanation.
> 
> On Mon, Oct 12, 2020 at 9:13 PM Jacob Quinn <quinn.jacobd@gmail.com> wrote:
> I'm not familiar with the internals of the pyarrow implementation, but as the primary author of the Arrow.jl Julia implementation, I think I can provide a little insight that's probably applicable.
> 
> Awesome! One of these days, I hope to get around to learning Julia :) 
> 
> The conceptual problem here is that the arrow format is immutable; arrow data is laid out in a fixed memory pattern. For the simplest column types (integer, float, etc.), this isn't inherently a problem because it's straightforward to allocate the fixed amount of memory, and then you could set the memory slots for array elements. For non-"primitive" types, however, it quickly becomes nontrivial. String columns, for example, are an array-of-array memory layout, where all the string bytes are laid out end to end, then individual column elements contain offsets into that fixed memory blob. That can be pretty tricky to 1) pre-allocate and 2) allow "setting" values afterward. You would need a way to allocate the exact # of bytes from all the strings, then chase the right indirections when setting values and generating the correct offsets.
> 
> This makes sense to me, but wasn't clear to me reading the documentation for PyArrow, so I'll try to contribute something when I have some free time.
> 
> So with all that said, my guess is that creating the NumPy arrays with your data and getting the data set first, then converting to the arrow format is indeed an acceptable workflow. Hopefully that helps.
> 
> This also makes sense to me given the current available APIs. However, since some of the documentation/blogs I've read about Arrow make reference to some of the inefficiencies of converting to and from pandas or NumPy formats, namely that some of the representations - like the bitmask for null values - are different, I wonder whether it might make sense to provide "builder" classes that can be used temporarily to construct arrays value by value, with a "freeze" method that would then return the immutable arrays? The contract would be that users would not be allowed to modify the data in the builder class after invoking that build/freeze function. This is at least a common pattern in Java to make it easier to build immutable data structures.
> 
> I suppose that may be a question for the dev@ list, based on the guidance in "Contributing to Arrow" documentation.
>  
> 
> -Jacob
> 
> On Mon, Oct 12, 2020 at 3:21 PM Jonathan Yu <jonathan.i.yu@gmail.com> wrote:
> Hello there,
> 
> I'm recording an a-priori known number of entries per column, and I want to create a Table using these entries. I'm currently using numpy.empty to pre-allocate empty arrays, then creating a Table from that via the pyarrow.table(data={}) constructor.
> 
> It seems a bit silly to create a bunch of NumPy arrays, only to convert them to Arrow arrays to serialize. Is there any benefit to creating/populating pyarrow.array() objects directly, and if so, how do I do that? Otherwise, is the recommendation to first create a DataFrame in pandas (or a number of NumPy arrays as I'm doing currently), then convert to a Table?
> 
> I think I want to have a way to create a fixed-size Table consisting of a number of columns, then set the values for each column one by one (similar to iloc/iat in pandas). Is this a sensible thing to try to do?
> 
> Best,
> 
> Jonathan
> 


Re: Creating and populating Arrow table directly?

Posted by Jonathan Yu <jo...@gmail.com>.
Hey Jacob!

Thanks so much for your response and explanation.

On Mon, Oct 12, 2020 at 9:13 PM Jacob Quinn <qu...@gmail.com> wrote:

> I'm not familiar with the internals of the pyarrow implementation, but as
> the primary author of the Arrow.jl Julia implementation, I think I can
> provide a little insight that's probably applicable.
>

Awesome! One of these days, I hope to get around to learning Julia :)

>
> The conceptual problem here is that the arrow format is immutable; arrow
> data is laid out in a fixed memory pattern. For the simplest column types
> (integer, float, etc.), this isn't inherently a problem because it's
> straightforward to allocate the fixed amount of memory, and then you could
> set the memory slots for array elements. For non-"primitive" types,
> however, it quickly becomes nontrivial. String columns, for example, are an
> array-of-array memory layout, where all the string bytes are laid out end
> to end, then individual column elements contain offsets into that fixed
> memory blob. That can be pretty tricky to 1) pre-allocate and 2) allow
> "setting" values afterward. You would need a way to allocate the exact # of
> bytes from all the strings, then chase the right indirections when setting
> values and generating the correct offsets.
>

This makes sense, but it wasn't clear to me from reading the PyArrow
documentation, so I'll try to contribute something when I have some free time.

> So with all that said, my guess is that creating the NumPy arrays with your
> data and getting the data set first, then converting to the arrow format is
> indeed an acceptable workflow. Hopefully that helps.
>

This also makes sense to me given the currently available APIs. However,
some of the documentation and blog posts I've read about Arrow mention
inefficiencies in converting to and from pandas or NumPy, since some of
the representations - like the bitmask for null values - differ. I wonder
whether it might make sense to provide "builder" classes that could be
used to construct arrays value by value, with a "freeze" method that then
returns the immutable arrays. The contract would be that users must not
modify the data in the builder after invoking that build/freeze method.
This is a common pattern in Java for making immutable data structures
easier to build.
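
To sketch what I have in mind (purely illustrative - none of these class or
method names exist in PyArrow today, and internally this just wraps NumPy):

    import numpy as np
    import pyarrow as pa

    class Int64ColumnBuilder:
        """Hypothetical builder: set values one by one, then freeze."""

        def __init__(self, size):
            self._data = np.empty(size, dtype=np.int64)
            self._valid = np.ones(size, dtype=bool)
            self._frozen = False

        def set(self, index, value):
            if self._frozen:
                raise RuntimeError("builder has already been frozen")
            if value is None:
                self._valid[index] = False
            else:
                self._data[index] = value

        def freeze(self):
            # Callers must not call set() after this point.
            self._frozen = True
            return pa.array(self._data, mask=~self._valid)

    builder = Int64ColumnBuilder(3)
    builder.set(0, 10)
    builder.set(1, None)    # null entry
    builder.set(2, 30)
    arr = builder.freeze()  # -> Int64Array of [10, null, 30]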

I suppose that may be a question for the dev@ list, based on the guidance
in "Contributing to Arrow" documentation.


>
> -Jacob
>
> On Mon, Oct 12, 2020 at 3:21 PM Jonathan Yu <jo...@gmail.com>
> wrote:
>
>> Hello there,
>>
>> I'm recording an a-priori known number of entries per column, and I want
>> to create a Table using these entries. I'm currently using numpy.empty to
>> pre-allocate empty arrays, then creating a Table from that via the
>> pyarrow.table(data={}) constructor.
>>
>> It seems a bit silly to create a bunch of NumPy arrays, only to convert
>> them to Arrow arrays to serialize. Is there any benefit to
>> creating/populating pyarrow.array() objects directly, and if so, how do I
>> do that? Otherwise, is the recommendation to first create a DataFrame in
>> pandas (or a number of NumPy arrays as I'm doing currently), then convert
>> to a Table?
>>
>> I think I want to have a way to create a fixed-size Table consisting of a
>> number of columns, then set the values for each column one by one (similar
>> to iloc/iat in pandas). Is this a sensible thing to try to do?
>>
>> Best,
>>
>> Jonathan
>>
>

Re: Creating and populating Arrow table directly?

Posted by Jacob Quinn <qu...@gmail.com>.
I'm not familiar with the internals of the pyarrow implementation, but as
the primary author of the Arrow.jl Julia implementation, I think I can
provide a little insight that's probably applicable.

The conceptual problem here is that the Arrow format is immutable; Arrow
data is laid out in a fixed memory pattern. For the simplest column types
(integer, float, etc.), this isn't inherently a problem, because it's
straightforward to allocate the fixed amount of memory up front and then
set the memory slots for individual array elements. For non-"primitive"
types, however, it quickly becomes nontrivial. String columns, for
example, use an array-within-array memory layout, where all the string
bytes are laid out end to end and individual column elements store
offsets into that fixed memory blob. That can be pretty tricky to 1)
pre-allocate and 2) allow "setting" values in afterward. You would need a
way to allocate the exact number of bytes across all the strings, then
chase the right indirections when setting values and generating the
correct offsets.
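
To see that layout concretely from the Python side (a small illustrative
snippet - I'm going off the Arrow format spec here rather than pyarrow
internals):

    import numpy as np
    import pyarrow as pa

    arr = pa.array(["hello", "arrow", "world"])

    buffers = arr.buffers()
    # buffers[0]: validity bitmap (may be None when there are no nulls)
    # buffers[1]: int32 offsets into the data buffer
    # buffers[2]: all of the string bytes, laid out end to end
    offsets, data = buffers[1], buffers[2]

    print(data.to_pybytes())                       # b'helloarrowworld'
    print(np.frombuffer(offsets, dtype=np.int32))  # [ 0  5 10 15]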

So with all that said, my guess is that creating the NumPy arrays, filling
in your data first, and then converting to the Arrow format is indeed an
acceptable workflow. Hopefully that helps.

-Jacob

On Mon, Oct 12, 2020 at 3:21 PM Jonathan Yu <jo...@gmail.com> wrote:

> Hello there,
>
> I'm recording an a-priori known number of entries per column, and I want
> to create a Table using these entries. I'm currently using numpy.empty to
> pre-allocate empty arrays, then creating a Table from that via the
> pyarrow.table(data={}) constructor.
>
> It seems a bit silly to create a bunch of NumPy arrays, only to convert
> them to Arrow arrays to serialize. Is there any benefit to
> creating/populating pyarrow.array() objects directly, and if so, how do I
> do that? Otherwise, is the recommendation to first create a DataFrame in
> pandas (or a number of NumPy arrays as I'm doing currently), then convert
> to a Table?
>
> I think I want to have a way to create a fixed-size Table consisting of a
> number of columns, then set the values for each column one by one (similar
> to iloc/iat in pandas). Is this a sensible thing to try to do?
>
> Best,
>
> Jonathan
>