Posted to dev@arrow.apache.org by Alessandro Molina <al...@ursacomputing.com> on 2021/11/03 12:49:42 UTC

Question about Arrow Mutable/Immutable Arrays choice

I recently noticed that the Java implementation exposes set/setSafe
methods that allow mutating Arrow Arrays [1].

This seems to be at odds with the general design of the C++ library (and,
consequently, Python and R), where Arrays are immutable and can be
modified only through compute functions that return copies.

The Arrow Format documentation [2] seems to suggest that mutation of data
structures is possible and left as an implementation detail. Some users
might want to mutate existing structures (for example, to avoid the memory
cost of copies when dealing with big arrays), so I think there are reasons
both for allowing mutation of Arrays and for disallowing it. It probably
makes sense to ensure that all the implementations agree on such a
fundamental choice, so that users don't form expectations that no longer
hold when they cross language barriers.

[1] https://arrow.apache.org/docs/java/reference/org/apache/arrow/vector/SmallIntVector.html#setSafe-int-int-
[2] https://arrow.apache.org/docs/format/Columnar.html
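
To make the trade-off concrete, here is a minimal sketch in plain Python
(standard library only, not Arrow code). The function names
add_scalar_copy and add_scalar_in_place are hypothetical; the first mimics
a compute function that returns a copy, the second mimics a set-style
mutation that reuses the existing memory.

```python
from array import array


def add_scalar_copy(values, delta):
    """Immutable style: leave the input untouched and return a new array."""
    return array(values.typecode, (v + delta for v in values))


def add_scalar_in_place(values, delta):
    """Mutable style: overwrite each slot, allocating no second array."""
    for i in range(len(values)):
        values[i] += delta


# "h" is a signed 16-bit integer, chosen here to echo SmallIntVector.
original = array("h", [1, 2, 3])

copied = add_scalar_copy(original, 10)
snapshot = original.tolist()   # the input is still intact at this point

add_scalar_in_place(original, 10)   # now the input itself has changed
```

The copy costs a second allocation of the same size, while the in-place
version changes what every other holder of `original` sees.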

Re: Question about Arrow Mutable/Immutable Arrays choice

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.

I think the C Data Interface requires arrays to be immutable; otherwise two
implementations will race when mutating and reading the shared regions,
since we have no mechanism to synchronize read/write access across the
boundary.
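
A rough illustration of the concern, in plain Python rather than the
actual C Data Interface: a memoryview stands in for the zero-copy export,
and the "producer" and "consumer" names are only labels for this sketch.
It runs in a single thread, so it shows the visibility problem rather than
a true race; with real threads the change could land in the middle of a
read.

```python
# Producer side: owns the memory and hands out a zero-copy view of it.
producer_buffer = bytearray([1, 2, 3, 4])
consumer_view = memoryview(producer_buffer)   # no copy, same memory

# Consumer side: computes something from the shared region.
sum_before = sum(consumer_view)

# The producer mutates after the export; nothing notifies the consumer.
producer_buffer[0] = 100

# The same view now yields a different answer.
sum_after = sum(consumer_view)
```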

Best,
Jorge



Re: Question about Arrow Mutable/Immutable Arrays choice

Posted by Wes McKinney <we...@gmail.com>.

I don't think there is a problem with having "internal" data
structures that provide mutation and other capabilities, but when
internal data structures are made external (exported to consumers
through "public" C++ APIs / namespaces) then immutability is good
there (or at least forcing a consumer to dig into the arrow::Buffer
objects if they want to mutate memory).
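
One way to picture that split, as a sketch in plain Python: the class
ReadOnlyInt16Array and its buffer() method are hypothetical and are not
the arrow::Array or arrow::Buffer API. The public surface offers reads
only, and mutation requires deliberately reaching for the underlying
memory.

```python
import struct

ITEM_SIZE = struct.calcsize("h")   # bytes per signed 16-bit value


class ReadOnlyInt16Array:
    """Hypothetical public wrapper: supports reads, defines no __setitem__."""

    def __init__(self, values):
        self._buffer = bytearray(len(values) * ITEM_SIZE)
        self._view = memoryview(self._buffer).cast("h")
        for i, v in enumerate(values):
            self._view[i] = v

    def __len__(self):
        return len(self._view)

    def __getitem__(self, index):
        return self._view[index]

    def buffer(self):
        """Escape hatch: the raw, writable memory behind the array."""
        return self._buffer


arr = ReadOnlyInt16Array([1, 2, 3])

try:
    arr[0] = 99                      # the public surface refuses mutation
    public_write_allowed = True
except TypeError:
    public_write_allowed = False

# Mutation is still possible, but only by going through the buffer.
memoryview(arr.buffer()).cast("h")[0] = 99
```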

On Thu, Nov 4, 2021 at 5:11 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> Le 04/11/2021 à 10:56, Alessandro Molina a écrit :
> > On Wed, Nov 3, 2021 at 11:34 PM Jacques Nadeau <ja...@apache.org> wrote:
> >
> >
> >> In a perfect world we would have done a better job in the object
> >> hierarchy/behavior of making this explicit but we don't live in that world,
> >> unfortunately.
> >
> >
> > Makes sense, but I thought that was exactly the reason why set/setSafe are
> > only available for FixedWidth vectors.
> > On those once the size is set it seems fairly safe to mutate them if the
> > set methods take care of updating null values too.
>
> Not if you have multiple threads reading the same data, which is going
> to be common in many Arrow applications.  To enable efficient access
> without thread synchronization, immutable data is almost mandatory.
>
> So I don't think we should modify the C++ Array APIs to allow for
> mutations (as you say, people can still do dirty things at a lower
> level if they want to, and they have to live with the consequences).
>
> Regards
>
> Antoine.

Re: Question about Arrow Mutable/Immutable Arrays choice

Posted by Antoine Pitrou <an...@python.org>.

Le 04/11/2021 à 10:56, Alessandro Molina a écrit :
> On Wed, Nov 3, 2021 at 11:34 PM Jacques Nadeau <ja...@apache.org> wrote:
> 
> 
>> In a perfect world we would have done a better job in the object
>> hierarchy/behavior of making this explicit but we don't live in that world,
>> unfortunately.
> 
> 
> Makes sense, but I thought that was exactly the reason why set/setSafe are
> only available for FixedWidth vectors.
> On those once the size is set it seems fairly safe to mutate them if the
> set methods take care of updating null values too.

Not if you have multiple threads reading the same data, which is going 
to be common in many Arrow applications.  To enable efficient access 
without thread synchronization, immutable data is almost mandatory.

So I don't think we should modify the C++ Array APIs to allow for
mutations (as you say, people can still do dirty things at a lower
level if they want to, and they have to live with the consequences).
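
As a small sketch of the point about lock-free reads (plain Python, with a
tuple standing in for an immutable array; the names values and partial_sum
are made up for the example): several threads read the same data with no
synchronization, which is safe only because no writer can exist.

```python
from concurrent.futures import ThreadPoolExecutor

values = tuple(range(1_000))   # immutable: nobody can write to it


def partial_sum(bounds):
    start, stop = bounds
    return sum(values[start:stop])   # pure read, so no lock is taken


# Four disjoint ranges, each summed by its own worker thread.
chunks = [(i, i + 250) for i in range(0, len(values), 250)]
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, chunks))
```

If `values` were mutable and some thread wrote to it, each read would need
a lock or some other coordination to return a consistent result.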

Regards

Antoine.

Re: Question about Arrow Mutable/Immutable Arrays choice

Posted by Alessandro Molina <al...@ursacomputing.com>.

On Wed, Nov 3, 2021 at 11:34 PM Jacques Nadeau <ja...@apache.org> wrote:

> In a perfect world we would have done a better job in the object
> hierarchy/behavior of making this explicit but we don't live in that world,
> unfortunately.

Makes sense, but I thought that was exactly the reason why set/setSafe are
only available for FixedWidth vectors.
On those, once the size is set, it seems fairly safe to mutate them as long
as the set methods take care of updating null values too.

More generally, my question is whether we should add mutating functions
for fixed-size arrays to C++ and the other bindings, or remove the
mutation features from the Java API and have people deal with buffers
directly if they want to mutate things (which makes it more explicit that
you are messing with internals). Either way, we would get a consistent
experience across bindings.

Re: Question about Arrow Mutable/Immutable Arrays choice

Posted by Jacques Nadeau <ja...@apache.org>.

Hey Alessandro, take a look at the top level docs on ValueVector:

https://arrow.apache.org/docs/java/reference/org/apache/arrow/vector/ValueVector.html

Specifically the following:

   - values need to be written in order (e.g. index 0, 1, 2, 5)
   - null vectors start with all values as null before writing anything
   - for variable width types, the offset vector should be all zeros before
   writing
   - you must call setValueCount before a vector can be read
   - you should never write to a vector once it has been read.
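
The lifecycle in the list above can be sketched as a small state machine.
This is plain Python, not the Java ValueVector API: the class
WriteOnceVector and its method names are hypothetical, the variable-width
offset rule is left out, and writes are closed as soon as the value count
is set, which is slightly stricter than "once it has been read".

```python
class WriteOnceVector:
    """Hypothetical vector: write in index order, then read only."""

    def __init__(self, capacity):
        self._values = [None] * capacity   # every slot starts as null
        self._last_written = -1
        self._value_count = None           # unset until set_value_count()

    def set(self, index, value):
        if self._value_count is not None:
            raise RuntimeError("vector is readable now; writes are closed")
        if index <= self._last_written:
            raise ValueError("values must be written in increasing index order")
        self._values[index] = value
        self._last_written = index

    def set_value_count(self, count):
        self._value_count = count

    def get(self, index):
        if self._value_count is None:
            raise RuntimeError("call set_value_count() before reading")
        if not 0 <= index < self._value_count:
            raise IndexError(index)
        return self._values[index]


vec = WriteOnceVector(capacity=6)
for index, value in [(0, 10), (1, 11), (2, 12), (5, 15)]:
    vec.set(index, value)      # indexes 3 and 4 are skipped and stay null
vec.set_value_count(6)
```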


In a perfect world we would have done a better job in the object
hierarchy/behavior of making this explicit but we don't live in that world,
unfortunately. I'll also say that these rules are actually more stringent
than what is technically safe. For example, in one project we would use a
BigIntVector to maintain and update sums when doing hash aggregations
(which includes a read-modify-write on individual cells out of order). That
being said, that's advanced usage and most people should stick with the
guidelines above.
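
For a sense of what that advanced usage looks like, here is a minimal
sketch in plain Python. It is not the code from the project mentioned
above: the function hash_sum is hypothetical, and a standard-library array
of 64-bit integers stands in for a BigIntVector. Running sums are updated
in place, in whatever order the keys arrive.

```python
from array import array


def hash_sum(keys, amounts):
    """Group amounts by key, keeping running sums in one mutable array."""
    slot_of = {}          # key -> slot index in `sums`
    sums = array("q")     # "q" is a signed 64-bit integer
    for key, amount in zip(keys, amounts):
        slot = slot_of.get(key)
        if slot is None:
            slot = len(sums)
            slot_of[key] = slot
            sums.append(0)
        sums[slot] += amount   # read-modify-write on an arbitrary cell
    return slot_of, sums


slot_of, sums = hash_sum(["a", "b", "a", "c", "b"], [1, 2, 3, 4, 5])
result = {key: sums[slot] for key, slot in slot_of.items()}
```

The in-place update avoids allocating a new sums array for every input
row, which is the kind of saving that motivates mutation here.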
