Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/05/17 09:56:00 UTC
[jira] [Commented] (ARROW-12666) [Python] Array construction from numpy array is unclear about zero copy behaviour

    [ https://issues.apache.org/jira/browse/ARROW-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346032#comment-17346032 ] 

Joris Van den Bossche commented on ARROW-12666:
-----------------------------------------------

bq. {{copy=False}} would probably have to throw an exception in some cases where we can't guarantee zero copy, like when building from a Python list

Or {{copy=False}} could make no guarantee that a copy is avoided, and only avoid the copy when that is possible. That's basically the behaviour of the {{copy}} keyword in {{numpy.array(..)}}.
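
A minimal sketch of that "copy only if needed" semantics on the NumPy side. It uses {{np.asarray}}, which has the same best-effort behaviour, rather than the {{copy}} keyword itself, because the meaning of {{copy=False}} in {{numpy.array}} changed in NumPy 2.0 (it now raises when a copy cannot be avoided). The variable names are illustrative only.

```python
import numpy as np

src = np.array([1, 2, 3], dtype=np.int64)

# Same dtype: no conversion is needed, so the result reuses src's memory.
same = np.asarray(src, dtype=np.int64)

# Different dtype: a cast is required, so a new buffer is allocated.
cast = np.asarray(src, dtype=np.int32)

# A Python list has no buffer to share, so a copy is unavoidable,
# but no exception is raised.
from_list = np.asarray([1, 2, 3], dtype=np.int64)

src[1] = 10
print(np.shares_memory(src, same))  # True
print(np.shares_memory(src, cast))  # False
print(same.tolist())                # [1, 10, 3]
print(cast.tolist())                # [1, 2, 3]
```

With this kind of semantics the caller never gets an error, but also cannot tell from the call alone whether the result aliases the input.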

On the general issue, I agree that the current behaviour is not ideal and can be confusing or have surprising effects. But I also think it's not that easy to change. I think a lot of people rely on the zero-copy behaviour to avoid unnecessary copies (e.g. if you convert to Arrow only to write the result directly to a Parquet file, you don't want to make an additional copy).

> [Python] Array construction from numpy array is unclear about zero copy behaviour
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-12666
>                 URL: https://issues.apache.org/jira/browse/ARROW-12666
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 4.0.0
>            Reporter: Alessandro Molina
>            Assignee: Alessandro Molina
>            Priority: Major
>
> When building an Arrow array from a numpy array, it's very confusing from the user's point of view that the result is not always a new array.
> Under the hood, Arrow sometimes reuses the memory if no casting is needed:
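> {code:python}
> import numpy as np
> import pyarrow as pa
>
> # The default integer dtype is int64 on most platforms, so no cast is needed
> npa = np.array([1, 2, 3] * 3)
> arrow_array = pa.array(npa, type=pa.int64())
> npa[npa == 2] = 10
> print(arrow_array.to_pylist())
> # Prints: [1, 10, 3, 1, 10, 3, 1, 10, 3]
> {code}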
> and sometimes doesn't if a cast is involved
> {code:python}
> npa = np.array([1, 2, 3]*3)
> arrow_array = pa.array(npa, type=pa.int32())
> npa[npa == 2] = 10
> print(arrow_array.to_pylist())
> # Prints: [1, 2, 3, 1, 2, 3, 1, 2, 3]
> {code}
> For non-primitive types, on the other hand, it always copies:
> {code:python}
> npa = np.array(["a", "b", "c"]*3)
> arrow_array = pa.array(npa, type=pa.string())
> npa[npa == "b"] = "X"
> print(arrow_array.to_pylist())
> # Prints: ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c']
> # Different from numpy array that was modified
> {code}
> This behaviour demands a lot of attention from the user and an understanding of what's going on under the hood, which makes pyarrow hard to use.
> A {{copy=True/False}} keyword should be added to {{pa.array}}, and the default should probably be {{copy=True}} so that by default you can always create an Arrow array from a numpy one (as {{copy=False}} would probably have to throw an exception in some cases where we can't guarantee zero copy, like when building from a Python list).


