Posted to jira@arrow.apache.org by "Alessandro Molina (Jira)" <ji...@apache.org> on 2021/05/06 09:32:00 UTC

[jira] [Created] (ARROW-12666) [Python] Array construction from numpy array is unclear about zero copy behaviour

Alessandro Molina created ARROW-12666:
-----------------------------------------

             Summary: [Python] Array construction from numpy array is unclear about zero copy behaviour
                 Key: ARROW-12666
                 URL: https://issues.apache.org/jira/browse/ARROW-12666
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
    Affects Versions: 4.0.0
            Reporter: Alessandro Molina


When building an Arrow array from a numpy array, it is very confusing from the user's point of view that the result is not always a new, independent array: depending on the types involved, it may share memory with the numpy array.

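Under the hood, Arrow sometimes reuses the numpy memory when no cast is needed:
{code:python}
import numpy as np
import pyarrow as pa

# The default integer dtype is int64 on most 64-bit platforms
npa = np.array([1, 2, 3]*3)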
arrow_array = pa.array(npa, type=pa.int64())
npa[npa == 2] = 10
print(arrow_array.to_pylist())
# Prints: [1, 10, 3, 1, 10, 3, 1, 10, 3]
{code}

and sometimes copies the data, when a cast is involved:
{code:python}
npa = np.array([1, 2, 3]*3)
arrow_array = pa.array(npa, type=pa.int32())
npa[npa == 2] = 10
print(arrow_array.to_pylist())
# Prints: [1, 2, 3, 1, 2, 3, 1, 2, 3]
{code}

For non-primitive types, instead, it always copies:
{code:python}
npa = np.array(["a", "b", "c"]*3)
arrow_array = pa.array(npa, type=pa.string())
npa[npa == "b"] = "X"
print(arrow_array.to_pylist())
# Prints: ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c']
# Different from numpy array that was modified
{code}

To use this safely, the user has to pay close attention and understand what is going on under the hood, which makes pyarrow hard to use.

A {{copy=True/False}} argument should be added to {{pa.array}}, and the default value should probably be {{copy=True}}, so that by default you can always create an Arrow array out of a numpy one ({{copy=False}} would probably have to throw an exception in the cases where we can't guarantee zero copy).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)