You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/14 09:41:01 UTC

[GitHub] [arrow] amol- commented on a change in pull request #10199: ARROW-12431: [Python] Mask is inverted when creating FixedSizeBinaryArray

amol- commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r632411238



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -2666,6 +2666,38 @@ def test_array_masked():
     assert arr.type == pa.int64()
 
 
+def test_binary_array_masked():
+    # ARROW-12431
+    masked_basic = pa.array([b'\x05'], type=pa.binary(1),
+                            mask=np.array([False]))
+    assert [b'\x05'] == masked_basic.to_pylist()
+
+    # Fixed Length Binary
+    masked = pa.array(np.array([b'\x05']), type=pa.binary(1),
+                      mask=np.array([False]))
+    assert [b'\x05'] == masked.to_pylist()
+
+    masked_nulls = pa.array(np.array([b'\x05']), type=pa.binary(1),
+                            mask=np.array([True]))
+    assert [None] == masked_nulls.to_pylist()
+
+    # Variable Length Binary
+    masked = pa.array(np.array([b'\x05']), type=pa.binary(),
+                      mask=np.array([False]))
+    assert [b'\x05'] == masked.to_pylist()
+
+    masked_nulls = pa.array(np.array([b'\x05']), type=pa.binary(),
+                            mask=np.array([True]))
+    assert [None] == masked_nulls.to_pylist()
+
+    # Fixed Length Binary, copy

Review comment:
       It verifies that the behaviour is the same that we get from variable length binary arrays, which do not reuse the numpy array memory. I don't think it's an implementation detail because it changes the user experience.
   
   The fact that the underlying numpy array is shared or not changes the user experience as it means the user can't modify the original numpy array without indirectly modifying (probably unexpectedly) the Arrow array too.
   
   which lead me to create https://issues.apache.org/jira/browse/ARROW-12666 because in some cases we reuse the numpy memory (all basic types) and in other cases we don't (the string, binary etc... types). The follow up ticket suggests to make that behaviour clear as numpy does by adding a `copy=True/False` argument to the `pyarrow.array`  function. 
   
   We can discuss further what's the best way to go in that dedicated ticket, here I wanted to make sure we were consistent with that happens when `pa.binary()` and `pa.binary(N)` are used.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org