You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/29 16:26:01 UTC

[GitHub] [arrow] amol- opened a new pull request #10199: ARROW-12431: [C++] Mask is inverted when creating FixedSizeBinaryArray

amol- opened a new pull request #10199:
URL: https://github.com/apache/arrow/pull/10199


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] amol- commented on pull request #10199: ARROW-12431: [Python] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
amol- commented on pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#issuecomment-845943151


   @pitrou @jorisvandenbossche did you have a chance to have a final pass?
   
   Given that the solution is comparable to what we are already doing for variable length arrays, it doesn't seem to introduce new issues and is isolated enough, I think it could make sense to ship a fix to contain the bug while we work on eventual performance improvements and the other two related issues.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10199: ARROW-12431: [C++] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r625115182



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -2666,6 +2666,31 @@ def test_array_masked():
     assert arr.type == pa.int64()
 
 
+def test_binary_array_masked():
+    # ARROW-12431
+    masked_basic = pa.array([b'\x05'], type=pa.binary(1),
+                            mask=np.array([False]))
+    assert [b'\x05'] == masked_basic.to_pylist()
+
+    # Fixed Length Binary
+    masked = pa.array(np.array([b'\x05']), type=pa.binary(1),
+                      mask=np.array([False]))
+    assert [b'\x05'] == masked.to_pylist()
+
+    masked_nulls = pa.array(np.array([b'\x05']), type=pa.binary(1),
+                            mask=np.array([True]))
+    assert [None] == masked_nulls.to_pylist()
+
+    # Variable Length Binary
+    masked = pa.array(np.array([b'\x05']), type=pa.binary(-1),

Review comment:
       Well, `pa.binary()` is the official way to get a variable length binary. `binary(-1)` is an implementation detail.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] amol- commented on a change in pull request #10199: ARROW-12431: [C++] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r625177654



##########
File path: cpp/src/arrow/python/numpy_to_arrow.cc
##########
@@ -594,7 +594,13 @@ Status NumPyConverter::Visit(const FixedSizeBinaryType& type) {
 
   if (mask_ != nullptr) {
     Ndarray1DIndexer<uint8_t> mask_values(mask_);
-    RETURN_NOT_OK(builder.AppendValues(data, length_, mask_values.data()));
+    std::unique_ptr<uint8_t[]> inverted_mask(new uint8_t[length_]);
+    for (int64_t i = 0; i < length_; ++i) {
+      inverted_mask[i] = !mask_values[i];
+    }

Review comment:
       The version for variable length binaries ( `NumPyConverter::Visit(const BinaryType& type)` ) does that same work simply iterating over the data and invoking `AppendNull` / `Append` depending on the value of the mask.  Would that approach be ok? That would avoid the need to allocate the inverted mask




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10199: ARROW-12431: [Python] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r630955362



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -2666,6 +2666,38 @@ def test_array_masked():
     assert arr.type == pa.int64()
 
 
+def test_binary_array_masked():
+    # ARROW-12431
+    masked_basic = pa.array([b'\x05'], type=pa.binary(1),
+                            mask=np.array([False]))
+    assert [b'\x05'] == masked_basic.to_pylist()
+
+    # Fixed Length Binary
+    masked = pa.array(np.array([b'\x05']), type=pa.binary(1),
+                      mask=np.array([False]))
+    assert [b'\x05'] == masked.to_pylist()
+
+    masked_nulls = pa.array(np.array([b'\x05']), type=pa.binary(1),
+                            mask=np.array([True]))
+    assert [None] == masked_nulls.to_pylist()
+
+    # Variable Length Binary
+    masked = pa.array(np.array([b'\x05']), type=pa.binary(),
+                      mask=np.array([False]))
+    assert [b'\x05'] == masked.to_pylist()
+
+    masked_nulls = pa.array(np.array([b'\x05']), type=pa.binary(),
+                            mask=np.array([True]))
+    assert [None] == masked_nulls.to_pylist()
+
+    # Fixed Length Binary, copy

Review comment:
       I don't understand what this is supposed to test. The fact that a copy is made is just an implementation detail.

##########
File path: cpp/src/arrow/python/numpy_to_arrow.cc
##########
@@ -594,7 +594,15 @@ Status NumPyConverter::Visit(const FixedSizeBinaryType& type) {
 
   if (mask_ != nullptr) {
     Ndarray1DIndexer<uint8_t> mask_values(mask_);
-    RETURN_NOT_OK(builder.AppendValues(data, length_, mask_values.data()));
+    RETURN_NOT_OK(builder.Reserve(length_));
+    for (int64_t i = 0; i < length_; ++i) {
+      if (mask_values[i]) {
+        RETURN_NOT_OK(builder.AppendNull());
+      } else {
+        RETURN_NOT_OK(builder.Append(data));
+      }
+      data += stride_;

Review comment:
       Does this solve the strided conversion case? If so, perhaps you can add a test for it?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] amol- commented on a change in pull request #10199: ARROW-12431: [C++] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r623836273



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -2666,6 +2666,31 @@ def test_array_masked():
     assert arr.type == pa.int64()
 
 
+def test_binary_array_masked():
+    # ARROW-12431
+    masked_basic = pa.array([b'\x05'], type=pa.binary(1),
+                            mask=np.array([False]))
+    assert pa.array([b'\x05']).to_pylist() == masked_basic.to_pylist()

Review comment:
       👍 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] amol- commented on a change in pull request #10199: ARROW-12431: [C++] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r625640867



##########
File path: cpp/src/arrow/python/numpy_to_arrow.cc
##########
@@ -594,7 +594,13 @@ Status NumPyConverter::Visit(const FixedSizeBinaryType& type) {
 
   if (mask_ != nullptr) {
     Ndarray1DIndexer<uint8_t> mask_values(mask_);
-    RETURN_NOT_OK(builder.AppendValues(data, length_, mask_values.data()));
+    std::unique_ptr<uint8_t[]> inverted_mask(new uint8_t[length_]);
+    for (int64_t i = 0; i < length_; ++i) {
+      inverted_mask[i] = !mask_values[i];
+    }

Review comment:
       I'm wondering, given that this seems a fairly reasonable need, if the ArrayBuilder itself shouldn't have a specialised version of `AppendValues` that accepts a mask. Maybe `AppendValuesWithMask`. 
   
   It seems confusing by the way that the builder interface would have multiple different "definitions" of how to edit the bitmap, but in the end we already have such dichotomy in place as the ``Array`` **mask** is ``True for NULL`` while the **bitmap** is ``0 for NULL`` 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10199: ARROW-12431: [C++] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r625179220



##########
File path: cpp/src/arrow/python/numpy_to_arrow.cc
##########
@@ -594,7 +594,13 @@ Status NumPyConverter::Visit(const FixedSizeBinaryType& type) {
 
   if (mask_ != nullptr) {
     Ndarray1DIndexer<uint8_t> mask_values(mask_);
-    RETURN_NOT_OK(builder.AppendValues(data, length_, mask_values.data()));
+    std::unique_ptr<uint8_t[]> inverted_mask(new uint8_t[length_]);
+    for (int64_t i = 0; i < length_; ++i) {
+      inverted_mask[i] = !mask_values[i];
+    }

Review comment:
       Yes, that would work. Note you can also call `builder.Reserve` to preallocate the required data size.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] amol- commented on pull request #10199: ARROW-12431: [C++] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
amol- commented on pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#issuecomment-832776765


   > Not necessarily related to this PR (just noticed while looking at the code and testing a few things), but the conversion for fixed size binary is currently (on master) incorrect for the strided case:
   > 
   
   Good that you pointed that out, because I just added a test for that case and it fails (differently) also with my changes in place. Throws a
   ```
   NumPy type not implemented: unrecognized type (18) in GetNumPyTypeName
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] amol- commented on a change in pull request #10199: ARROW-12431: [Python] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r651543978



##########
File path: cpp/src/arrow/python/numpy_to_arrow.cc
##########
@@ -594,7 +594,15 @@ Status NumPyConverter::Visit(const FixedSizeBinaryType& type) {
 
   if (mask_ != nullptr) {
     Ndarray1DIndexer<uint8_t> mask_values(mask_);
-    RETURN_NOT_OK(builder.AppendValues(data, length_, mask_values.data()));
+    RETURN_NOT_OK(builder.Reserve(length_));
+    for (int64_t i = 0; i < length_; ++i) {
+      if (mask_values[i]) {
+        RETURN_NOT_OK(builder.AppendNull());
+      } else {
+        RETURN_NOT_OK(builder.Append(data));
+      }
+      data += stride_;

Review comment:
       Also added tests and fix for strided binary arrays (with and without mask)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] amol- commented on a change in pull request #10199: ARROW-12431: [C++] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r626464715



##########
File path: cpp/src/arrow/python/numpy_to_arrow.cc
##########
@@ -594,7 +594,13 @@ Status NumPyConverter::Visit(const FixedSizeBinaryType& type) {
 
   if (mask_ != nullptr) {
     Ndarray1DIndexer<uint8_t> mask_values(mask_);
-    RETURN_NOT_OK(builder.AppendValues(data, length_, mask_values.data()));
+    std::unique_ptr<uint8_t[]> inverted_mask(new uint8_t[length_]);
+    for (int64_t i = 0; i < length_; ++i) {
+      inverted_mask[i] = !mask_values[i];
+    }

Review comment:
       Ok, the implementation provided by https://github.com/apache/arrow/pull/10199/commits/b3e0a37049dffdcac63b5ce446907be60523e651 should do the trick.
   
   I also verified that the `PrepareInputData` does a zero copy and `ArrayData::Make` does move the buffer.
   Which is good because it should be far faster than previous solution, but raises questions about the memory ownership. If the resulting Arrow Array is built reusing the `NumPyBuffer`, how is guaranteed that the data for the Arrow Array is around even if the numpy array is gone?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #10199: ARROW-12431: [Python] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #10199:
URL: https://github.com/apache/arrow/pull/10199


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10199: ARROW-12431: [C++] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r625650105



##########
File path: cpp/src/arrow/python/numpy_to_arrow.cc
##########
@@ -594,7 +594,13 @@ Status NumPyConverter::Visit(const FixedSizeBinaryType& type) {
 
   if (mask_ != nullptr) {
     Ndarray1DIndexer<uint8_t> mask_values(mask_);
-    RETURN_NOT_OK(builder.AppendValues(data, length_, mask_values.data()));
+    std::unique_ptr<uint8_t[]> inverted_mask(new uint8_t[length_]);
+    for (int64_t i = 0; i < length_; ++i) {
+      inverted_mask[i] = !mask_values[i];
+    }

Review comment:
       The `mask` keyword is useful if you have numpy-style masked arrays (having a boolean numpy array as the mask). We use this keyword in pandas.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] amol- commented on a change in pull request #10199: ARROW-12431: [C++] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r626365942



##########
File path: cpp/src/arrow/python/numpy_to_arrow.cc
##########
@@ -594,7 +594,13 @@ Status NumPyConverter::Visit(const FixedSizeBinaryType& type) {
 
   if (mask_ != nullptr) {
     Ndarray1DIndexer<uint8_t> mask_values(mask_);
-    RETURN_NOT_OK(builder.AppendValues(data, length_, mask_values.data()));
+    std::unique_ptr<uint8_t[]> inverted_mask(new uint8_t[length_]);
+    for (int64_t i = 0; i < length_; ++i) {
+      inverted_mask[i] = !mask_values[i];
+    }

Review comment:
       I didn't benchmark the code, but notice that we were iterating over a mask of the same length and this is the same thing we are also doing for strings and varlen binary
   
   https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/numpy_to_arrow.cc#L561-L567
   
   I do see btw that ``ArrayData::Make`` is able to do a zero copy from numpy data under certain conditions, so it will surely be a lot faster if we end into that condition.  In the other case, it seems that we end into ``CopyStrided*`` which does iterate over the data like now and thus the performance benefit would probably be negligible if any.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10199: ARROW-12431: [C++] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#issuecomment-829462317


   https://issues.apache.org/jira/browse/ARROW-12431


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10199: ARROW-12431: [C++] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r626335393



##########
File path: cpp/src/arrow/python/numpy_to_arrow.cc
##########
@@ -594,7 +594,13 @@ Status NumPyConverter::Visit(const FixedSizeBinaryType& type) {
 
   if (mask_ != nullptr) {
     Ndarray1DIndexer<uint8_t> mask_values(mask_);
-    RETURN_NOT_OK(builder.AppendValues(data, length_, mask_values.data()));
+    std::unique_ptr<uint8_t[]> inverted_mask(new uint8_t[length_]);
+    for (int64_t i = 0; i < length_; ++i) {
+      inverted_mask[i] = !mask_values[i];
+    }

Review comment:
       With this change, we are now looping over the values, and appending iteratively. Would that have performance implications?
   
   Since this is for a fixed size/strided array, an alternative could be to create the ArrayData directly (as mentioned by Antoine), which would be similar to VisitNative I suppose?
   
   https://github.com/apache/arrow/blob/3a9aea3dd0e2a2b1c68bd9cfc141d83adb9fbc35/cpp/src/arrow/python/numpy_to_arrow.cc#L276-L280
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] amol- commented on a change in pull request #10199: ARROW-12431: [C++] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r626365942



##########
File path: cpp/src/arrow/python/numpy_to_arrow.cc
##########
@@ -594,7 +594,13 @@ Status NumPyConverter::Visit(const FixedSizeBinaryType& type) {
 
   if (mask_ != nullptr) {
     Ndarray1DIndexer<uint8_t> mask_values(mask_);
-    RETURN_NOT_OK(builder.AppendValues(data, length_, mask_values.data()));
+    std::unique_ptr<uint8_t[]> inverted_mask(new uint8_t[length_]);
+    for (int64_t i = 0; i < length_; ++i) {
+      inverted_mask[i] = !mask_values[i];
+    }

Review comment:
       I didn't benchmark the code, but notice that we were iterating over a mask of the same length and this is the same thing we are also doing for strings and varlen binary
   
   https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/numpy_to_arrow.cc#L561-L567
   
   I do see btw that ``ArrayData::Make`` is able to do a zero copy from numpy data under certain conditions, so it will surely be a lot faster if we end into that condition.  In the other case, it seems that we end into ``CopyStrided*`` which does iterate over the data like now and thus the performance benefit would probably be negligible if any.
   
   https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/numpy_to_arrow.cc#L366-L384




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10199: ARROW-12431: [C++] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r625052940



##########
File path: cpp/src/arrow/python/numpy_to_arrow.cc
##########
@@ -594,7 +594,13 @@ Status NumPyConverter::Visit(const FixedSizeBinaryType& type) {
 
   if (mask_ != nullptr) {
     Ndarray1DIndexer<uint8_t> mask_values(mask_);
-    RETURN_NOT_OK(builder.AppendValues(data, length_, mask_values.data()));
+    std::unique_ptr<uint8_t[]> inverted_mask(new uint8_t[length_]);
+    for (int64_t i = 0; i < length_; ++i) {
+      inverted_mask[i] = !mask_values[i];
+    }

Review comment:
       This is doing a spurious additional allocation and memory copy. You could reuse `MaskToBitmap` instead.

##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -2666,6 +2666,31 @@ def test_array_masked():
     assert arr.type == pa.int64()
 
 
+def test_binary_array_masked():
+    # ARROW-12431
+    masked_basic = pa.array([b'\x05'], type=pa.binary(1),
+                            mask=np.array([False]))
+    assert [b'\x05'] == masked_basic.to_pylist()
+
+    # Fixed Length Binary
+    masked = pa.array(np.array([b'\x05']), type=pa.binary(1),
+                      mask=np.array([False]))
+    assert [b'\x05'] == masked.to_pylist()
+
+    masked_nulls = pa.array(np.array([b'\x05']), type=pa.binary(1),
+                            mask=np.array([True]))
+    assert [None] == masked_nulls.to_pylist()
+
+    # Variable Length Binary
+    masked = pa.array(np.array([b'\x05']), type=pa.binary(-1),

Review comment:
       This should be `pa.binary()`, not `pa.binary(-1)`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10199: ARROW-12431: [C++] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r625647786



##########
File path: cpp/src/arrow/python/numpy_to_arrow.cc
##########
@@ -594,7 +594,13 @@ Status NumPyConverter::Visit(const FixedSizeBinaryType& type) {
 
   if (mask_ != nullptr) {
     Ndarray1DIndexer<uint8_t> mask_values(mask_);
-    RETURN_NOT_OK(builder.AppendValues(data, length_, mask_values.data()));
+    std::unique_ptr<uint8_t[]> inverted_mask(new uint8_t[length_]);
+    for (int64_t i = 0; i < length_; ++i) {
+      inverted_mask[i] = !mask_values[i];
+    }

Review comment:
       Well, the only place where this is needed is the PyArrow API, and I don't even know why we have this `mask` argument (with the confusing convention that `true` means "null", which is exactly the reverse of Arrow validity bitmaps).
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] amol- commented on a change in pull request #10199: ARROW-12431: [C++] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r625640867



##########
File path: cpp/src/arrow/python/numpy_to_arrow.cc
##########
@@ -594,7 +594,13 @@ Status NumPyConverter::Visit(const FixedSizeBinaryType& type) {
 
   if (mask_ != nullptr) {
     Ndarray1DIndexer<uint8_t> mask_values(mask_);
-    RETURN_NOT_OK(builder.AppendValues(data, length_, mask_values.data()));
+    std::unique_ptr<uint8_t[]> inverted_mask(new uint8_t[length_]);
+    for (int64_t i = 0; i < length_; ++i) {
+      inverted_mask[i] = !mask_values[i];
+    }

Review comment:
       I'm wondering, given that this seems a fairly reasonable need, if the ArrayBuilder itself shouldn't have a specialised version of `AppendValues` that accepts a mask. Maybe `AppendValuesWithMask`. 
   
   It seems confusing, on the other side, that the builder interface would have multiple different "definitions" of how to edit the bitmap, but in the end we already have such dichotomy in place as the ``Array`` **mask** is ``True for NULL`` while the **bitmap** is ``0 for NULL`` 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10199: ARROW-12431: [C++] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r623819764



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -2666,6 +2666,31 @@ def test_array_masked():
     assert arr.type == pa.int64()
 
 
+def test_binary_array_masked():
+    # ARROW-12431
+    masked_basic = pa.array([b'\x05'], type=pa.binary(1),
+                            mask=np.array([False]))
+    assert pa.array([b'\x05']).to_pylist() == masked_basic.to_pylist()

Review comment:
       ```suggestion
       assert [b'\x05'] == masked_basic.to_pylist()
   ```
   
   I would compare it here to the known result, since it's a simple result (and to not rely on another conversion path)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] amol- commented on a change in pull request #10199: ARROW-12431: [C++] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r627241356



##########
File path: cpp/src/arrow/python/numpy_to_arrow.cc
##########
@@ -594,7 +594,13 @@ Status NumPyConverter::Visit(const FixedSizeBinaryType& type) {
 
   if (mask_ != nullptr) {
     Ndarray1DIndexer<uint8_t> mask_values(mask_);
-    RETURN_NOT_OK(builder.AppendValues(data, length_, mask_values.data()));
+    std::unique_ptr<uint8_t[]> inverted_mask(new uint8_t[length_]);
+    for (int64_t i = 0; i < length_; ++i) {
+      inverted_mask[i] = !mask_values[i];
+    }

Review comment:
       Confirmed that this new implementation is fully sharing memory with numpy. 
   
   ```
           npa = np.array([b'aaa', b'bbb', b'ccc']*100)
           arrow_array = pa.array(npa, type=pa.binary(3),
                                  mask=np.array([False, False, False]*100))
           npa[npa == b"bbb"] = b"XXX"
   >       assert ([b'aaa', b'bbb', b'ccc']*100) == arrow_array.to_pylist()
   E       AssertionError: assert [b'aaa', b'bb..., b'ccc', ...] == [b'aaa', b'XX..., b'ccc', ...]
   E         At index 1 diff: b'bbb' != b'XXX'
   ```
   
   Not sure that's what we want to do. It seems that when doing `pa.array(np.array)` it's currently pretty random if we end up sharing memory or not. 
   
   For example 
   ```
   npa = np.array([1, 2, 3]*3)
   arrow_array = pa.array(npa, type=pa.int64())
   ```
   shares memory, while
   ``` 
   npa = np.array([1, 2, 3]*3)
   arrow_array = pa.array(npa, type=pa.int32())
   ``` 
   doesn't.
   
   That should probably be consolidated.
   
   Thinking about it, I don't think we should be sharing memory by default when building a new array, as the user expectation is probably that they are building something new.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] amol- commented on a change in pull request #10199: ARROW-12431: [Python] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r632412367



##########
File path: cpp/src/arrow/python/numpy_to_arrow.cc
##########
@@ -594,7 +594,15 @@ Status NumPyConverter::Visit(const FixedSizeBinaryType& type) {
 
   if (mask_ != nullptr) {
     Ndarray1DIndexer<uint8_t> mask_values(mask_);
-    RETURN_NOT_OK(builder.AppendValues(data, length_, mask_values.data()));
+    RETURN_NOT_OK(builder.Reserve(length_));
+    for (int64_t i = 0; i < length_; ++i) {
+      if (mask_values[i]) {
+        RETURN_NOT_OK(builder.AppendNull());
+      } else {
+        RETURN_NOT_OK(builder.Append(data));
+      }
+      data += stride_;

Review comment:
       Sadly not, I expected it would, but I wrote some tests and it wasn't sufficient. That's why I made https://issues.apache.org/jira/browse/ARROW-12667 as a follow up issue. So that I can test it for all various types and make sure it works in all cases.

##########
File path: cpp/src/arrow/python/numpy_to_arrow.cc
##########
@@ -594,7 +594,15 @@ Status NumPyConverter::Visit(const FixedSizeBinaryType& type) {
 
   if (mask_ != nullptr) {
     Ndarray1DIndexer<uint8_t> mask_values(mask_);
-    RETURN_NOT_OK(builder.AppendValues(data, length_, mask_values.data()));
+    RETURN_NOT_OK(builder.Reserve(length_));
+    for (int64_t i = 0; i < length_; ++i) {
+      if (mask_values[i]) {
+        RETURN_NOT_OK(builder.AppendNull());
+      } else {
+        RETURN_NOT_OK(builder.Append(data));
+      }
+      data += stride_;

Review comment:
       Sadly not, I expected it would, but I wrote some tests and it wasn't enough. That's why I made https://issues.apache.org/jira/browse/ARROW-12667 as a follow up issue. So that I can test it for all various types and make sure it works in all cases.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] amol- commented on a change in pull request #10199: ARROW-12431: [C++] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r627241356



##########
File path: cpp/src/arrow/python/numpy_to_arrow.cc
##########
@@ -594,7 +594,13 @@ Status NumPyConverter::Visit(const FixedSizeBinaryType& type) {
 
   if (mask_ != nullptr) {
     Ndarray1DIndexer<uint8_t> mask_values(mask_);
-    RETURN_NOT_OK(builder.AppendValues(data, length_, mask_values.data()));
+    std::unique_ptr<uint8_t[]> inverted_mask(new uint8_t[length_]);
+    for (int64_t i = 0; i < length_; ++i) {
+      inverted_mask[i] = !mask_values[i];
+    }

Review comment:
       Confirmed that this new implementation is fully sharing memory with numpy. 
   
   ```
           npa = np.array([b'aaa', b'bbb', b'ccc']*100)
           arrow_array = pa.array(npa, type=pa.binary(3),
                                  mask=np.array([False, False, False]*100))
           npa[npa == b"bbb"] = b"XXX"
   >       assert ([b'aaa', b'bbb', b'ccc']*100) == arrow_array.to_pylist()
   E       AssertionError: assert [b'aaa', b'bb..., b'ccc', ...] == [b'aaa', b'XX..., b'ccc', ...]
   E         At index 1 diff: b'bbb' != b'XXX'
   ```
   
   Not sure that's what we want to do. It seems that when doing `pa.array(np.array)` it's currently pretty random if we end up sharing memory or not. 
   
   For example 
   ```
   npa = np.array([1, 2, 3]*3)
   arrow_array = pa.array(npa, type=pa.int64())
   ```
   shares memory, while
   ``` 
   npa = np.array([1, 2, 3]*3)
   arrow_array = pa.array(npa, type=pa.int32())
   ```
   doesn't.
   
   That should probably be consolidated.
   
   Thinking about it, I don't think we should be sharing memory by default when building a new array, as the user expectation is probably that they are building something new.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] amol- commented on a change in pull request #10199: ARROW-12431: [C++] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r625114336



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -2666,6 +2666,31 @@ def test_array_masked():
     assert arr.type == pa.int64()
 
 
+def test_binary_array_masked():
+    # ARROW-12431
+    masked_basic = pa.array([b'\x05'], type=pa.binary(1),
+                            mask=np.array([False]))
+    assert [b'\x05'] == masked_basic.to_pylist()
+
+    # Fixed Length Binary
+    masked = pa.array(np.array([b'\x05']), type=pa.binary(1),
+                      mask=np.array([False]))
+    assert [b'\x05'] == masked.to_pylist()
+
+    masked_nulls = pa.array(np.array([b'\x05']), type=pa.binary(1),
+                            mask=np.array([True]))
+    assert [None] == masked_nulls.to_pylist()
+
+    # Variable Length Binary
+    masked = pa.array(np.array([b'\x05']), type=pa.binary(-1),

Review comment:
       I thought that it was better to make it explicit that we were testing against a variable length binary, but given that ``-1``  is the default anyway I guess it doesn't make much difference.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] amol- commented on a change in pull request #10199: ARROW-12431: [Python] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r632411238



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -2666,6 +2666,38 @@ def test_array_masked():
     assert arr.type == pa.int64()
 
 
+def test_binary_array_masked():
+    # ARROW-12431
+    masked_basic = pa.array([b'\x05'], type=pa.binary(1),
+                            mask=np.array([False]))
+    assert [b'\x05'] == masked_basic.to_pylist()
+
+    # Fixed Length Binary
+    masked = pa.array(np.array([b'\x05']), type=pa.binary(1),
+                      mask=np.array([False]))
+    assert [b'\x05'] == masked.to_pylist()
+
+    masked_nulls = pa.array(np.array([b'\x05']), type=pa.binary(1),
+                            mask=np.array([True]))
+    assert [None] == masked_nulls.to_pylist()
+
+    # Variable Length Binary
+    masked = pa.array(np.array([b'\x05']), type=pa.binary(),
+                      mask=np.array([False]))
+    assert [b'\x05'] == masked.to_pylist()
+
+    masked_nulls = pa.array(np.array([b'\x05']), type=pa.binary(),
+                            mask=np.array([True]))
+    assert [None] == masked_nulls.to_pylist()
+
+    # Fixed Length Binary, copy

Review comment:
       It verifies that the behaviour is the same that we get from variable length binary arrays, which do not reuse the numpy array memory. I don't think it's an implementation detail because it changes the user experience.
   
   The fact that the underlying numpy array is shared or not changes the user experience as it means the user can't modify the original numpy array without indirectly modifying (probably unexpectedly) the Arrow array too.
   
   which lead me to create https://issues.apache.org/jira/browse/ARROW-12666 because in some cases we reuse the numpy memory (all basic types) and in other cases we don't (the string, binary etc... types). The follow up ticket suggests to make that behaviour clear as numpy does by adding a `copy=True/False` argument to the `pyarrow.array`  function. 
   
   We can discuss further what's the best way to go in that dedicated ticket, here I wanted to make sure we were consistent with that happens when `pa.binary()` and `pa.binary(N)` are used.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] amol- commented on a change in pull request #10199: ARROW-12431: [Python] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r632411238



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -2666,6 +2666,38 @@ def test_array_masked():
     assert arr.type == pa.int64()
 
 
+def test_binary_array_masked():
+    # ARROW-12431
+    masked_basic = pa.array([b'\x05'], type=pa.binary(1),
+                            mask=np.array([False]))
+    assert [b'\x05'] == masked_basic.to_pylist()
+
+    # Fixed Length Binary
+    masked = pa.array(np.array([b'\x05']), type=pa.binary(1),
+                      mask=np.array([False]))
+    assert [b'\x05'] == masked.to_pylist()
+
+    masked_nulls = pa.array(np.array([b'\x05']), type=pa.binary(1),
+                            mask=np.array([True]))
+    assert [None] == masked_nulls.to_pylist()
+
+    # Variable Length Binary
+    masked = pa.array(np.array([b'\x05']), type=pa.binary(),
+                      mask=np.array([False]))
+    assert [b'\x05'] == masked.to_pylist()
+
+    masked_nulls = pa.array(np.array([b'\x05']), type=pa.binary(),
+                            mask=np.array([True]))
+    assert [None] == masked_nulls.to_pylist()
+
+    # Fixed Length Binary, copy

Review comment:
       It verifies that the behaviour is the same that we get from variable length binary arrays, which do not reuse the numpy array memory. I don't think it's an implementation detail because it changes the user experience.
   
   The fact that the underlying numpy array is shared or not changes the user experience as it means the user can't modify the original numpy array without indirectly modifying (probably unexpectedly) the Arrow array too.
   
   which lead me to create https://issues.apache.org/jira/browse/ARROW-12666 because in some cases we reuse the numpy memory (all basic types) and in other cases we don't (the string, binary etc... types). The follow up ticket suggests to make that behaviour clear as numpy does by adding a `copy=True/False` argument to the `pyarrow.array`  function. 
   
   We can discuss further what's the best way to go in that dedicated ticket, here I wanted to make sure we were consistent with that happens when `pa.binary()` and `pa.binary(N)` are used. So the test verifies that if you modify the numpy array the arraw array doesn't change.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] amol- commented on a change in pull request #10199: ARROW-12431: [C++] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r626464715



##########
File path: cpp/src/arrow/python/numpy_to_arrow.cc
##########
@@ -594,7 +594,13 @@ Status NumPyConverter::Visit(const FixedSizeBinaryType& type) {
 
   if (mask_ != nullptr) {
     Ndarray1DIndexer<uint8_t> mask_values(mask_);
-    RETURN_NOT_OK(builder.AppendValues(data, length_, mask_values.data()));
+    std::unique_ptr<uint8_t[]> inverted_mask(new uint8_t[length_]);
+    for (int64_t i = 0; i < length_; ++i) {
+      inverted_mask[i] = !mask_values[i];
+    }

Review comment:
       Ok, the implementation provided by https://github.com/apache/arrow/pull/10199/commits/b3e0a37049dffdcac63b5ce446907be60523e651 should do the trick.
   
   I also verified that the `PrepareInputData` does a zero copy and `ArrayData::Make` does move the buffer.
   Which is good because it should be far faster than previous solution, but raises questions about the memory ownership.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10199: ARROW-12431: [C++] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r625056720



##########
File path: cpp/src/arrow/python/numpy_to_arrow.cc
##########
@@ -594,7 +594,13 @@ Status NumPyConverter::Visit(const FixedSizeBinaryType& type) {
 
   if (mask_ != nullptr) {
     Ndarray1DIndexer<uint8_t> mask_values(mask_);
-    RETURN_NOT_OK(builder.AppendValues(data, length_, mask_values.data()));
+    std::unique_ptr<uint8_t[]> inverted_mask(new uint8_t[length_]);
+    for (int64_t i = 0; i < length_; ++i) {
+      inverted_mask[i] = !mask_values[i];
+    }

Review comment:
       The issue is that `AppendValues` needs a byte mask, not bit mask




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] amol- commented on a change in pull request #10199: ARROW-12431: [C++] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r627288345



##########
File path: cpp/src/arrow/python/numpy_to_arrow.cc
##########
@@ -594,7 +594,13 @@ Status NumPyConverter::Visit(const FixedSizeBinaryType& type) {
 
   if (mask_ != nullptr) {
     Ndarray1DIndexer<uint8_t> mask_values(mask_);
-    RETURN_NOT_OK(builder.AppendValues(data, length_, mask_values.data()));
+    std::unique_ptr<uint8_t[]> inverted_mask(new uint8_t[length_]);
+    for (int64_t i = 0; i < length_; ++i) {
+      inverted_mask[i] = !mask_values[i];
+    }

Review comment:
       Created https://issues.apache.org/jira/browse/ARROW-12667 and https://issues.apache.org/jira/browse/ARROW-12666 as follow-up of this issue. Reverted the implementation to the previous one which was in line with the behaviour exposed for VariableLengthBinary and didn't have the memory copy issue. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #10199: ARROW-12431: [C++] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#issuecomment-832645495


   Not necessarily related to this PR (just noticed while looking at the code and testing a few things), but the conversion for fixed size binary is currently (on master) incorrect for the strided case:
   
   ```
   In [38]: arr = np.array([b"ab", b"cd", b"ef"])
   
   In [39]: arr[::2]
   Out[39]: array([b'ab', b'ef'], dtype='|S2')
   
   In [41]: pa.array(arr[::2], pa.binary(2)).to_pylist()
   Out[41]: [b'ab', b'cd']
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10199: ARROW-12431: [C++] Mask is inverted when creating FixedSizeBinaryArray

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10199:
URL: https://github.com/apache/arrow/pull/10199#discussion_r625057919



##########
File path: cpp/src/arrow/python/numpy_to_arrow.cc
##########
@@ -594,7 +594,13 @@ Status NumPyConverter::Visit(const FixedSizeBinaryType& type) {
 
   if (mask_ != nullptr) {
     Ndarray1DIndexer<uint8_t> mask_values(mask_);
-    RETURN_NOT_OK(builder.AppendValues(data, length_, mask_values.data()));
+    std::unique_ptr<uint8_t[]> inverted_mask(new uint8_t[length_]);
+    for (int64_t i = 0; i < length_; ++i) {
+      inverted_mask[i] = !mask_values[i];
+    }

Review comment:
       Right. That said, we can also eschew `AppendValues` and construct the `ArrayData` directly.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org