You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "lukemanley (via GitHub)" <gi...@apache.org> on 2023/03/19 15:16:06 UTC

[GitHub] [arrow] lukemanley opened a new issue, #34634: pc.replace_with_mask produces invalid return when array is boolean ChunkedArray

lukemanley opened a new issue, #34634:
URL: https://github.com/apache/arrow/issues/34634

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Passing a boolean `ChunkedArray` to `pc.replace_with_mask` seems to produces an invalid result. Replacing the `ChunkedArray` with a regular `Array` seems to work fine. Other types seem to work fine with `ChunkedArray`. The invalid result appears that it may be specific to the boolean type. 
   
   ```
   import pyarrow as pa
   import pyarrow.compute as pc
   
   arr = pa.chunked_array([[True, True]])
   mask = pa.array([False, True])
   replacements = pa.array([False, False])
   
   pc.replace_with_mask(arr, mask, replacements)
   ```
   
   output:
   ```
   <pyarrow.lib.ChunkedArray object at 0x7fb57a7a6900>
   [
   <Invalid array: Buffer #1 too small in array of type bool and length 2: expected at least 1 byte(s), got 0>
   ]
   ```
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on issue #34634: [Python] pc.replace_with_mask produces invalid result when array is a boolean ChunkedArray

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #34634:
URL: https://github.com/apache/arrow/issues/34634#issuecomment-1477517937

   > According [to documentation](https://arrow.apache.org/docs/python/generated/pyarrow.compute.replace_with_mask.html) the following should be fulfilled: `len(replacements) == sum(mask == true)`.
   > In this case, `len(replacements) == 2` and `sum(mask==True) == 1`.
   
   Indeed, the usage in the top-post reproducer is actually wrong, but generally we currently ignore if the `replacements` array is too long (see also https://github.com/apache/arrow/issues/32436). And indeed, also if you provide the correct length, the same issue occurs:
   
   ```
   In [12]: import pyarrow as pa
       ...: import pyarrow.compute as pc
       ...: 
       ...: arr = pa.chunked_array([[True, True]])
       ...: mask = pa.array([False, True])
       ...: replacements = pa.array([False])
   
   In [13]: pc.replace_with_mask(arr, mask, replacements)
   Out[13]: 
   <pyarrow.lib.ChunkedArray object at 0x7f027c727650>
   [
   <Invalid array: Buffer #1 too small in array of type bool and length 2: expected at least 1 byte(s), got 0
   ]
   ```
   
   The support for chunked arrays in this kernel is generally limited (see https://github.com/apache/arrow/issues/31665), but providing a chunked array for the _input_ array should now work. The above example also works if I use a different type (eg int64) instead of boolean. So this seems to be an issue specifically with boolean input array.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pstorozenko commented on issue #34634: [Python] pc.replace_with_mask produces invalid result when array is a boolean ChunkedArray

Posted by "pstorozenko (via GitHub)" <gi...@apache.org>.
pstorozenko commented on issue #34634:
URL: https://github.com/apache/arrow/issues/34634#issuecomment-1476812472

   According [to documentation](https://arrow.apache.org/docs/python/generated/pyarrow.compute.replace_with_mask.html) the following should be fulfilled: `len(replacements) == sum(mask == true)`.
   In this case, `len(replacements) == 2` and `sum(mask==True) == 1`.
   However, error talks about something yet different.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on issue #34634: [C++] "replace_with_mask" kernel produces invalid result (wrong validity bitmap) when array is a boolean ChunkedArray

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #34634:
URL: https://github.com/apache/arrow/issues/34634#issuecomment-1477566537

   The same issue happens for the fill_forward/backward kernels:
   
   ```
   In [37]: pc.fill_null_forward(pa.array([True, False, None]))
   Out[37]: 
   <pyarrow.lib.BooleanArray object at 0x7f0276f33820>
   [
     true,
     false,
     false
   ]
   
   In [38]: pc.fill_null_forward(pa.chunked_array([[True, False, None]]))
   Out[38]: 
   <pyarrow.lib.ChunkedArray object at 0x7f0276ff56c0>
   [
   <Invalid array: Buffer #1 too small in array of type bool and length 3: expected at least 1 byte(s), got 0
   /home/joris/scipy/repos/arrow/cpp/src/arrow/array/validate.cc:116  ValidateLayout(*data.type)>
   ]
   ```
   
   Looking at the implementation, I think the issue is with pre-allocating the result arrays for fixed width types. This uses `->byte_width()`, but that only works for fixed width types with at least 1 byte per element, while boolean uses bits, and so this `->byte_width()` returns 0 (basically that should never be used if the type can be boolean):
   
   https://github.com/apache/arrow/blob/69118b2d26f2fbbbe65ad30dbe167d74b70fe791/cpp/src/arrow/compute/kernels/vector_replace.cc#L424-L426
   
   So in case of boolean arrays, we are allocating here a data buffer of length 0.
   
   ---
   
   @lukemanley or @pstorozenko if someone of you would like to try fixing this, I would be happy to provide some guidance 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pstorozenko commented on issue #34634: [Python] pc.replace_with_mask produces invalid result when array is a boolean ChunkedArray

Posted by "pstorozenko (via GitHub)" <gi...@apache.org>.
pstorozenko commented on issue #34634:
URL: https://github.com/apache/arrow/issues/34634#issuecomment-1476820862

   Anyway, the following code in R yields the same error.
   
   ```r
   library(arrow)
   arr <- chunked_array(c(TRUE, TRUE))
   mask <- Array$create(c(FALSE, TRUE))
   replacements <- Array$create(c(FALSE, FALSE))
   
   call_function("replace_with_mask", arr, mask, replacements)
   ```
   
   ```
   ChunkedArray
   <bool>
   [
   <Invalid array: Buffer #1 too small in array of type bool and length 2: expected at least 1 byte(s), got 0>
   ]
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lukemanley commented on issue #34634: [C++] "replace_with_mask" kernel produces invalid result (wrong validity bitmap) when array is a boolean ChunkedArray

Posted by "lukemanley (via GitHub)" <gi...@apache.org>.
lukemanley commented on issue #34634:
URL: https://github.com/apache/arrow/issues/34634#issuecomment-1487642244

   > @lukemanley or @pstorozenko if someone of you would like to try fixing this, I would be happy to provide some guidance
   
   Thanks for the offer. I'm not sure I will get around to this as I'm not all that familiar with cpp. Based on a quick glance, I assume something like the following (from vector_selection.cpp) is needed:
   
   https://github.com/apache/arrow/blob/69118b2d26f2fbbbe65ad30dbe167d74b70fe791/cpp/src/arrow/compute/kernels/vector_selection.cc#L252-L256


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org