Posted to github@arrow.apache.org by "jorisvandenbossche (via GitHub)" <gi...@apache.org> on 2023/04/06 12:27:14 UTC

[GitHub] [arrow] jorisvandenbossche commented on issue #14991: [Python] pyarrow.compute.utf8_slice_codeunits fails when stop=None

jorisvandenbossche commented on issue #14991:
URL: https://github.com/apache/arrow/issues/14991#issuecomment-1498983117

   Using my development version to get a bit more informative traceback:
   
   ```
   ArrowInvalid: Negative buffer resize: -4
   /home/joris/scipy/repos/arrow/cpp/src/arrow/memory_pool.cc:931  buffer->Resize(size)
   /home/joris/scipy/repos/arrow/cpp/src/arrow/compute/kernels/scalar_string_internal.h:88  ctx->Allocate(max_output_ncodeunits)
   /home/joris/scipy/repos/arrow/cpp/src/arrow/compute/exec.cc:920  kernel_->exec(kernel_ctx_, input, &output)
   /home/joris/scipy/repos/arrow/cpp/src/arrow/compute/function.cc:276  executor->Execute(input, &listener)
   ```
   
   So if `max_output_ncodeunits` is -4, we might have run into some integer overflow while calculating that value:
   
   https://github.com/apache/arrow/blob/e2afb8cc04acec4cc14235b0973a5bc86b37d157/cpp/src/arrow/compute/kernels/scalar_string_utf8.cc#L1085-L1097
   
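   Reproducing that logic in Python, using NumPy `int64` scalars (`np` is `numpy`) so that the arithmetic wraps around like it does in C++: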
   
   ```
   In [11]: import sys
   
   In [12]: stop = np.int64(sys.maxsize)
   
   In [13]: start = np.int64(0)
   
   In [14]: step = np.int64(1)
   
   In [19]: max_slice_codepoints = (stop - start + step - 1) // step
   <ipython-input-19-0fd4a0c6e713>:1: RuntimeWarning: overflow encountered in scalar add
     max_slice_codepoints = (stop - start + step - 1) // step
   <ipython-input-19-0fd4a0c6e713>:1: RuntimeWarning: overflow encountered in scalar subtract
     max_slice_codepoints = (stop - start + step - 1) // step
   
   In [20]: max_slice_codepoints
   Out[20]: 9223372036854775807
   
   In [21]: 4 * max_slice_codepoints
   <ipython-input-21-240e76cab6f7>:1: RuntimeWarning: overflow encountered in scalar multiply
     4 * max_slice_codepoints
   Out[21]: -4
   ```
   
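   So indeed, multiple steps here are overflowing: first the addition and subtraction when computing `max_slice_codepoints`, then the multiplication by 4. We will need to refactor this calculation a bit (there are utilities like `MultiplyWithOverflow` for overflow-safe calculations that could be used here).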
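
To illustrate what an overflow-checked version of this calculation could look like, here is a minimal Python sketch. It is not the Arrow implementation: the helper names (`wrap_int64`, `checked_int64`, and the two `max_output_ncodeunits_*` functions) are hypothetical and only emulate signed 64-bit arithmetic with Python's arbitrary-precision integers. It assumes a positive `step`, as in the session above.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def wrap_int64(value):
    """Reduce a Python int to what a wrapping signed 64-bit integer would hold."""
    return (value - INT64_MIN) % 2**64 + INT64_MIN


def checked_int64(value):
    """Return value unchanged, or raise if it does not fit in a signed 64-bit integer."""
    if not INT64_MIN <= value <= INT64_MAX:
        raise OverflowError(f"{value} does not fit in a signed 64-bit integer")
    return value


def max_output_ncodeunits_wrapping(start, stop, step):
    """Same steps as the session above, with silent wraparound after each operation."""
    n = wrap_int64(stop - start)
    n = wrap_int64(n + step)
    n = wrap_int64(n - 1)
    n = n // step  # assumes step > 0
    return wrap_int64(4 * n)


def max_output_ncodeunits_checked(start, stop, step):
    """Same steps, but every intermediate result is checked (hypothetical sketch)."""
    n = checked_int64(stop - start)
    n = checked_int64(n + step)
    n = checked_int64(n - 1)
    n = n // step  # assumes step > 0
    return checked_int64(4 * n)


# The wrapping variant reproduces the negative size from the traceback.
print(max_output_ncodeunits_wrapping(0, INT64_MAX, 1))

# The checked variant reports the problem instead of returning a bogus size.
try:
    max_output_ncodeunits_checked(0, INT64_MAX, 1)
except OverflowError as exc:
    print("overflow detected:", exc)
```

In C++ the checks would be done with overflow-aware helpers rather than by widening the integers. An alternative would be to avoid the large intermediate values altogether, for example by clamping `stop` to the input length before doing any arithmetic.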

