Posted to issues@arrow.apache.org by "wirable23 (via GitHub)" <gi...@apache.org> on 2023/06/26 21:54:43 UTC

[GitHub] [arrow] wirable23 opened a new issue, #36311: utf8_slice_codeunits produces invalid unicode sequence

wirable23 opened a new issue, #36311:
URL: https://github.com/apache/arrow/issues/36311

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   ```
   >>> import pyarrow as pa
   >>> import pyarrow.compute as pc
   >>> pa.compute.utf8_slice_codeunits(f"AB{chr(127917)}C{chr(127917)}ㇱD", start=2, stop=None, step=4).as_py()
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "pyarrow\scalar.pxi", line 632, in pyarrow.lib.StringScalar.as_py
   UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 4: invalid start byte
   ```
   
   `utf8_slice_codeunits` returned a string that is not valid UTF-8, so `as_py()` fails when decoding it.
   
   
   ### Component(s)
   
   Python
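
For comparison with the failing call in the report, here is a minimal pure-Python sketch of the result one would expect. It assumes the kernel is meant to slice by Unicode codepoints the way Python's `str` slicing does; that assumption is not confirmed in this thread, and the variable names are illustrative only.

```python
# Same input string as in the report: 7 codepoints in total.
s = f"AB{chr(127917)}C{chr(127917)}ㇱD"

# Assumption: start=2, stop=None, step=4 should behave like s[2::4],
# i.e. pick the codepoints at indices 2 and 6.
expected = s[2::4]

# A correct result must survive a UTF-8 round trip without errors.
round_tripped = expected.encode("utf-8").decode("utf-8")

print(len(s))
print(round_tripped == expected)
```

Under that assumption the slice holds two codepoints, `chr(127917)` followed by `D`, and both encode to well-formed UTF-8.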


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on issue #36311: utf8_slice_codeunits produces invalid unicode sequence

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #36311:
URL: https://github.com/apache/arrow/issues/36311#issuecomment-1609987745

   Even a trivial identity slice fails:
   ```python
   >>> s = f"AB{chr(127917)}C{chr(127917)}ㇱD"
   >>> a = pa.array([s])
   >>> pc.utf8_slice_codeunits(a, start=0, stop=None)
   Traceback (most recent call last):
     Cell In[11], line 1
       pc.utf8_slice_codeunits(a, start=0, stop=None)
     File ~/arrow/dev/python/pyarrow/compute.py:261 in wrapper
       return func.call(args, options, memory_pool)
     File pyarrow/_compute.pyx:367 in pyarrow._compute.Function.call
       result = GetResultValue(
     File pyarrow/error.pxi:144 in pyarrow.lib.pyarrow_internal_check_status
       return check_status(status)
     File pyarrow/error.pxi:100 in pyarrow.lib.check_status
       raise ArrowInvalid(message)
   ArrowInvalid: Negative buffer resize: -4
   /home/antoine/arrow/dev/cpp/src/arrow/memory_pool.cc:931  buffer->Resize(size)
   /home/antoine/arrow/dev/cpp/src/arrow/compute/kernels/scalar_string_internal.h:88  ctx->Allocate(max_output_ncodeunits)
   /home/antoine/arrow/dev/cpp/src/arrow/compute/exec.cc:920  kernel_->exec(kernel_ctx_, input, &output)
   /home/antoine/arrow/dev/cpp/src/arrow/compute/function.cc:276  executor->Execute(input, &listener)
   ```




[GitHub] [arrow] raulcd commented on issue #36311: [C++][Python] utf8_slice_codeunits produces invalid unicode sequence

Posted by "raulcd (via GitHub)" <gi...@apache.org>.
raulcd commented on issue #36311:
URL: https://github.com/apache/arrow/issues/36311#issuecomment-1625109458

   @pitrou should this be a blocker?




[GitHub] [arrow] pitrou commented on issue #36311: utf8_slice_codeunits produces invalid unicode sequence

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #36311:
URL: https://github.com/apache/arrow/issues/36311#issuecomment-1609989053

   @felipecrv @benibus Perhaps one of you is interested as well.




[GitHub] [arrow] pitrou closed issue #36311: [C++][Python] utf8_slice_codeunits produces invalid unicode sequence

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou closed issue #36311: [C++][Python] utf8_slice_codeunits produces invalid unicode sequence
URL: https://github.com/apache/arrow/issues/36311




[GitHub] [arrow] benibus commented on issue #36311: utf8_slice_codeunits produces invalid unicode sequence

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on issue #36311:
URL: https://github.com/apache/arrow/issues/36311#issuecomment-1610075413

   I can try taking a look... might be nice to do some Unicode debugging again.




[GitHub] [arrow] benibus commented on issue #36311: [C++][Python] utf8_slice_codeunits produces invalid unicode sequence

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on issue #36311:
URL: https://github.com/apache/arrow/issues/36311#issuecomment-1625715248

   A fix for this is in progress. It seems the default value for `SliceOptions::stop` (which is `INT64_MAX`) isn't handled in a few calculations, resulting in overflows.
   
   Seems kinda crazy that this hasn't come up before, actually, since the specific Unicode sequence doesn't appear to be relevant.
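
To illustrate the kind of overflow described above, here is a minimal Python sketch. It is not taken from the Arrow source: `wrap_int64`, `naive_count`, and `clamped_count` are hypothetical helpers, the ceiling-division formula is only an assumed example of a length calculation, and the sketch assumes a non-negative `start` and a positive `step`. Python integers do not overflow, so the wraparound of a two's-complement 64-bit value is simulated explicitly.

```python
INT64_MAX = 2**63 - 1


def wrap_int64(value):
    """Return the value a two's-complement int64 would hold after wraparound."""
    return (value + 2**63) % 2**64 - 2**63


def naive_count(start, stop, step):
    # Hypothetical formula: ceil((stop - start) / step), computed in int64
    # without clamping `stop` first.
    return wrap_int64(stop - start + step - 1) // step


def clamped_count(start, stop, step, length):
    # Clamp `stop` to the input length before doing any arithmetic, so a
    # sentinel such as INT64_MAX never enters the calculation.
    stop = min(stop, length)
    return max(0, (stop - start + step - 1) // step)


# start=2, step=4 on a 7-codepoint string, with stop left at its default.
print(naive_count(2, INT64_MAX, 4) < 0)
print(clamped_count(2, INT64_MAX, 4, length=7))
```

With `stop` left at `INT64_MAX`, the unclamped numerator wraps to a negative number, which is consistent with symptoms like a negative buffer size. Clamping first gives 2, matching the two codepoints at indices 2 and 6.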




[GitHub] [arrow] pitrou commented on issue #36311: utf8_slice_codeunits produces invalid unicode sequence

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #36311:
URL: https://github.com/apache/arrow/issues/36311#issuecomment-1609981981

   It gets even better in debug mode:
   ```python
   >>> import pyarrow.compute as pc
   >>> import pyarrow as pa
   >>> 
   >>> pa.compute.utf8_slice_codeunits(f"AB{chr(127917)}C{chr(127917)}ㇱD", start=2, stop=None, step=4)
   /home/antoine/arrow/dev/cpp/src/arrow/compute/kernels/scalar_string_internal.h:109:  Check failed: (output_ncodeunits) <= (max_output_ncodeunits) 
   ```
   
   @rok Would you like to take a look?




[GitHub] [arrow] rok commented on issue #36311: utf8_slice_codeunits produces invalid unicode sequence

Posted by "rok (via GitHub)" <gi...@apache.org>.
rok commented on issue #36311:
URL: https://github.com/apache/arrow/issues/36311#issuecomment-1614543313

   I can take a look but need to complete something else first. Will ping if I start work.




[GitHub] [arrow] pitrou commented on issue #36311: [C++][Python] utf8_slice_codeunits produces invalid unicode sequence

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #36311:
URL: https://github.com/apache/arrow/issues/36311#issuecomment-1625467969

   I don't think it needs to.

