Posted to issues@arrow.apache.org by "wirable23 (via GitHub)" <gi...@apache.org> on 2023/06/26 21:54:43 UTC
[GitHub] [arrow] wirable23 opened a new issue, #36311: utf8_slice_codeunits produces invalid unicode sequence
wirable23 opened a new issue, #36311:
URL: https://github.com/apache/arrow/issues/36311
### Describe the bug, including details regarding any error messages, version, and platform.
```
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> pa.compute.utf8_slice_codeunits(f"AB{chr(127917)}C{chr(127917)}ㇱD", start=2, stop=None, step=4).as_py()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow\scalar.pxi", line 632, in pyarrow.lib.StringScalar.as_py
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 4: invalid start byte
>>>
```
`utf8_slice_codeunits` returns bytes that are not valid UTF-8, so decoding the result with `as_py()` fails.
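For comparison, here is a minimal pure-Python sketch of the result one would expect. It assumes that the kernel's `start`, `stop` and `step` count Unicode codepoints, the way Python's own `str` slicing does; that assumption is not confirmed in this report, and the variable names are only illustrative.

```python
# Reference behaviour in pure Python (no pyarrow needed).
# Assumption: start/stop/step count Unicode codepoints, like str slicing.
s = f"AB{chr(127917)}C{chr(127917)}ㇱD"

# start=2, stop=None, step=4 selects the codepoints at indices 2 and 6.
expected = s[2::4]

# A correct result must survive a UTF-8 round trip.
encoded = expected.encode("utf-8")
assert encoded.decode("utf-8") == expected

print(len(s), repr(expected), len(encoded))
```

Under that assumption the expected value is two codepoints (U+1F3AD followed by `D`), which take five bytes in UTF-8.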
### Component(s)
Python
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on issue #36311: utf8_slice_codeunits produces invalid unicode sequence
Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #36311:
URL: https://github.com/apache/arrow/issues/36311#issuecomment-1609987745
Even a trivial identity slice fails:
```python
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> s = f"AB{chr(127917)}C{chr(127917)}ㇱD"
>>> a = pa.array([s])
>>> pc.utf8_slice_codeunits(a, start=0, stop=None)
Traceback (most recent call last):
Cell In[11], line 1
pc.utf8_slice_codeunits(a, start=0, stop=None)
File ~/arrow/dev/python/pyarrow/compute.py:261 in wrapper
return func.call(args, options, memory_pool)
File pyarrow/_compute.pyx:367 in pyarrow._compute.Function.call
result = GetResultValue(
File pyarrow/error.pxi:144 in pyarrow.lib.pyarrow_internal_check_status
return check_status(status)
File pyarrow/error.pxi:100 in pyarrow.lib.check_status
raise ArrowInvalid(message)
ArrowInvalid: Negative buffer resize: -4
/home/antoine/arrow/dev/cpp/src/arrow/memory_pool.cc:931 buffer->Resize(size)
/home/antoine/arrow/dev/cpp/src/arrow/compute/kernels/scalar_string_internal.h:88 ctx->Allocate(max_output_ncodeunits)
/home/antoine/arrow/dev/cpp/src/arrow/compute/exec.cc:920 kernel_->exec(kernel_ctx_, input, &output)
/home/antoine/arrow/dev/cpp/src/arrow/compute/function.cc:276 executor->Execute(input, &listener)
```
[GitHub] [arrow] raulcd commented on issue #36311: [C++][Python] utf8_slice_codeunits produces invalid unicode sequence
Posted by "raulcd (via GitHub)" <gi...@apache.org>.
raulcd commented on issue #36311:
URL: https://github.com/apache/arrow/issues/36311#issuecomment-1625109458
@pitrou should this be a blocker?
[GitHub] [arrow] pitrou commented on issue #36311: utf8_slice_codeunits produces invalid unicode sequence
Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #36311:
URL: https://github.com/apache/arrow/issues/36311#issuecomment-1609989053
@felipecrv @benibus Perhaps one of you is interested as well.
[GitHub] [arrow] pitrou closed issue #36311: [C++][Python] utf8_slice_codeunits produces invalid unicode sequence
Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou closed issue #36311: [C++][Python] utf8_slice_codeunits produces invalid unicode sequence
URL: https://github.com/apache/arrow/issues/36311
[GitHub] [arrow] benibus commented on issue #36311: utf8_slice_codeunits produces invalid unicode sequence
Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on issue #36311:
URL: https://github.com/apache/arrow/issues/36311#issuecomment-1610075413
I can try taking a look. It might be nice to do some Unicode debugging again.
[GitHub] [arrow] benibus commented on issue #36311: [C++][Python] utf8_slice_codeunits produces invalid unicode sequence
Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on issue #36311:
URL: https://github.com/apache/arrow/issues/36311#issuecomment-1625715248
A fix for this is in progress. It seems the default value for `SliceOptions::stop` (which is `INT64_MAX`) isn't handled in a few calculations, resulting in overflows.
It seems kind of crazy that this hasn't come up before, actually, since the specific Unicode sequence doesn't appear to be relevant.
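To illustrate the kind of overflow described here, the following is a rough sketch, not Arrow's actual code. The rounding-up length formula and the helper names (`wrap_int64`, `naive_slice_length`, `clamped_slice_length`) are hypothetical, and only a positive `step` is considered. It emulates signed 64-bit wraparound in Python and shows how an unclamped `stop` of `INT64_MAX` can turn a length computation negative.

```python
INT64_MAX = 2**63 - 1


def wrap_int64(value: int) -> int:
    """Reduce a Python int to the value a wrapping signed 64-bit integer would hold."""
    return (value + 2**63) % 2**64 - 2**63


def naive_slice_length(start: int, stop: int, step: int) -> int:
    # Hypothetical rounding-up division, with every intermediate result
    # wrapped to 64 bits. Assumes step > 0.
    span = wrap_int64(stop - start)
    return wrap_int64(span + step - 1) // step


def clamped_slice_length(start: int, stop: int, step: int, input_length: int) -> int:
    # Clamp stop (and start) to the input length before doing any arithmetic,
    # so the INT64_MAX sentinel never enters the calculation. Assumes step > 0.
    stop = min(stop, input_length)
    start = min(start, stop)
    return (stop - start + step - 1) // step


# start=2, step=4 on a 7-codepoint input, with the default stop.
print(naive_slice_length(2, INT64_MAX, 4))
print(clamped_slice_length(2, INT64_MAX, 4, input_length=7))
```

With the sentinel left unclamped, the intermediate sum exceeds `INT64_MAX` and wraps to a negative value; clamping first keeps the result at the two codepoints actually selected.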
[GitHub] [arrow] pitrou commented on issue #36311: utf8_slice_codeunits produces invalid unicode sequence
Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #36311:
URL: https://github.com/apache/arrow/issues/36311#issuecomment-1609981981
It gets even better in debug mode:
```python
>>> import pyarrow.compute as pc
>>> import pyarrow as pa
>>>
>>> pa.compute.utf8_slice_codeunits(f"AB{chr(127917)}C{chr(127917)}ㇱD", start=2, stop=None, step=4)
/home/antoine/arrow/dev/cpp/src/arrow/compute/kernels/scalar_string_internal.h:109: Check failed: (output_ncodeunits) <= (max_output_ncodeunits)
```
@rok Would you like to take a look?
[GitHub] [arrow] rok commented on issue #36311: utf8_slice_codeunits produces invalid unicode sequence
Posted by "rok (via GitHub)" <gi...@apache.org>.
rok commented on issue #36311:
URL: https://github.com/apache/arrow/issues/36311#issuecomment-1614543313
I can take a look but need to complete something else first. Will ping if I start work.
[GitHub] [arrow] pitrou commented on issue #36311: [C++][Python] utf8_slice_codeunits produces invalid unicode sequence
Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #36311:
URL: https://github.com/apache/arrow/issues/36311#issuecomment-1625467969
I don't think it needs to be a blocker.