Posted to issues@arrow.apache.org by "bdice (via GitHub)" <gi...@apache.org> on 2023/04/06 20:19:55 UTC
[GitHub] [arrow] bdice opened a new issue, #34944: PyArrow pa.array fails intermittently with custom iterable object
bdice opened a new issue, #34944:
URL: https://github.com/apache/arrow/issues/34944
### Describe the bug, including details regarding any error messages, version, and platform.
I worked with @shwina recently on a problem we saw in cudf, and we identified a bug in PyArrow. The bug can be reproduced (but only intermittently) with the following snippet:
```python
import pyarrow as pa
class A:
    def __getitem__(self, key):
        return 3
pa.array(A())
```
I can run this snippet under `pdb` by inserting `breakpoint()` before `pa.array(A())` and then continuing from the breakpoint by pressing `c`. With `pdb`, I consistently get errors like:
```
WARNING: Logging before InitGoogleLogging() is written to STDERR
F20230406 15:12:07.595571 153661 inference.cc:348] Check failed: _s.ok() Operation failed: internal::ImportDecimalType(&decimal_type_)
Bad status: Unknown error: <built-in function __import__> returned a result with an exception set. Detail: Python exception: SystemError
*** Check failure stack trace: ***
```
Without `pdb`, the error is _sometimes_ this one (which I expect) and _sometimes_ the crash shown above.
```python
Traceback (most recent call last):
  File "/home/bdice/issue.py", line 16, in <module>
    pa.array(A())
  File "pyarrow/array.pxi", line 320, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
TypeError: object of type 'A' has no len()
We want a `TypeError` to be raised in every case instead of the crash shown above. The crash is consistent under `pdb`, and it is also reproducible without `pdb` if the snippet is run repeatedly. This might suggest that some kind of memory corruption is happening behind the scenes.
### Component(s)
C++, Python
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] jorisvandenbossche commented on issue #34944: [Python] PyArrow pa.array fails intermittently with custom iterable object
Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #34944:
URL: https://github.com/apache/arrow/issues/34944#issuecomment-1500079228
@bdice thanks for the report! I can reproduce this (I actually seem to get the crash consistently).
It might also be an upstream Python issue? If I replace `pa.array(A())` with `list(A())`, it hangs and blows up memory.
You call it an "iterable" object, but since it has no length and `__getitem__` never raises `IndexError`, it will iterate indefinitely?
In the C++ code handling generic input, we first try to convert any python object to an actual sequence:
https://github.com/apache/arrow/blob/7526df9ad97219cc44f9b460887405f2b0e86fd4/python/pyarrow/src/arrow/python/python_to_arrow.cc#L1098-L1101
It seems that this object is passing the `if (PySequence_Check(obj))` check, which I find a bit surprising (in pure Python, it's not considered a `Sequence` when checked against `collections.abc`).
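To make the point about iteration concrete, here is a minimal pure-Python sketch (not PyArrow code). It reuses the class `A` from the report and assumes only the standard library. It shows that an object defining just `__getitem__` is iterable through Python's legacy sequence protocol, that the iteration never terminates on its own, and that `collections.abc.Sequence` does not recognize the object. The names `first_five` and `is_abc_sequence` are illustrative.

```python
from collections.abc import Sequence
from itertools import islice


class A:
    def __getitem__(self, key):
        # Never raises IndexError, so legacy iteration never stops.
        return 3


a = A()

# iter() falls back to calling __getitem__(0), __getitem__(1), ...
# islice() bounds the otherwise endless iteration to five items.
first_five = list(islice(iter(a), 5))

# The abstract base class requires inheritance or explicit registration,
# so defining __getitem__ alone is not enough.
is_abc_sequence = isinstance(a, Sequence)

print(first_five)
print(is_abc_sequence)
```

This is why an unbounded `list(A())` keeps consuming memory: nothing ever signals the end of the sequence.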
[GitHub] [arrow] jorisvandenbossche closed issue #34944: [Python] PyArrow pa.array fails intermittently with custom iterable object
Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche closed issue #34944: [Python] PyArrow pa.array fails intermittently with custom iterable object
URL: https://github.com/apache/arrow/issues/34944
[GitHub] [arrow] jorisvandenbossche commented on issue #34944: [Python] PyArrow pa.array fails intermittently with custom iterable object
Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #34944:
URL: https://github.com/apache/arrow/issues/34944#issuecomment-1500087960
Ah, it's failing in the last line of the snippet I linked above (the call to `PySequence_Size`), but we fail to check for a Python error afterwards (I suppose because the code assumed the call would always succeed after the preceding sequence check).
Checking for a Python error and raising it correctly gives the "TypeError: object of type 'A' has no len()" error.
Looking at the Python C API docs, they indeed mention that `PySequence_Check` always passes for Python classes with a `__getitem__` method: https://docs.python.org/3/c-api/sequence.html#c.PySequence_Check
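The same behaviour can be observed from Python without PyArrow. The sketch below is an illustration only, not the Arrow C++ code or the fix in the PR. It assumes CPython and uses `ctypes.pythonapi` to call the two C API functions directly. `ctypes.pythonapi` checks the Python error indicator after each call and raises the pending exception, which is roughly the check the C++ code was missing. The names `passes_check` and `size_error` are illustrative.

```python
import ctypes


class A:
    def __getitem__(self, key):
        return 3


# CPython-only: bind the C API functions and declare their signatures.
api = ctypes.pythonapi
api.PySequence_Check.argtypes = [ctypes.py_object]
api.PySequence_Check.restype = ctypes.c_int
api.PySequence_Size.argtypes = [ctypes.py_object]
api.PySequence_Size.restype = ctypes.c_ssize_t

# The sequence check passes because the class defines __getitem__.
passes_check = api.PySequence_Check(A())

# PySequence_Size fails and sets a TypeError; ctypes.pythonapi sees the
# error indicator after the call and raises it as a normal exception.
try:
    api.PySequence_Size(A())
    size_error = None
except TypeError as exc:
    size_error = str(exc)

print(passes_check)
print(size_error)
```

In C or C++ there is no such automatic check, so a caller has to test the return value (or `PyErr_Occurred()`) itself; otherwise the exception stays set and surfaces later in an unrelated call, which is consistent with the `SystemError` from `__import__` in the original crash log.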
[GitHub] [arrow] jorisvandenbossche commented on issue #34944: [Python] PyArrow pa.array fails intermittently with custom iterable object
Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #34944:
URL: https://github.com/apache/arrow/issues/34944#issuecomment-1500094766
PR at https://github.com/apache/arrow/pull/34958