Posted to issues@arrow.apache.org by "bdice (via GitHub)" <gi...@apache.org> on 2023/04/06 20:19:55 UTC

[GitHub] [arrow] bdice opened a new issue, #34944: PyArrow pa.array fails intermittently with custom iterable object

bdice opened a new issue, #34944:
URL: https://github.com/apache/arrow/issues/34944

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I worked with @shwina recently on a problem we saw in cudf, and we identified a bug in PyArrow. The bug can be reproduced (but only intermittently) with the following snippet:
   
   ```python
   import pyarrow as pa
   
   class A:
       def __getitem__(self, key):
           return 3
   
   pa.array(A())
   ```
   
   I can run this snippet under `pdb` by inserting `breakpoint()` before `pa.array(A())` and then continuing from the breakpoint by pressing `c`. With `pdb`, I consistently get errors like:
   ```
   WARNING: Logging before InitGoogleLogging() is written to STDERR
   F20230406 15:12:07.595571 153661 inference.cc:348]  Check failed: _s.ok() Operation failed: internal::ImportDecimalType(&decimal_type_)
   Bad status: Unknown error: <built-in function __import__> returned a result with an exception set. Detail: Python exception: SystemError
   *** Check failure stack trace: ***
   ```
   
   Without `pdb`, the error is _sometimes_ this one (which I expect) and _sometimes_ the crash shown above.
   ```python
   Traceback (most recent call last):
     File "/home/bdice/issue.py", line 16, in <module>
       pa.array(A())
     File "pyarrow/array.pxi", line 320, in pyarrow.lib.array
     File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
     File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
   TypeError: object of type 'A' has no len()
   ```
   
   We want a `TypeError` to be raised rather than getting the crash above under `pdb`. The crash is reproducible without `pdb` if the snippet is run repeatedly. This might suggest some kind of memory corruption is happening behind the scenes.
   
   ### Component(s)
   
   C++, Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on issue #34944: [Python] PyArrow pa.array fails intermittently with custom iterable object

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #34944:
URL: https://github.com/apache/arrow/issues/34944#issuecomment-1500079228

   @bdice thanks for the report! I can reproduce this (I actually seem to get the crash consistently)
   
   It might also be an upstream Python issue? Because if I replace `pa.array(A())` with `list(A())`, this hangs / blows up memory. 
   You call it an "iterable" object, but since it doesn't have a length, it will iterate indefinitely?
   
   In the C++ code handling generic input, we first try to convert any python object to an actual sequence:
   
   https://github.com/apache/arrow/blob/7526df9ad97219cc44f9b460887405f2b0e86fd4/python/pyarrow/src/arrow/python/python_to_arrow.cc#L1098-L1101
   
   It seems that this object passes the `if (PySequence_Check(obj))` check, which I find a bit surprising (in pure Python, it's not considered a `Sequence` when checked against `collections.abc`)
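
The behaviour discussed in this comment can be checked in plain Python, without PyArrow. The following is a minimal sketch, assuming CPython; the class `A` is copied from the report, and the variable names are illustrative only. It uses `itertools.islice` to stop the otherwise endless iteration after five items.

```python
import itertools
from collections.abc import Sequence


class A:
    def __getitem__(self, key):
        return 3


a = A()

# A is not registered as (or derived from) collections.abc.Sequence.
is_abc_sequence = isinstance(a, Sequence)

# iter() falls back to the legacy __getitem__ protocol: it calls a[0], a[1], ...
# until IndexError is raised. A never raises it, so iteration never ends on its own.
first_five = list(itertools.islice(a, 5))

# len() fails because A defines no __len__.
try:
    len(a)
    len_error = None
except TypeError as exc:
    len_error = str(exc)

print(is_abc_sequence, first_five, len_error)
```

Under these assumptions, `list(A())` would keep appending items until memory runs out, which matches the hang described above.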
   
   




[GitHub] [arrow] jorisvandenbossche closed issue #34944: [Python] PyArrow pa.array fails intermittently with custom iterable object

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche closed issue #34944: [Python] PyArrow pa.array fails intermittently with custom iterable object
URL: https://github.com/apache/arrow/issues/34944




[GitHub] [arrow] jorisvandenbossche commented on issue #34944: [Python] PyArrow pa.array fails intermittently with custom iterable object

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #34944:
URL: https://github.com/apache/arrow/issues/34944#issuecomment-1500087960

   Ah, it's failing in the last line of the snippet linked above (the call to `PySequence_Size`), but we fail to check for a Python error afterwards (I suppose because the code assumed the call would always succeed after the preceding sequence check).
   Checking for a Python error and raising it correctly gives the "TypeError: object of type 'A' has no len()" error.
   
   Looking at the python C API docs, it indeed mentions that `PySequence_Check` always passes for objects with `__getitem__`: https://docs.python.org/3/c-api/sequence.html#c.PySequence_Check
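
The C API behaviour described here can be observed from Python through `ctypes.pythonapi`. This is a minimal sketch, assuming CPython; it is not the code from the PyArrow fix, and the variable names are illustrative. `ctypes.pythonapi` checks the Python error flag after each call, so the error that `PySequence_Size` sets surfaces as a regular exception.

```python
import ctypes


class A:
    def __getitem__(self, key):
        return 3


# int PySequence_Check(PyObject *o)
sequence_check = ctypes.pythonapi.PySequence_Check
sequence_check.argtypes = [ctypes.py_object]
sequence_check.restype = ctypes.c_int

# Py_ssize_t PySequence_Size(PyObject *o)
sequence_size = ctypes.pythonapi.PySequence_Size
sequence_size.argtypes = [ctypes.py_object]
sequence_size.restype = ctypes.c_ssize_t

# The check passes because A defines __getitem__.
passes_check = sequence_check(A())

# The size call fails; ctypes sees the pending Python error and raises it.
try:
    sequence_size(A())
    size_error = None
except TypeError as exc:
    size_error = str(exc)

print(passes_check, size_error)
```

In C or C++ code there is no such automatic check, so the caller has to test the return value of `PySequence_Size` (or the error indicator) itself before making further Python calls.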




[GitHub] [arrow] jorisvandenbossche commented on issue #34944: [Python] PyArrow pa.array fails intermittently with custom iterable object

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #34944:
URL: https://github.com/apache/arrow/issues/34944#issuecomment-1500094766

   PR at https://github.com/apache/arrow/pull/34958

