You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/25 19:22:52 UTC

[GitHub] [arrow] douglas-raillard-arm opened a new issue, #14736: pyarrow.compute.struct_field crashing

douglas-raillard-arm opened a new issue, #14736:
URL: https://github.com/apache/arrow/issues/14736

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Calling `pyarrow.compute.struct_field` crashes on some `ChunkedArray`.
   
   versions:
   Python: 3.8 and 3.11 (same behavior)
   pyarrow: 10.0.1
   
   Here is a reproducer:
   
   Feather file triggering the crash:
   [crash.feather.gz](https://github.com/apache/arrow/files/10094787/crash.feather.gz)
   
   ```python
   import pyarrow as pa
   from pyarrow.compute import struct_field
   import pyarrow.feather
   
   
   path = "crash.feather"
   table = pa.feather.read_table(path)
   print(table)
   
   # This crashes
   struct_field(struct_field(struct_field(table[0], [3]), [2]), [1])
   
   # And that crashes as well
   struct_field(struct_field(struct_field(table[0], [3]), [3]), [0])
   ```
   
   The backtrace in both cases is the same:
   ```
   /arrow/cpp/src/arrow/array/array_nested.cc:686:  Check failed: (data_->buffers[0]) == (nullptr) 
   /lisa/.lisa-venv-3.11/lib/python3.11/site-packages/pyarrow/libarrow.so.1000(+0xfc30c8)[0x7f852ed830c8]
   /lisa/.lisa-venv-3.11/lib/python3.11/site-packages/pyarrow/libarrow.so.1000(_ZN5arrow4util8ArrowLogD1Ev+0xed)[0x7f852ed8347d]
   /lisa/.lisa-venv-3.11/lib/python3.11/site-packages/pyarrow/libarrow.so.1000(_ZN5arrow16SparseUnionArray7SetDataESt10shared_ptrINS_9ArrayDataEE+0x154)[0x7f852efbb664]
   /lisa/.lisa-venv-3.11/lib/python3.11/site-packages/pyarrow/libarrow.so.1000(_ZN5arrow16SparseUnionArrayC1ESt10shared_ptrINS_9ArrayDataEE+0x53)[0x7f852efbb8c3]
   /lisa/.lisa-venv-3.11/lib/python3.11/site-packages/pyarrow/libarrow.so.1000(_ZN5arrow9MakeArrayERKSt10shared_ptrINS_9ArrayDataEE+0xe4f)[0x7f852eeb4d2f]
   /lisa/.lisa-venv-3.11/lib/python3.11/site-packages/pyarrow/libarrow.so.1000(_ZNK5arrow11StructArray17GetFlattenedFieldEiPNS_10MemoryPoolE+0x209)[0x7f852ef9ddc9]
   /lisa/.lisa-venv-3.11/lib/python3.11/site-packages/pyarrow/libarrow.so.1000(+0x974159)[0x7f852e734159]
   /lisa/.lisa-venv-3.11/lib/python3.11/site-packages/pyarrow/libarrow.so.1000(+0xef1f29)[0x7f852ecb1f29]
   /lisa/.lisa-venv-3.11/lib/python3.11/site-packages/pyarrow/libarrow.so.1000(+0xd9568e)[0x7f852eb5568e]
   /lisa/.lisa-venv-3.11/lib/python3.11/site-packages/pyarrow/libarrow.so.1000(_ZNK5arrow7compute8Function7ExecuteERKSt6vectorINS_5DatumESaIS3_EEPKNS0_15FunctionOptionsEPNS0_11ExecContextE+0xf2)[0x7f852eb64612]
   lisa/.lisa-venv-3.11/lib/python3.11/site-packages/pyarrow/_compute.cpython-311-x86_64-linux-gnu.so(+0x89d95)[0x7f8513983d95]
   python[0x62cfe8]
   python(PyObject_Vectorcall+0x35)[0x63a9d5]
   python(_PyEval_EvalFrameDefault+0x76d)[0x58ab6d]
   python[0x6dd49f]
   python(PyEval_EvalCode+0x97)[0x6dd727]
   python[0x6b7367]
   python[0x6b73f0]
   python[0x6b760b]
   python(_PyRun_SimpleFileObject+0x194)[0x6b7bf4]
   python(_PyRun_AnyFileObject+0x47)[0x6b7cb7]
   python(Py_RunMain+0x2bc)[0x6af99c]
   python(Py_BytesMain+0x2d)[0x6afb4d]
   /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f8531ce2083]
   python(_start+0x2e)[0x664e6e]
   ```
   
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on issue #14736: pyarrow.compute.struct_field crashing

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on issue #14736:
URL: https://github.com/apache/arrow/issues/14736#issuecomment-1331827206

   Thanks for the report! I can reproduce it also with the latest dev version.
   
   cc @milesgranger 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on issue #14736: [C++] pyarrow.compute.struct_field crashing

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #14736:
URL: https://github.com/apache/arrow/issues/14736#issuecomment-1415879965

   To summarize this issue (see also the discussion on this closed PR:https://github.com/apache/arrow/pull/14838/files#r1039544789): when selecting a field out of a StructArray (`StructArray::GetFlattenedField` in C++), the top-level validity bitmap of the struct array needs to be combined with the validity bitmap of the child array. For most types this means combining the two bitmaps (`BitmapAnd`) and setting that on the resulting field array. However, in case the child field is a UnionArray, this is more complicated, because a union array itself doesn't have a validity bitmap, only each of its childs has one (https://arrow.apache.org/docs/dev/format/Columnar.html#union-layout). So to combine the parent bitmap with the union field, it has to be combined with the bitmap of each of the union's child arrays. 
   Currently, the code in `GetFlattenedField` just sets the and-ed bitmaps on the returned array:
   
   https://github.com/apache/arrow/blob/54ff2d8777717ea5bb811f3653deeb12fc93452e/cpp/src/arrow/array/array_nested.cc#L661
   
   In case `flattened_data` is the data for a UnionArray, this violates the expectation that the first buffer (validity bitmap) is always null for unions, and this causes a crash.
   
   A minimal reproducer to get the crash:
   
   ```python
   binary = pa.array([b'a', b' ', b'b', b'c', b' ', b' ', b'd'], type='binary')
   int64 = pa.array([0, 1, 0, 0, 2, 3, 0], type='int64')
   types = pa.array([0, 1, 0, 0, 1, 1, 0], type='int8')
   union_array = pa.UnionArray.from_sparse(types, [binary, int64], ['bin', 'int'])
   
   int_array = pa.array(range(7))
   # struct array with union array child and a validity bitmap that is present
   struct_array = pa.StructArray.from_arrays(
       [int_array, union_array], names=["int", "union"], mask=pa.array([False]*7)
   )
   struct_array.type
   # StructType(struct<int: int64, union: sparse_union<bin: binary=0, int: int64=1>>)
   
   import pyarrow.compute as pc
   # using struct_field() kernel to select a field -> works for int field
   pc.struct_field(struct_array, ["int"])
   # crashes for union field
   pc.struct_field(struct_array, ["union"])
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] douglas-raillard-arm commented on issue #14736: pyarrow.compute.struct_field crashing

Posted by GitBox <gi...@apache.org>.
douglas-raillard-arm commented on issue #14736:
URL: https://github.com/apache/arrow/issues/14736#issuecomment-1327868979

   If anyone else is affected, I found a workaround:
   1. Call `.combine_chunks()` on the ChunkedArray to turn it into a StructArray
   2. Call `.field(i)` on the StructArray to get each field. `.type.num_fields` gives the number of fields to iterate for.
   
   I guess this has some overhead compared to just calling flatten() as this forces to combine the chunks first but it's better than triggering an assert


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] douglas-raillard-arm commented on issue #14736: pyarrow.compute.struct_field crashing

Posted by GitBox <gi...@apache.org>.
douglas-raillard-arm commented on issue #14736:
URL: https://github.com/apache/arrow/issues/14736#issuecomment-1327821180

   I also tried with an equivalent feather file created with dense unions instead of sparse ones and it crashes similarly (but with different backtrace):
   [crash.dense.feather.gz](https://github.com/apache/arrow/files/10094799/crash.dense.feather.gz)
   
   ```/arrow/cpp/src/arrow/array/array_nested.cc:696:  Check failed: (data_->buffers[0]) == (nullptr) 
   lisa/.lisa-venv-3.11/lib/python3.11/site-packages/pyarrow/libarrow.so.1000(+0xfc30c8)[0x7ff4d02a60c8]
   lisa/.lisa-venv-3.11/lib/python3.11/site-packages/pyarrow/libarrow.so.1000(_ZN5arrow4util8ArrowLogD1Ev+0xed)[0x7ff4d02a647d]
   lisa/.lisa-venv-3.11/lib/python3.11/site-packages/pyarrow/libarrow.so.1000(_ZN5arrow15DenseUnionArray7SetDataERKSt10shared_ptrINS_9ArrayDataEE+0x104)[0x7ff4d04df114]
   lisa/.lisa-venv-3.11/lib/python3.11/site-packages/pyarrow/libarrow.so.1000(_ZN5arrow15DenseUnionArrayC2ERKSt10shared_ptrINS_9ArrayDataEE+0x45)[0x7ff4d04df3f5]
   lisa/.lisa-venv-3.11/lib/python3.11/site-packages/pyarrow/libarrow.so.1000(_ZN5arrow9MakeArrayERKSt10shared_ptrINS_9ArrayDataEE+0xde5)[0x7ff4d03d7cc5]
   lisa/.lisa-venv-3.11/lib/python3.11/site-packages/pyarrow/libarrow.so.1000(_ZNK5arrow11StructArray17GetFlattenedFieldEiPNS_10MemoryPoolE+0x209)[0x7ff4d04c0dc9]
   lisa/.lisa-venv-3.11/lib/python3.11/site-packages/pyarrow/libarrow.so.1000(_ZNK5arrow11StructArray7FlattenEPNS_10MemoryPoolE+0xdd)[0x7ff4d04c109d]
   lisa/.lisa-venv-3.11/lib/python3.11/site-packages/pyarrow/lib.cpython-311-x86_64-linux-gnu.so(+0x177824)[0x7ff4d25d8824]
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] douglas-raillard-arm commented on issue #14736: pyarrow.compute.struct_field crashing

Posted by GitBox <gi...@apache.org>.
douglas-raillard-arm commented on issue #14736:
URL: https://github.com/apache/arrow/issues/14736#issuecomment-1327839203

   Also note that calling `ChunkedArray.flatten()` crashes as well on the offending arrays.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org