Posted to issues@arrow.apache.org by "alippai (via GitHub)" <gi...@apache.org> on 2023/06/29 15:12:50 UTC

[GitHub] [arrow] alippai opened a new issue, #36389: pq.write_to_dataset crash from pandas

alippai opened a new issue, #36389:
URL: https://github.com/apache/arrow/issues/36389

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Hi,
   
   I cannot get a simple pd.DataFrame -> Parquet dataset write (`pq.write_to_dataset`) working on pyarrow 11.0.0, 12.0.0, or 12.0.1.
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   import pandas as pd
   
   df = pd.DataFrame({"a": [0.0] * 100})
   pq.write_to_dataset(pa.Table.from_pandas(df), "/tmp/dump", use_threads=False)
   ```
   Running this script instantly yields:
   ```
   terminate called without an active exception
   Aborted (core dumped)
   ```
   
   Setting `use_threads`, changing the partitioning, calling `set_cpu_count`, and setting any of the env vars from https://arrow.apache.org/docs/cpp/env_vars.html#cpp-env-vars don't change the behavior. Python 3.10 and 3.11 both produce the error.
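   
   For illustration, one variant of this (the exact combination below is just a sketch; `ARROW_IO_THREADS` is one of the variables from the page above, and `set_cpu_count` / `set_io_thread_count` are the pyarrow-level equivalents):
   ```python
   import os
   
   # Must be set before pyarrow creates its IO thread pool.
   os.environ["ARROW_IO_THREADS"] = "1"
   
   import pyarrow as pa
   
   pa.set_cpu_count(1)        # shrink the compute thread pool
   pa.set_io_thread_count(1)  # shrink the IO thread pool
   # ... the same pq.write_to_dataset(...) call as above still aborts.
   ```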
   If I write:
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   import pandas as pd
   
   df = pd.DataFrame({"a": [0.0] * 100})
   t = pa.Table.from_pandas(df)
   pq.write_to_dataset(t, "/tmp/dump", use_threads=False)
   ```
   or 
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   import pandas as pd
   import time
   
   df = pd.DataFrame({"a": [0.0] * 100})
   pq.write_to_dataset(pa.Table.from_pandas(df), "/tmp/dump", use_threads=False)
   time.sleep(0.001)
   ```
   The errors are less frequent or gone entirely, so I assume it's about the interaction between pyarrow and the Python GC (some GIL magic?).
   
   Multiple package versions were tried, all from conda, e.g. `conda create -n arrowcrash python=3.11 pyarrow=12.0.1 pandas`, on regular Linux x86_64 with many cores.
   
   Limiting the process to a few cores using `taskset` reduces the number of crashes too.
   
   Downgrading to pyarrow=10.0.1 fixes the issue as well.
   
   Some sanitized gdb output, maybe it helps:
   ```
   (gdb) bt
   #0  0x00007ffff6e9037f in raise () from /lib64/libc.so.6
   #1  0x00007ffff6e7adb5 in abort () from /lib64/libc.so.6
   #2  0x00007ffff4ec0ed0 in __gnu_cxx::__verbose_terminate_handler () at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1685813977163/work/build/x86_64-conda-linux-gnu/libstdc++-v3/libsupc++/vterminate.cc:95
   #3  0x00007ffff4ebf40c in __cxxabiv1::__terminate (handler=<optimized out>) at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1685813977163/work/build/x86_64-conda-linux-gnu/libstdc++-v3/libsupc++/eh_terminate.cc:48
   #4  0x00007ffff4ebf45e in std::terminate () at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1685813977163/work/build/x86_64-conda-linux-gnu/libstdc++-v3/libsupc++/eh_terminate.cc:58
   #5  0x00007ffff4ebf0d9 in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=10, exception_class=0, ue_header=0x7ff979ffbd70, context=0x7ff979ff9990) at ../../../../libstdc++-v3/libsupc++/unwind-pe.h:681
   #6  0x00007ffff7e248ed in _Unwind_ForcedUnwind_Phase2 (exc=exc@entry=0x7ff979ffbd70, context=context@entry=0x7ff979ff9990, frames_p=frames_p@entry=0x7ff979ff9898) at ../../../libgcc/gthr-default.h:183
   #7  0x00007ffff7e24c50 in _Unwind_ForcedUnwind (exc=0x7ff979ffbd70, stop=0x7ffff7bc13c0 <unwind_stop>, stop_argument=<optimized out>) at ../../../libgcc/gthr-default.h:218
   #8  0x00007ffff7bc1556 in __pthread_unwind () from /lib64/libpthread.so.0
   #9  0x00007ffff7bb940b in pthread_exit () from /lib64/libpthread.so.0
   #10 0x00005555556e8bd2 in PyThread_exit_thread () at /usr/local/src/conda/python-3.11.4/Include/internal/object.h:366
   #11 0x0000555555647b09 in take_gil (tstate=<optimized out>) at /usr/local/src/conda/python-3.11.4/Programs/pystate.c:226
   #12 0x000055555573e352 in PyEval_RestoreThread (tstate=0x7ff9680056a0) at /usr/local/src/conda/python-3.11.4/Programs/ceval_gil.h:535
   #13 0x00005555558249cd in PyGILState_Ensure () at /usr/local/src/conda/python-3.11.4/Modules/obmalloc.c:1708
   #14 0x00007ffff689e49e in arrow::py::NumPyBuffer::~NumPyBuffer() () from /miniconda3/envs/arrowcrash/lib/python3.11/site-packages/pyarrow/libarrow_python.so
   #15 0x00007ffff6878a53 in std::_Sp_counted_ptr_inplace<arrow::ArrayData, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
      from /miniconda3/envs/arrowcrash/lib/python3.11/site-packages/pyarrow/libarrow_python.so
   #16 0x00007ffff5452c22 in arrow::SimpleRecordBatch::~SimpleRecordBatch() () from /miniconda3/envs/arrowcrash/lib/python3.11/site-packages/pyarrow/../../../libarrow.so.1100
   #17 0x00007ffb186e4ca2 in arrow::dataset::InMemoryFragment::~InMemoryFragment() () from /miniconda3/envs/arrowcrash/lib/python3.11/site-packages/pyarrow/../../../libarrow_dataset.so.1100
   #18 0x00007ffb186e3dda in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release_last_use_cold() () from /miniconda3/envs/arrowcrash/lib/python3.11/site-packages/pyarrow/../../../libarrow_dataset.so.1100
   #19 0x00007ffb186e31d2 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() [clone .part.0] () from /miniconda3/envs/arrowcrash/lib/python3.11/site-packages/pyarrow/../../../libarrow_dataset.so.1100
   #20 0x00007ffb186e7800 in std::_Function_handler<arrow::Future<std::shared_ptr<arrow::RecordBatch> > (), arrow::dataset::InMemoryFragment::ScanBatchesAsync(std::shared_ptr<arrow::dataset::ScanOptions> const&)::Generator>::_M_manager(std::_Any_data&, std::_Any_data const&, std::_Manager_operation) () from /miniconda3/envs/arrowcrash/lib/python3.11/site-packages/pyarrow/../../../libarrow_dataset.so.1100
   #21 0x00007ffb18740f19 in std::_Sp_counted_ptr_inplace<arrow::DefaultIfEmptyGenerator<std::shared_ptr<arrow::RecordBatch> >::State, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
      from /miniconda3/envs/arrowcrash/lib/python3.11/site-packages/pyarrow/../../../libarrow_dataset.so.1100
   ```
   
   ```
   [Thread debugging using libthread_db enabled]
   Using host libthread_db library "/lib64/libthread_db.so.1".
   [New Thread 0x7ffff21ff700 (LWP 2586858)]
   [New Thread 0x7ffff12d0700 (LWP 2587561)]
   ...
   [New Thread 0x7ff9797fa700 (LWP 2587921)]
   [New Thread 0x7ff963fff700 (LWP 2587922)]
   [New Thread 0x7ff9637fe700 (LWP 2587923)]
   terminate called without an active exception
   
   Thread ... "python3.11" received signal SIGABRT, Aborted.
   [Switching to Thread 0x7ff979ffb700 (LWP 2587920)]
   0x00007ffff6e9037f in raise () from /lib64/libc.so.6
   ```
   
   
   ### Component(s)
   
   Parquet, Python


[GitHub] [arrow] mapleFU commented on issue #36389: [Python][Parquet] pq.write_to_dataset crash from numpy

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #36389:
URL: https://github.com/apache/arrow/issues/36389#issuecomment-1727768699

   @alippai have you found out the reason or made any workaround?


[GitHub] [arrow] westonpace commented on issue #36389: [Python][Parquet] pq.write_to_dataset crash from numpy

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #36389:
URL: https://github.com/apache/arrow/issues/36389#issuecomment-1614872958

   I believe it's crashing on exit because the scanner is still running after the `write_to_dataset` call.  This has been a long-standing bug that I haven't been able to find time to work on.
   
   It only affects numpy-backed data because normally the scanner is just running destructors and cleaning up its objects, which is harmless.  However, when the buffers are sourced from numpy, the GIL must be obtained as part of the buffer destruction.  That attempt to obtain the GIL happens after Python has already begun finalizing, which causes the crash.
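   
   A minimal sketch of the mechanism as I understand it (zero-copy conversion is the norm for float64 numpy data; the `PyGILState_Ensure` detail matches frames #13/#14 of the backtrace above):
   ```python
   import numpy as np
   import pyarrow as pa
   
   np_arr = np.zeros(100)
   pa_arr = pa.array(np_arr)  # zero-copy: the Arrow buffer holds a reference
                              # to the numpy array instead of copying the data
   
   # Destroying the Arrow data must Py_DECREF that numpy reference, which
   # requires the GIL (NumPyBuffer's destructor calls PyGILState_Ensure).
   # Here the destructor runs on the main thread with the GIL held, so it is
   # safe. On a scanner thread after finalization has begun, take_gil instead
   # exits the thread; the forced unwind then hits noexcept C++ frames and
   # std::terminate() aborts the process.
   del pa_arr
   ```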


[GitHub] [arrow] jorisvandenbossche commented on issue #36389: [Python][Parquet] pq.write_to_dataset crash from numpy

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #36389:
URL: https://github.com/apache/arrow/issues/36389#issuecomment-1614204133

   FWIW, I can't reproduce this locally, neither with one of my existing environments nor with a freshly created one using `conda create -n arrowcrash python=3.11 pyarrow=12.0.1 pandas` (on regular Linux / Ubuntu, but on a laptop with only a few cores).


Re: [I] [Python][Parquet] pq.write_to_dataset crash from numpy [arrow]

Posted by "alippai (via GitHub)" <gi...@apache.org>.
alippai commented on issue #36389:
URL: https://github.com/apache/arrow/issues/36389#issuecomment-1753924990

   @mapleFU not a real workaround, only making sure that exit is slowed down by adding a delay :/


[GitHub] [arrow] alippai commented on issue #36389: [Python][Parquet] pq.write_to_dataset crash from numpy

Posted by "alippai (via GitHub)" <gi...@apache.org>.
alippai commented on issue #36389:
URL: https://github.com/apache/arrow/issues/36389#issuecomment-1614821264

   @jorisvandenbossche thanks for checking. I've managed to reproduce the error from nix too (Python 3.10, pyarrow 12.0).
   There is a good chance it's because of the high number of cores.
   
   Do you think this can be a symptom of: https://github.com/apache/arrow/issues/33765 ?


Re: [I] [Python][Parquet] pq.write_to_dataset crash from numpy [arrow]

Posted by "jmahlik (via GitHub)" <gi...@apache.org>.
jmahlik commented on issue #36389:
URL: https://github.com/apache/arrow/issues/36389#issuecomment-1966865560

   Just realized I may have accidentally duplicated this with https://github.com/apache/arrow/issues/36980. There's a [docker reproducer](https://github.com/apache/arrow/issues/36980#issue-1831489065) in that issue.  There's a [workaround and possible solution](https://github.com/apache/arrow/issues/36980#issuecomment-1892559630) as well. One can run a full `gc.collect` to ensure the numpy buffer's destructor is called prior to the interpreter finalizing so it doesn't attempt to acquire the GIL.
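   
   A sketch of that workaround applied to the reproducer from this issue (the only addition is the final `gc.collect()`):
   ```python
   import gc
   
   import pandas as pd
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   df = pd.DataFrame({"a": [0.0] * 100})
   pq.write_to_dataset(pa.Table.from_pandas(df), "/tmp/dump", use_threads=False)
   
   # Force a full collection so numpy-backed buffers are destroyed while the
   # interpreter is still fully alive, not during finalization on a C++ thread.
   gc.collect()
   ```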


Re: [I] [Python][Parquet] pq.write_to_dataset crash from numpy [arrow]

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #36389:
URL: https://github.com/apache/arrow/issues/36389#issuecomment-1853271850

   @jorisvandenbossche Hmm, I'm not familiar with the Python part, would you mind taking a look?


Re: [I] [Python][Parquet] pq.write_to_dataset crash from numpy [arrow]

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #36389:
URL: https://github.com/apache/arrow/issues/36389#issuecomment-1853820548

   As mentioned above, I couldn't reproduce this locally, which makes it harder to debug or try to fix. But I think @westonpace's comment is quite a probable explanation of the cause:
   
   > the scanner is still running after the `write_to_dataset` call. This has been a long-standing bug that I haven't been able to find time to work on.
   
   I don't know if that long-standing bug is also captured in a Scanner-specific issue?
   


Re: [I] [Python][Parquet] pq.write_to_dataset crash from numpy [arrow]

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #36389:
URL: https://github.com/apache/arrow/issues/36389#issuecomment-1855790804

   > I don't know if that long-standing bug is also captured in a Scanner specific issue?
   
   I'll check the unfinished scanner. Just a naive question:
   
   ```
   scanner = ...
   scanner.ScanSomeRows()
   exit
   ```
   
   when it exits, would it first call the dtor of the scanner?

