You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2021/03/16 15:46:00 UTC
[jira] [Commented] (ARROW-11983) [Python] ImportError calling pyarrow from_pandas within ThreadPool

    [ https://issues.apache.org/jira/browse/ARROW-11983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302635#comment-17302635 ] 

Antoine Pitrou commented on ARROW-11983:
----------------------------------------

Related upstream issues: [https://bugs.python.org/issue35943] and [https://bugs.python.org/issue43515]

For now, it should be enough to add the following code at the toplevel:
{code:python}
from concurrent.futures.thread import ThreadPool
{code}


> [Python] ImportError calling pyarrow from_pandas within ThreadPool
> ------------------------------------------------------------------
>
>                 Key: ARROW-11983
>                 URL: https://issues.apache.org/jira/browse/ARROW-11983
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Assignee: Joris Van den Bossche
>            Priority: Major
>             Fix For: 4.0.0
>
>
> From https://github.com/dask/dask/issues/7334
> The referenced issue report is about an ImportError they get using Python 3.9 (and I can reproduce it). As far as I know how dask works, it's basically calling `pa.Table.from_pandas` within a ThreadPool, and inside `from_pandas` we do a `with futures.ThreadPoolExecutor`, which then fails with this error:
> {code}
> >>> df2.to_parquet('test99.parquet', engine='pyarrow-dataset')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/dataframe/core.py", line 4127, in to_parquet
>     return to_parquet(self, path, *args, **kwargs)
>   File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/dataframe/io/parquet/core.py", line 671, in to_parquet
>     out = out.compute(**compute_kwargs)
>   File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/base.py", line 283, in compute
>     (result,) = compute(self, traverse=False, **kwargs)
>   File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/base.py", line 565, in compute
>     results = schedule(dsk, keys, **kwargs)
>   File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/threaded.py", line 76, in get
>     results = get_async(
>   File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/local.py", line 487, in get_async
>     raise_exception(exc, tb)
>   File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/local.py", line 317, in reraise
>     raise exc
>   File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/local.py", line 222, in execute_task
>     result = _execute_task(task, data)
>   File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/core.py", line 121, in _execute_task
>     return func(*(_execute_task(a, cache) for a in args))
>   File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/utils.py", line 35, in apply
>     return func(*args, **kwargs)
>   File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py", line 841, in write_partition
>     t = cls._pandas_to_arrow_table(df, preserve_index=preserve_index, schema=schema)
>   File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py", line 814, in _pandas_to_arrow_table
>     table = pa.Table.from_pandas(df, preserve_index=preserve_index, schema=schema)
>   File "pyarrow/table.pxi", line 1479, in pyarrow.lib.Table.from_pandas
>   File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 596, in dataframe_to_arrays
>     with futures.ThreadPoolExecutor(nthreads) as executor:
>   File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/concurrent/futures/__init__.py", line 49, in __getattr__
>     from .thread import ThreadPoolExecutor as te
> ImportError: cannot import name 'ThreadPoolExecutor' from partially initialized module 'concurrent.futures.thread' (most likely due to a circular import) (/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/concurrent/futures/thread.py)
> {code}
> We can probably avoid that by moving the import top-level (not inline inside dataframe_to_arrays)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)