Posted to dev@arrow.apache.org by "Kevin Glasson (Jira)" <ji...@apache.org> on 2020/05/22 08:20:00 UTC

[jira] [Created] (ARROW-8888) Heuristic in dataframe_to_arrays that decides to multithread convert cause slow conversions

Kevin Glasson created ARROW-8888:
------------------------------------

             Summary: Heuristic in dataframe_to_arrays that decides to multithread convert cause slow conversions
                 Key: ARROW-8888
                 URL: https://issues.apache.org/jira/browse/ARROW-8888
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.16.0
         Environment: MacOS: 10.15.4 (Also happening on windows 10)
Python: 3.7.3
Pyarrow: 0.16.0
Pandas: 0.25.3
            Reporter: Kevin Glasson


When pa.Table.from_pandas() takes the code path that uses the ThreadPoolExecutor in dataframe_to_arrays (which Table.from_pandas calls), the conversion is much slower than the single-threaded path: about five times slower in the example below.

 
The example below is a simple one; the time difference is much worse with a real table.


Python 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:54:18)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.13.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pyarrow as pa

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({"A": [0] * 10000000})

In [4]: %timeit table = pa.Table.from_pandas(df)
577 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit table = pa.Table.from_pandas(df, nthreads=1)
106 µs ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
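
The kind of overhead seen above can be illustrated without pyarrow. The following is only a sketch using the standard library, not pyarrow's actual implementation: cheap_convert and convert_columns are hypothetical names, and the nthreads handling is an assumption made for illustration. It compares calling a near-free per-column function directly with dispatching the same call through a ThreadPoolExecutor.

```python
import timeit
from concurrent.futures import ThreadPoolExecutor


def cheap_convert(column):
    # Hypothetical stand-in for a very cheap per-column conversion.
    return len(column)


def convert_columns(columns, nthreads=1):
    # Hypothetical dispatcher (not pyarrow's code): run serially when
    # nthreads == 1, otherwise hand each column to a thread pool.
    if nthreads == 1:
        return [cheap_convert(c) for c in columns]
    with ThreadPoolExecutor(max_workers=nthreads) as executor:
        return list(executor.map(cheap_convert, columns))


# A single column, mirroring the one-column DataFrame in the session above.
columns = [[0] * 1000]

runs = 200
serial = timeit.timeit(lambda: convert_columns(columns, nthreads=1), number=runs)
pooled = timeit.timeit(lambda: convert_columns(columns, nthreads=4), number=runs)
print(f"serial: {serial / runs * 1e6:.1f} µs per call")
print(f"pooled: {pooled / runs * 1e6:.1f} µs per call")
```

When the per-column work is this small, the cost of creating the pool and scheduling the task is likely to dominate, so the pooled path would be expected to come out slower. Exact numbers depend on the machine.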



--
This message was sent by Atlassian Jira
(v8.3.4#803005)