Posted to jira@arrow.apache.org by "Krisztian Szucs (Jira)" <ji...@apache.org> on 2020/09/28 09:24:00 UTC

[jira] [Updated] (ARROW-9976) [Python] ArrowCapacityError when doing Table.from_pandas with large dataframe

     [ https://issues.apache.org/jira/browse/ARROW-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Krisztian Szucs updated ARROW-9976:
-----------------------------------
    Fix Version/s: 2.0.0

> [Python] ArrowCapacityError when doing Table.from_pandas with large dataframe
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-9976
>                 URL: https://issues.apache.org/jira/browse/ARROW-9976
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: quentin lhoest
>            Assignee: Krisztian Szucs
>            Priority: Minor
>             Fix For: 2.0.0
>
>
> When calling Table.from_pandas() on a large dataframe that has a column of vectors (np.array), an `ArrowCapacityError` is raised.
> To reproduce:
> {code:python}
> import pandas as pd
> import numpy as np
> import pyarrow as pa
>
> # Column "a" holds n vectors of 128 float64 values each.
> n = 1713614
> df = pd.DataFrame.from_dict({"a": list(np.zeros((n, 128))), "b": range(n)})
> pa.Table.from_pandas(df)
> {code}
> The same code succeeds with a smaller n.
> Error raised:
> {noformat}
> ---------------------------------------------------------------------------
> ArrowCapacityError                        Traceback (most recent call last)
> <ipython-input-7-1a7b68a179a0> in <module>
> ----> 1 _ = pa.Table.from_pandas(df)
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
>     591         for i, maybe_fut in enumerate(arrays):
>     592             if isinstance(maybe_fut, futures.Future):
> --> 593                 arrays[i] = maybe_fut.result()
>     594 
>     595     types = [x.type for x in arrays]
> ~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
>     423                 raise CancelledError()
>     424             elif self._state == FINISHED:
> --> 425                 return self.__get_result()
>     426 
>     427             self._condition.wait(timeout)
> ~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
>     382     def __get_result(self):
>     383         if self._exception:
> --> 384             raise self._exception
>     385         else:
>     386             return self._result
> ~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py in run(self)
>      55 
>      56         try:
> ---> 57             result = self.fn(*self.args, **self.kwargs)
>      58         except BaseException as exc:
>      59             self.future.set_exception(exc)
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/pandas_compat.py in convert_column(col, field)
>     557 
>     558         try:
> --> 559             result = pa.array(col, type=type_, from_pandas=True, safe=safe)
>     560         except (pa.ArrowInvalid,
>     561                 pa.ArrowNotImplementedError,
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.array()
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowCapacityError: List array cannot contain more than 2147483646 child elements, have 2147483648
> {noformat}
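> For context: the cap in the message comes from Arrow's default list type, whose 32-bit signed offsets limit a single list array to 2147483646 child elements. One possible workaround is to pass an explicit schema that maps column "a" to large_list, which uses 64-bit offsets. A minimal sketch, assuming the pandas converter in the installed pyarrow accepts large_list for an object column of numpy vectors (this may depend on the version):
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
>
> n = 1713614
> df = pd.DataFrame.from_dict({"a": list(np.zeros((n, 128))), "b": range(n)})
>
> # large_list stores offsets as int64, so the 32-bit child-element
> # cap of the default list type does not apply to column "a".
> schema = pa.schema([
>     pa.field("a", pa.large_list(pa.float64())),
>     pa.field("b", pa.int64()),
> ])
> table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
> {code}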
> I guess one needs to chunk the data before creating the arrays?
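> That chunking idea can be sketched as follows (the helper name and chunk_size are illustrative, not part of pyarrow): convert the dataframe slice by slice and combine the per-slice tables with pa.concat_tables(), so each column ends up chunked and no single list array has to hold all the child elements:
> {code:python}
> import pyarrow as pa
>
> def table_from_pandas_chunked(df, chunk_size=100_000):
>     # Convert each slice separately so no single list array is
>     # built over the whole column, then stitch the pieces into
>     # one table whose columns are chunked arrays.
>     pieces = [
>         pa.Table.from_pandas(df.iloc[i:i + chunk_size], preserve_index=False)
>         for i in range(0, len(df), chunk_size)
>     ]
>     return pa.concat_tables(pieces)
>
> # e.g. with the dataframe from the repro above:
> # table = table_from_pandas_chunked(df)
> {code}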



--
This message was sent by Atlassian Jira
(v8.3.4#803005)