Posted to jira@arrow.apache.org by "quentin lhoest (Jira)" <ji...@apache.org> on 2020/09/11 15:25:00 UTC

[jira] [Created] (ARROW-9976) [Python] ArrowCapacityError when doing Table.from_pandas with large dataframe

quentin lhoest created ARROW-9976:
-------------------------------------

             Summary: [Python] ArrowCapacityError when doing Table.from_pandas with large dataframe
                 Key: ARROW-9976
                 URL: https://issues.apache.org/jira/browse/ARROW-9976
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 1.0.1
            Reporter: quentin lhoest


When calling Table.from_pandas() on a large dataframe that contains a column of vectors (np.array), an `ArrowCapacityError` is raised.

To reproduce:

{code:python}
import pandas as pd
import numpy as np
import pyarrow as pa

n = 1713614
df = pd.DataFrame.from_dict({"a": list(np.zeros((n, 128))), "b": range(n)})
pa.Table.from_pandas(df)
{code}

With a smaller n, it works.

Error raised:

{noformat}
---------------------------------------------------------------------------
ArrowCapacityError                        Traceback (most recent call last)
<ipython-input-7-1a7b68a179a0> in <module>
----> 1 _ = pa.Table.from_pandas(df)

~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()

~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
    591         for i, maybe_fut in enumerate(arrays):
    592             if isinstance(maybe_fut, futures.Future):
--> 593                 arrays[i] = maybe_fut.result()
    594 
    595     types = [x.type for x in arrays]

~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
    423                 raise CancelledError()
    424             elif self._state == FINISHED:
--> 425                 return self.__get_result()
    426 
    427             self._condition.wait(timeout)

~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py in run(self)
     55 
     56         try:
---> 57             result = self.fn(*self.args, **self.kwargs)
     58         except BaseException as exc:
     59             self.future.set_exception(exc)

~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/pandas_compat.py in convert_column(col, field)
    557 
    558         try:
--> 559             result = pa.array(col, type=type_, from_pandas=True, safe=safe)
    560         except (pa.ArrowInvalid,
    561                 pa.ArrowNotImplementedError,

~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowCapacityError: List array cannot contain more than 2147483646 child elements, have 2147483648
{noformat}



I guess one needs to chunk the data before creating the arrays?
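
For what it's worth, here is a rough sketch of two possible workarounds, assuming the large_list conversion path and chunk-wise conversion behave as expected in 1.0.1 (I have not verified this at this scale, and the chunk size below is arbitrary):

{code:python}
import pandas as pd
import numpy as np
import pyarrow as pa

n = 1713614
df = pd.DataFrame.from_dict({"a": list(np.zeros((n, 128))), "b": range(n)})

# Workaround sketch 1 (unverified): declare the vector column as large_list,
# which uses 64-bit offsets instead of 32-bit ones.
schema = pa.schema([
    ("a", pa.large_list(pa.float64())),
    ("b", pa.int64()),
])
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)

# Workaround sketch 2 (unverified): convert the dataframe in slices so each
# resulting array stays below the 2^31 child-element limit, then combine the
# chunks into a single (chunked) table.
chunk_size = 100_000
tables = [
    pa.Table.from_pandas(df.iloc[i:i + chunk_size], preserve_index=False)
    for i in range(0, n, chunk_size)
]
table = pa.concat_tables(tables)
{code}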


