Posted to jira@arrow.apache.org by "quentin lhoest (Jira)" <ji...@apache.org> on 2020/09/11 15:25:00 UTC
[jira] [Created] (ARROW-9976) [Python] ArrowCapacityError when doing Table.from_pandas with large dataframe
quentin lhoest created ARROW-9976:
-------------------------------------
Summary: [Python] ArrowCapacityError when doing Table.from_pandas with large dataframe
Key: ARROW-9976
URL: https://issues.apache.org/jira/browse/ARROW-9976
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 1.0.1
Reporter: quentin lhoest
When calling Table.from_pandas() on a large DataFrame that has a column of vectors (np.array), an `ArrowCapacityError` is raised.
To reproduce:
{code:python}
import pandas as pd
import numpy as np
import pyarrow as pa
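n = 1713614
# Column "a" holds one 128-element float64 vector per row; "b" is a plain integer column.
df = pd.DataFrame.from_dict({"a": list(np.zeros((n, 128))), "b": range(n)})
# This conversion is the call that fails.
pa.Table.from_pandas(df)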
{code}
With a smaller n it works.
Error raised:
{noformat}
---------------------------------------------------------------------------
ArrowCapacityError Traceback (most recent call last)
<ipython-input-7-1a7b68a179a0> in <module>
----> 1 _ = pa.Table.from_pandas(df)
~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()
~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
591 for i, maybe_fut in enumerate(arrays):
592 if isinstance(maybe_fut, futures.Future):
--> 593 arrays[i] = maybe_fut.result()
594
595 types = [x.type for x in arrays]
~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
423 raise CancelledError()
424 elif self._state == FINISHED:
--> 425 return self.__get_result()
426
427 self._condition.wait(timeout)
~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py in run(self)
55
56 try:
---> 57 result = self.fn(*self.args, **self.kwargs)
58 except BaseException as exc:
59 self.future.set_exception(exc)
~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/pandas_compat.py in convert_column(col, field)
557
558 try:
--> 559 result = pa.array(col, type=type_, from_pandas=True, safe=safe)
560 except (pa.ArrowInvalid,
561 pa.ArrowNotImplementedError,
~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.array()
~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()
~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowCapacityError: List array cannot contain more than 2147483646 child elements, have 2147483648
{noformat}
I guess one needs to chunk the data before creating the arrays?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)