Posted to jira@arrow.apache.org by "Krisztian Szucs (Jira)" <ji...@apache.org> on 2020/09/14 11:34:00 UTC
[jira] [Assigned] (ARROW-9976) [Python] ArrowCapacityError when doing Table.from_pandas with large dataframe
[ https://issues.apache.org/jira/browse/ARROW-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Krisztian Szucs reassigned ARROW-9976:
--------------------------------------
Assignee: Krisztian Szucs
> [Python] ArrowCapacityError when doing Table.from_pandas with large dataframe
> -----------------------------------------------------------------------------
>
> Key: ARROW-9976
> URL: https://issues.apache.org/jira/browse/ARROW-9976
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 1.0.1
> Reporter: quentin lhoest
> Assignee: Krisztian Szucs
> Priority: Minor
>
> When calling Table.from_pandas() on a large dataframe containing a column of vectors (np.array), an `ArrowCapacityError` is raised.
> To reproduce:
> {code:python}
> import pandas as pd
> import numpy as np
> import pyarrow as pa
> n = 1713614
> df = pd.DataFrame.from_dict({"a": list(np.zeros((n, 128))), "b": range(n)})
> pa.Table.from_pandas(df)
> {code}
> The conversion succeeds with a smaller n.
> Error raised:
> {noformat}
> ---------------------------------------------------------------------------
> ArrowCapacityError Traceback (most recent call last)
> <ipython-input-7-1a7b68a179a0> in <module>
> ----> 1 _ = pa.Table.from_pandas(df)
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
> 591 for i, maybe_fut in enumerate(arrays):
> 592 if isinstance(maybe_fut, futures.Future):
> --> 593 arrays[i] = maybe_fut.result()
> 594
> 595 types = [x.type for x in arrays]
> ~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
> 423 raise CancelledError()
> 424 elif self._state == FINISHED:
> --> 425 return self.__get_result()
> 426
> 427 self._condition.wait(timeout)
> ~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
> 382 def __get_result(self):
> 383 if self._exception:
> --> 384 raise self._exception
> 385 else:
> 386 return self._result
> ~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py in run(self)
> 55
> 56 try:
> ---> 57 result = self.fn(*self.args, **self.kwargs)
> 58 except BaseException as exc:
> 59 self.future.set_exception(exc)
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/pandas_compat.py in convert_column(col, field)
> 557
> 558 try:
> --> 559 result = pa.array(col, type=type_, from_pandas=True, safe=safe)
> 560 except (pa.ArrowInvalid,
> 561 pa.ArrowNotImplementedError,
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.array()
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowCapacityError: List array cannot contain more than 2147483646 child elements, have 2147483648
> {noformat}
> I guess one needs to chunk the data before creating the arrays?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)