Posted to issues@arrow.apache.org by "Michal Glaus (Jira)" <ji...@apache.org> on 2021/02/12 12:37:00 UTC
[jira] [Created] (ARROW-11607) [Python] Error when reading table with list values from parquet
Michal Glaus created ARROW-11607:
------------------------------------
Summary: [Python] Error when reading table with list values from parquet
Key: ARROW-11607
URL: https://issues.apache.org/jira/browse/ARROW-11607
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 3.0.0, 2.0.0, 1.0.1, 1.0.0
Environment: Python 3.7
Reporter: Michal Glaus
I'm getting unexpected results when reading tables containing list values and a large number of rows from a parquet file.
Example code (pyarrow 2.0.0 and 3.0.0):
{code:python}
from pyarrow import parquet, Table

data = [None] * (1 << 20)
data.append([1])
table = Table.from_arrays([data], ['column'])
print('Expected: %s' % table['column'][-1])
parquet.write_table(table, 'table.parquet')
table2 = parquet.read_table('table.parquet')
print('Actual: %s' % table2['column'][-1]){code}
Output:
{noformat}
Expected: [1]
Actual: [0]{noformat}
When I decrease the number of rows by 1 (by using (1 << 20) - 1), I get:
{noformat}
Expected: [1]
Actual: [1]{noformat}
For pyarrow 1.0.1 and 1.0.0, the threshold number of rows is 1 << 15.
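Since the failure appears to be monotone in the row count (everything below the threshold round-trips correctly, everything at or above it does not), the exact threshold on a given pyarrow version can be located by bisection. The sketch below is not part of the original report; {{fails}} is a stand-in predicate that, in a real run, would perform the parquet round trip shown above and compare the last value:

{code:python}
def smallest_failing_n(fails, lo, hi):
    """Binary search for the smallest n in (lo, hi] where fails(n) is True.

    Assumes fails is monotone: once it fails for some n, it fails for
    every larger n, which matches the threshold behaviour observed above.
    Requires fails(lo) is False and fails(hi) is True.
    """
    assert not fails(lo) and fails(hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid      # mid fails, so the threshold is at or below mid
        else:
            lo = mid      # mid succeeds, so the threshold is above mid
    return hi

# Stand-in predicate: pretend the round trip breaks at 1 << 20 rows.
# A real predicate would build, write, and re-read the table as above.
threshold = smallest_failing_n(lambda n: n >= (1 << 20), 1, 1 << 22)
print(threshold)  # 1048576, i.e. 1 << 20
{code}

Running the same search against pyarrow 1.0.x should converge on 1 << 15 instead.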
It seems that this is caused by some overflow and memory corruption, because in pyarrow 3.0.0, when I use more complex values (a list of dictionaries containing a float and a datetime):
{code:python}
from datetime import datetime

data.append([{'a': 0.1, 'b': datetime.now()}])
{code}
I'm getting this exception after calling table2.to_pandas():
{noformat}
/arrow/cpp/src/arrow/memory_pool.cc:501: Internal error: cannot create default memory pool{noformat}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)