Posted to jira@arrow.apache.org by "Micah Kornfield (Jira)" <ji...@apache.org> on 2021/02/14 21:26:00 UTC
[jira] [Comment Edited] (ARROW-11607) [Python] Error when reading table with list values from parquet
[ https://issues.apache.org/jira/browse/ARROW-11607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284499#comment-17284499 ]
Micah Kornfield edited comment on ARROW-11607 at 2/14/21, 9:25 PM:
-------------------------------------------------------------------
So I now have a repro in C++. Unfortunately, most of our unit tests are written against an API that doesn't use RecordBatchReader, and thus this edge case wasn't caught there.
was (Author: emkornfield):
So I now have a repro in C++ Unfortunately most of our unit tests are written against an API that doesn't use RecordBatchReader and thus row size.
> [Python] Error when reading table with list values from parquet
> ---------------------------------------------------------------
>
> Key: ARROW-11607
> URL: https://issues.apache.org/jira/browse/ARROW-11607
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 1.0.0, 1.0.1, 2.0.0, 3.0.0
> Environment: Python 3.7
> Reporter: Michal Glaus
> Assignee: Micah Kornfield
> Priority: Major
> Fix For: 4.0.0
>
>
> I'm getting unexpected results when reading tables containing list values and a large number of rows from a parquet file.
> Example code (pyarrow 2.0.0 and 3.0.0):
> {code:java}
> from pyarrow import parquet, Table
> data = [None] * (1 << 20)
> data.append([1])
> table = Table.from_arrays([data], ['column'])
> print('Expected: %s' % table['column'][-1])
> parquet.write_table(table, 'table.parquet')
> table2 = parquet.read_table('table.parquet')
> print('Actual: %s' % table2['column'][-1]){code}
> Output:
> {noformat}
> Expected: [1]
> Actual: [0]{noformat}
> When I decrease the number of rows by 1 (by using (1 << 20) - 1), I get:
> {noformat}
> Expected: [1]
> Actual: [1]{noformat}
> For pyarrow 1.0.1 and 1.0.0, the threshold number of rows is 1 << 15.
> It seems that this is caused by some overflow and memory corruption, because in pyarrow 3.0.0, with more complex values (a list of dictionaries containing a float and a datetime):
> {noformat}
> from datetime import datetime
> data.append([{'a': 0.1, 'b': datetime.now()}])
> {noformat}
> I'm getting this exception after calling table2.to_pandas():
> {noformat}
> /arrow/cpp/src/arrow/memory_pool.cc:501: Internal error: cannot create default memory pool{noformat}
>
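The exact power-of-two thresholds reported above (1 << 20 on pyarrow 2.0/3.0, 1 << 15 on 1.0.x) are consistent with a fixed-width integer wrapping somewhere in the read or write path. Purely as an illustration of that failure mode (this is not the actual Arrow internals), a signed 16-bit counter wraps exactly at 1 << 15:

```python
import ctypes

# Largest value a signed 16-bit integer can represent:
ok = ctypes.c_int16((1 << 15) - 1).value   # 32767, still representable

# One past the threshold and the value wraps to the minimum negative value,
# the classic symptom behind "works at N - 1 rows, breaks at N" bugs:
wrapped = ctypes.c_int16(1 << 15).value    # -32768

print(ok, wrapped)
```

The same wrap-around happens at 1 << 31 for 32-bit offsets, which is why thresholds that are exact powers of two are a strong hint of overflow rather than, say, a data-dependent corruption.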
--
This message was sent by Atlassian Jira
(v8.3.4#803005)