
Posted to jira@arrow.apache.org by "Micah Kornfield (Jira)" <ji...@apache.org> on 2021/02/13 17:48:00 UTC

[jira] [Assigned] (ARROW-11607) [Python] Error when reading table with list values from parquet

     [ https://issues.apache.org/jira/browse/ARROW-11607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Micah Kornfield reassigned ARROW-11607:
---------------------------------------

    Assignee: Micah Kornfield

> [Python] Error when reading table with list values from parquet
> ---------------------------------------------------------------
>
>                 Key: ARROW-11607
>                 URL: https://issues.apache.org/jira/browse/ARROW-11607
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.0, 1.0.1, 2.0.0, 3.0.0
>         Environment: Python 3.7
>            Reporter: Michal Glaus
>            Assignee: Micah Kornfield
>            Priority: Major
>             Fix For: 4.0.0
>
>
> I'm getting unexpected results when reading tables containing list values and a large number of rows from a parquet file.
> Example code (pyarrow 2.0.0 and 3.0.0):
> {code:python}
> from pyarrow import parquet, Table
> data = [None] * (1 << 20)
> data.append([1])
> table = Table.from_arrays([data], ['column'])
> print('Expected: %s' % table['column'][-1])
> parquet.write_table(table, 'table.parquet')
> table2 = parquet.read_table('table.parquet')
> print('Actual:   %s' % table2['column'][-1]){code}
> Output:
> {noformat}
> Expected: [1]
> Actual:   [0]{noformat}
> When I decrease the number of rows by 1 (by using (1 << 20) - 1), I get:
> {noformat}
> Expected: [1]
> Actual:   [1]{noformat}
> For pyarrow 1.0.1 and 1.0.0, the threshold number of rows is 1 << 15.
> This looks like an overflow that leads to memory corruption: in pyarrow 3.0.0, when the last row holds a more complex value (a list of dictionaries with a float and a datetime):
> {noformat}
> data.append([{'a': 0.1, 'b': datetime.now()}])
> {noformat}
> I get this exception after calling table2.to_pandas():
> {noformat}
> /arrow/cpp/src/arrow/memory_pool.cc:501: Internal error: cannot create default memory pool{noformat}
>  
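The reproduction above can be wrapped into a small check that is easier to rerun across row counts and pyarrow versions. This is only a sketch: the helper names (`make_column`, `last_value_round_trips`, `REPORTED_THRESHOLDS`) are hypothetical and not part of pyarrow, the threshold numbers are copied from the report rather than derived, and the file is written to an in-memory buffer instead of 'table.parquet'.

```python
# Null-prefix lengths at which the report says the last row starts to come
# back wrong. Copied from the report above, not computed from pyarrow.
REPORTED_THRESHOLDS = {
    "1.0.0": 1 << 15,
    "1.0.1": 1 << 15,
    "2.0.0": 1 << 20,
    "3.0.0": 1 << 20,
}


def make_column(num_nulls, last_row=None):
    """Return `num_nulls` null rows followed by one list-valued row."""
    if last_row is None:
        last_row = [1]
    return [None] * num_nulls + [last_row]


def last_value_round_trips(num_nulls, last_row=None):
    """Write the column to parquet in memory, read it back, compare the last row."""
    # Imported here so the helpers above stay usable without pyarrow installed.
    import pyarrow as pa
    from pyarrow import parquet

    table = pa.Table.from_arrays(
        [pa.array(make_column(num_nulls, last_row))], ["column"]
    )

    # Write to an in-memory buffer rather than a file on disk.
    sink = pa.BufferOutputStream()
    parquet.write_table(table, sink)
    table2 = parquet.read_table(pa.BufferReader(sink.getvalue()))

    return table2["column"][-1].as_py() == table["column"][-1].as_py()
```

Going by the report, `last_value_round_trips(1 << 20)` would be expected to return False on pyarrow 2.0.0 and 3.0.0, and `last_value_round_trips((1 << 20) - 1)` to return True. The more complex case can be tried by passing `last_row=[{'a': 0.1, 'b': datetime.now()}]`.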


