Posted to issues@arrow.apache.org by "Robin Kåveland (JIRA)" <ji...@apache.org> on 2019/08/02 22:27:00 UTC

[jira] [Comment Edited] (ARROW-6060) [Python] too large memory cost using pyarrow.parquet.read_table with use_threads=True

    [ https://issues.apache.org/jira/browse/ARROW-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899250#comment-16899250 ] 

Robin Kåveland edited comment on ARROW-6060 at 8/2/19 10:26 PM:
----------------------------------------------------------------

I've had to downgrade our VMs to 0.13.0 today: parquet files that we could previously load just fine with 16GB of RAM were failing to load on VMs with 28GB of RAM. Unfortunately, I can't disclose any of the data either. We are using {{parquet.ParquetDataset.read()}}, but we observe the problem even if we read single pieces of the parquet data sets (the pieces are between 100MB and 200MB). Most of our columns are unicode and would probably be good candidates for dictionary encoding. The files were written by Spark. Normally these datasets take a while to load, so memory consumption grows steadily for ~10 seconds, but now we invoke the OOM killer within only a few seconds, so allocation seems very spiky.
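
As a minimal sketch of the read pattern described above (the dataset path is made up, and this uses the ParquetDataset/pieces API as it exists in 0.13/0.14):

{code:python}
import pyarrow.parquet as pq

# Hypothetical path; the real datasets were written by Spark and split
# into pieces of roughly 100-200 MB each.
dataset = pq.ParquetDataset('/data/our_dataset')

# Reading everything at once is where we normally hit the OOM killer:
# table = dataset.read()

# Reading individual pieces still shows the spiky allocation on 0.14.x,
# even with threads disabled:
tables = [
    piece.read(use_threads=False, partitions=dataset.partitions)
    for piece in dataset.pieces
]
{code}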



> [Python] too large memory cost using pyarrow.parquet.read_table with use_threads=True
> -------------------------------------------------------------------------------------
>
>                 Key: ARROW-6060
>                 URL: https://issues.apache.org/jira/browse/ARROW-6060
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.14.1
>            Reporter: Kun Liu
>            Priority: Major
>
>  I tried to load a parquet file of about 1.8 GB using the following code. It crashed due to an out-of-memory issue.
> {code:java}
> import pyarrow.parquet as pq
> pq.read_table('/tmp/test.parquet'){code}
>  However, it worked well with use_threads=False, as follows:
> {code:java}
> pq.read_table('/tmp/test.parquet', use_threads=False){code}
> If pyarrow is downgraded to 0.12.1, there is no such problem.
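
For reference, a minimal sketch of the comparison described in the report, using the same /tmp/test.parquet path; note that pa.total_allocated_bytes() reports Arrow's current allocation rather than the process peak, so the numbers are only indicative:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Single-threaded read: the case reported to work within memory.
table = pq.read_table('/tmp/test.parquet', use_threads=False)
print('use_threads=False:', pa.total_allocated_bytes(), 'bytes held by Arrow')
del table

# Default read (use_threads=True): the case reported to run out of memory
# on 0.14.x for this file.
table = pq.read_table('/tmp/test.parquet')
print('use_threads=True:', pa.total_allocated_bytes(), 'bytes held by Arrow')
{code}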



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)