Posted to issues@arrow.apache.org by "Olivier Giboin (JIRA)" <ji...@apache.org> on 2019/08/04 18:20:00 UTC
[jira] [Comment Edited] (ARROW-6059) [Python] Regression memory issue when calling pandas.read_parquet
[ https://issues.apache.org/jira/browse/ARROW-6059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896260#comment-16896260 ]
Olivier Giboin edited comment on ARROW-6059 at 8/4/19 6:19 PM:
---------------------------------------------------------------
Just ran a test with use_threads = False --> same error.
--> correct, so the symptoms differ from ARROW-6060, but could it be the same root cause?
Interestingly, the way memory gets allocated during read_table looks quite different between 0.13 and 0.14.1.
v0.13 - smooth memory allocation (took ~40s to load the file). Load successful.
!Memory_profile_0.13_rs.png!
v0.14.1 - erratic, spiky memory allocation (shown here with use_threads = False; it gets even spikier when switching to True).
Load failed: I get a malloc error at the end of the command.
!Memory_profile_0.14.1_use_thread_false_rs.png!
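The profiles above were captured with an external sampler; for reference, a rough in-process way to measure the peak of a load step can be sketched with the stdlib tracemalloc module. This is a minimal sketch under stated assumptions: tracemalloc only tracks Python-heap allocations, not Arrow's C++ memory pools, so it is a proxy rather than a substitute for RSS plots, and the lambda below is a stand-in workload, not the actual read.

```python
# Minimal sketch: peak Python-heap usage of a single load step.
# Caveat: tracemalloc does not see allocations made by Arrow's C++
# memory pools, so this understates the real footprint of read_table.
import tracemalloc

def peak_mib(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, peak Python-heap MiB)."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 2**20

# Stand-in workload; in this report it would be something like
# pq.read_table(input_file, use_threads=False)
data, peak = peak_mib(lambda: list(range(1_000_000)))
```
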
> [Python] Regression memory issue when calling pandas.read_parquet
> -----------------------------------------------------------------
>
> Key: ARROW-6059
> URL: https://issues.apache.org/jira/browse/ARROW-6059
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.14.0, 0.14.1
> Reporter: Francisco Sanchez
> Priority: Major
> Attachments: Memory_profile_0.13.png, Memory_profile_0.13_rs.png, Memory_profile_0.14.1_use_thread_FALSE.png, Memory_profile_0.14.1_use_thread_false_rs.png, Memory_profile_0.14.1_use_thread_true.png
>
>
> I have a ~3MB parquet file with the next schema:
> {code:java}
> bag_stamp: timestamp[ns]
> transforms_[]_.header.seq: list<item: int64>
>   child 0, item: int64
> transforms_[]_.header.stamp: list<item: timestamp[ns]>
>   child 0, item: timestamp[ns]
> transforms_[]_.header.frame_id: list<item: string>
>   child 0, item: string
> transforms_[]_.child_frame_id: list<item: string>
>   child 0, item: string
> transforms_[]_.transform.translation.x: list<item: double>
>   child 0, item: double
> transforms_[]_.transform.translation.y: list<item: double>
>   child 0, item: double
> transforms_[]_.transform.translation.z: list<item: double>
>   child 0, item: double
> transforms_[]_.transform.rotation.x: list<item: double>
>   child 0, item: double
> transforms_[]_.transform.rotation.y: list<item: double>
>   child 0, item: double
> transforms_[]_.transform.rotation.z: list<item: double>
>   child 0, item: double
> transforms_[]_.transform.rotation.w: list<item: double>
>   child 0, item: double
> {code}
> If I read it with *pandas.read_parquet()* using pyarrow 0.13.0 all seems fine and it takes no time to load. If I try the same with 0.14.0 or 0.14.1 it takes a very long time and uses ~10 GB of RAM. If I don't have enough available memory, the process often gets killed OOM. However, if I use the following code snippet instead, it works perfectly with all versions:
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> parquet_file = pq.ParquetFile(input_file)
> tables = []
> for row_group in range(parquet_file.num_row_groups):
>     tables.append(parquet_file.read_row_group(row_group, columns=columns, use_pandas_metadata=True))
> df = pa.concat_tables(tables).to_pandas()
> {code}
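As a loose, pure-Python analogy (an assumption, not Arrow code) of why processing in smaller pieces can keep intermediate peak memory lower: reducing a large range chunk by chunk keeps only one chunk's worth of temporaries alive at a time. Note the regression in the report is inside read_table itself, so the row-group loop presumably sidesteps whatever allocation pattern 0.14 triggers; this sketch only illustrates the general chunking effect.

```python
# Illustrative only: chunked processing keeps one chunk of temporaries
# alive at a time, so the peak is far below the all-at-once peak.
import tracemalloc

def peak_of(fn):
    """Peak Python-heap bytes allocated while fn() runs."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

N = 1_000_000

def all_at_once():
    # Materialize the whole intermediate list, then reduce it.
    return sum([i * 2 for i in range(N)])

def chunked(chunk=100_000):
    # Materialize and reduce one chunk at a time.
    total = 0
    for start in range(0, N, chunk):
        total += sum([i * 2 for i in range(start, min(start + chunk, N))])
    return total

peak_whole = peak_of(all_at_once)
peak_chunks = peak_of(chunked)
```
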
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)