Posted to issues@arrow.apache.org by "Igor Yastrebov (Jira)" <ji...@apache.org> on 2019/08/30 09:40:00 UTC

[jira] [Commented] (ARROW-6380) Method pyarrow.parquet.read_table has memory spikes from version 0.14

    [ https://issues.apache.org/jira/browse/ARROW-6380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919370#comment-16919370 ] 

Igor Yastrebov commented on ARROW-6380:
---------------------------------------

Is it a duplicate of [ARROW-6059|https://issues.apache.org/jira/browse/ARROW-6059]?

> Method pyarrow.parquet.read_table has memory spikes from version 0.14
> ---------------------------------------------------------------------
>
>                 Key: ARROW-6380
>                 URL: https://issues.apache.org/jira/browse/ARROW-6380
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 0.14.0, 0.14.1
>         Environment: ubuntu 18, 16GB ram, 4 cpus
>            Reporter: Renan Alves Fonseca
>            Priority: Major
>             Fix For: 0.13.0
>
>
> Method pyarrow.parquet.read_table is very slow and causes RAM spikes from version 0.14.0 onward.
> Reading a 40MB parquet file takes less than 1 second in versions 0.11, 0.12 and 0.13, whereas it takes from 6 to 30 seconds in versions 0.14.x.
> This performance impact is easily measured. However, there is another problem that I could only detect on the htop screen. While opening a 40MB parquet file, the process occupies almost 16GB for some milliseconds. The resulting pyarrow table takes around 300MB in the Python process (measured using memory-profiler). This does not happen in version 0.13 and earlier.
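
For anyone trying to reproduce the report, the following is a minimal sketch of one way to capture both the wall-clock time and the short-lived memory spike without watching htop. It is not from the original report: the helper name `measure` is hypothetical, and it assumes a Unix system (such as the reporter's Ubuntu) where the standard `resource` module is available. It relies on `ru_maxrss`, the process-wide peak resident set size, which should record a spike even if it lasts only milliseconds.

```python
import resource
import sys
import time


def measure(func, *args, **kwargs):
    """Call func once and return (result, elapsed_seconds, peak_rss_mib).

    The peak RSS is the high-water mark for the whole process lifetime,
    not just for this call, so run each measurement in a fresh process.
    """
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start

    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return result, elapsed, peak / divisor
```

With pyarrow installed, a call such as `measure(pyarrow.parquet.read_table, "example.parquet")` (the file name is a placeholder) could be run once per pyarrow version, each in a new interpreter, to compare the elapsed time and peak memory between 0.13 and 0.14.x.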



--
This message was sent by Atlassian Jira
(v8.3.2#803003)