Posted to issues@arrow.apache.org by "Axel (Jira)" <ji...@apache.org> on 2019/11/26 14:21:00 UTC

[jira] [Commented] (ARROW-6876) [Python] Reading parquet file with many columns becomes slow for 0.15.0

    [ https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16982524#comment-16982524 ] 

Axel commented on ARROW-6876:
-----------------------------

Hi, I am still experiencing some very slow load times with version 0.15.1.

With the reproducer above:

{{0.14.1:}}
{{282 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)}}

{{0.15.1:}}
{{5.06 s ± 288 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)}}

From reading the GitHub issue, I expected 0.15.1 to be somewhat slower than 0.14.1, but not by this much (roughly 18x here).
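
For anyone who wants to compare versions outside IPython, a minimal timing sketch might look like the one below. It is not the reproducer referenced above: the file shape (10,000 columns by 10 rows), the file name, and the helper names time_runs, make_wide_parquet and main are all assumptions chosen for illustration, and timing read_table followed by to_pandas is only an approximation of what pd.read_parquet does.

```python
import statistics
import time


def time_runs(fn, runs=7):
    """Call fn() `runs` times and return (mean, stdev) of wall-clock seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


def make_wide_parquet(path, num_columns=10_000, num_rows=10):
    """Write a Parquet file with many narrow float columns.

    The column and row counts are illustrative assumptions, not values
    taken from the issue.
    """
    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    arrays = [pa.array(np.random.rand(num_rows)) for _ in range(num_columns)]
    names = ["c{}".format(i) for i in range(num_columns)]
    pq.write_table(pa.Table.from_arrays(arrays, names=names), path)


def main(path="wide.parquet"):
    """Write the wide file once, then time reading it back into pandas."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    make_wide_parquet(path)
    mean, std = time_runs(lambda: pq.read_table(path).to_pandas())
    print("pyarrow {}: {:.3f} s ± {:.3f} s per loop".format(pa.__version__, mean, std))
```

Calling main() once in an environment with 0.14.1 installed and once with 0.15.1 should give two figures that can be compared directly, provided the machine, Python and pandas versions are the same.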

> [Python] Reading parquet file with many columns becomes slow for 0.15.0
> -----------------------------------------------------------------------
>
>                 Key: ARROW-6876
>                 URL: https://issues.apache.org/jira/browse/ARROW-6876
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.0
>         Environment: python3.7
>            Reporter: Bob
>            Assignee: Wes McKinney
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0, 0.15.1
>
>         Attachments: image-2019-10-14-18-10-42-850.png, image-2019-10-14-18-12-07-652.png
>
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Hi,
>  
> I just noticed that reading a Parquet file with pandas became really slow after I upgraded to 0.15.0.
>  
> Example:
> *With 0.14.1*
>  In [4]: %timeit df = pd.read_parquet(path)
>  2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> *With 0.15.0*
>  In [5]: %timeit df = pd.read_parquet(path)
>  22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>  
> The file is about 15MB in size. I am testing on the same machine using the same version of python and pandas.
>  
> Have you received similar complaints? What could be the issue here?
>  
> Thanks a lot.
>  
>  
> Edit1:
> Some profiling I did:
> 0.14.1:
> !image-2019-10-14-18-12-07-652.png!
>  
> 0.15.0:
> !image-2019-10-14-18-10-42-850.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)