Posted to issues@arrow.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2020/04/27 02:47:00 UTC

[jira] [Commented] (ARROW-7556) [Python][Parquet] Performance regression in pyarrow-0.15.1 vs pyarrow-0.12.1 when reading a "partitioned parquet table" ?

    [ https://issues.apache.org/jira/browse/ARROW-7556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17092951#comment-17092951 ] 

Wes McKinney commented on ARROW-7556:
-------------------------------------

It would be nice to assess this using the new datasets API.
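
As a rough illustration of what such an assessment might look like, here is a minimal sketch that reads one table through `pyarrow.dataset` and times repeated loads. It assumes a pyarrow version that ships `pyarrow.dataset`, a hive-style directory layout (`p_type=.../p_start=...`), and that both partition columns can be read as strings; `load_with_datasets` and `time_runs` are hypothetical helper names, not part of pyarrow.

```python
import os
import time


def load_with_datasets(data_dir, table, start_date, end_date):
    """Read one partitioned table via pyarrow.dataset, filtered on p_start."""
    # Imported here so the timing helper below can be used without pyarrow installed.
    import pyarrow as pa
    import pyarrow.dataset as ds

    # Assumption: both partition columns are read as strings, so the filter
    # can compare p_start against string bounds as in the reporter's script.
    partitioning = ds.partitioning(
        pa.schema([("p_type", pa.string()), ("p_start", pa.string())]),
        flavor="hive",
    )
    dataset = ds.dataset(os.path.join(data_dir, table),
                         format="parquet", partitioning=partitioning)
    in_range = (ds.field("p_start") >= start_date) & (ds.field("p_start") <= end_date)
    return dataset.to_table(filter=in_range).to_pandas()


def time_runs(func, repeats=9):
    """Call func() `repeats` times and return the wall-clock duration of each call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations
```

For example, `time_runs(lambda: load_with_datasets('parquet/', 'reports', '201912230000', '202001080000'))` would give nine timings comparable to the ones in the report below.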

> [Python][Parquet] Performance regression in pyarrow-0.15.1 vs pyarrow-0.12.1 when reading a "partitioned parquet table" ?
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-7556
>                 URL: https://issues.apache.org/jira/browse/ARROW-7556
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Benchmarking, Python
>    Affects Versions: 0.15.1
>            Reporter: Julien MASSIOT
>            Priority: Major
>              Labels: dataset-parquet-read, parquet
>             Fix For: 1.0.0
>
>         Attachments: arrow-7756.tar.gz, load_df_from_pyarrow.py, load_df_pyarrow-0-15.1.png, load_df_pyarrow-0.12.1.png, load_pyarrow_0.12.1.cprof, load_pyarrow_0.15.1.cprof
>
>
> Hi,
> I am currently running a small test with pyarrow to load 2 "partitioned" parquet tables.
> Loading seems to be about 2 times slower with pyarrow-0.15.1 than with pyarrow-0.12.1.
> In my test:
>  * the 'parquet' tables I am loading are called 'reports' & 'memory'
>  * they have been generated through pandas.to_parquet by specifying the partition columns
>  * they are both partitioned by 2 columns 'p_type' and 'p_start'
>  * they are small tables:
>  ** reports
>  *** 90 partitions (1 parquet file / partition)
>  *** total size: 6.2MB
>  ** memory
>  *** 105 partitions (1 parquet file / partition)
>  *** total size: 9.1MB
>  
> Here is the code of my simple test that tries to read them (I'm using a filter on the p_start partition):
>  
> {code:python}
> import os
> import sys
> import time
>
> import pyarrow
> from pyarrow.parquet import ParquetDataset
>
>
> def load_dataframe(data_dir, table, start_date, end_date):
>     # Keep only the partitions whose p_start lies in [start_date, end_date].
>     dataset = ParquetDataset(os.path.join(data_dir, table),
>                              filters=[('p_start', '>=', start_date),
>                                       ('p_start', '<=', end_date)])
>     return dataset.read().to_pandas()
>
>
> print(f'pyarrow version;{pyarrow.__version__}')
> data_dir = sys.argv[1]
> for _ in range(9):
>     # Time one load of both tables.
>     start = time.time()
>     start_date = '201912230000'
>     end_date = '202001080000'
>     load_dataframe(data_dir, 'reports', start_date, end_date)
>     load_dataframe(data_dir, 'memory', start_date, end_date)
>     print(f'loaded;in;{time.time()-start}')
> {code}
>  
>  
> Here are the results:
>  * with pyarrow-0.12.1
> $ python -m cProfile -o load_pyarrow_0.12.1.cprof load_df_from_pyarrow.py parquet/
> pyarrow version;0.12.1
> loaded;in;0.5566098690032959
> loaded;in;0.32605648040771484
> loaded;in;0.28951501846313477
> loaded;in;0.29279112815856934
> loaded;in;0.3474299907684326
> loaded;in;0.4075736999511719
> loaded;in;0.425199031829834
> loaded;in;0.34653329849243164
> loaded;in;0.300839900970459
> (~350ms to load the 2 tables)
>  
> !load_df_pyarrow-0.12.1.png!
>  
>  * with pyarrow-0.15.1
>  
> $ python -m cProfile -o load_pyarrow_0.15.1.cprof load_df_from_pyarrow.py parquet/
> pyarrow version;0.15.1
> loaded;in;1.1126022338867188
> loaded;in;0.8931224346160889
> loaded;in;1.3298325538635254
> loaded;in;0.8584625720977783
> loaded;in;0.9232609272003174
> loaded;in;1.0619215965270996
> loaded;in;0.8619768619537354
> loaded;in;0.8686420917510986
> loaded;in;1.1183602809906006
> (>800ms to load the 2 tables)
>  
> !load_df_pyarrow-0-15.1.png!
>  
> Is there a performance regression here?
> Am I missing something?
>  
> The 2 .cprof files are attached:
> [^load_pyarrow_0.12.1.cprof]
> [^load_pyarrow_0.15.1.cprof]
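
One way to compare the two attached profiles is with the standard-library `pstats` module. The following is a minimal sketch; `top_functions` is a hypothetical helper name, and it assumes the .cprof files were written by `python -m cProfile -o` as in the commands quoted above.

```python
import io
import pstats


def top_functions(profile_path, limit=15):
    """Return a text report of the `limit` entries with the highest cumulative time."""
    buffer = io.StringIO()
    stats = pstats.Stats(profile_path, stream=buffer)
    # strip_dirs() shortens file paths so reports from two environments line up better.
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()
```

Printing `top_functions('load_pyarrow_0.12.1.cprof')` next to `top_functions('load_pyarrow_0.15.1.cprof')` should show which calls account for the extra time in the newer version.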

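To put a single number on the difference, the timings quoted above can be summarized by their medians, which are less sensitive to the slower first iteration than a mean. This is only a sketch over the nine values reported for each version; the variable names are illustrative.

```python
from statistics import median

# Wall-clock timings in seconds, copied from the two runs quoted above.
TIMINGS_0_12_1 = [
    0.5566098690032959, 0.32605648040771484, 0.28951501846313477,
    0.29279112815856934, 0.3474299907684326, 0.4075736999511719,
    0.425199031829834, 0.34653329849243164, 0.300839900970459,
]
TIMINGS_0_15_1 = [
    1.1126022338867188, 0.8931224346160889, 1.3298325538635254,
    0.8584625720977783, 0.9232609272003174, 1.0619215965270996,
    0.8619768619537354, 0.8686420917510986, 1.1183602809906006,
]

median_old = median(TIMINGS_0_12_1)
median_new = median(TIMINGS_0_15_1)
slowdown = median_new / median_old

print(f"median 0.12.1: {median_old * 1000:.0f} ms")
print(f"median 0.15.1: {median_new * 1000:.0f} ms")
print(f"slowdown: {slowdown:.2f}x")
```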


--
This message was sent by Atlassian Jira
(v8.3.4#803005)