Posted to jira@arrow.apache.org by "Alessandro Molina (Jira)" <ji...@apache.org> on 2022/06/08 13:12:00 UTC

[jira] [Updated] (ARROW-16775) pyarrow's read_table is way slower than iter_batches

     [ https://issues.apache.org/jira/browse/ARROW-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alessandro Molina updated ARROW-16775:
--------------------------------------
    Priority: Critical  (was: Major)

> pyarrow's read_table is way slower than iter_batches
> ----------------------------------------------------
>
>                 Key: ARROW-16775
>                 URL: https://issues.apache.org/jira/browse/ARROW-16775
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet, Python
>    Affects Versions: 8.0.0
>         Environment: pyarrow 8.0.0
> pandas 1.4.2
> numpy 1.22.4
> python 3.9
> I reproduced this behaviour on two machines: 
> * MacBook Pro with M1 Max, 32 GB RAM, and CPython 3.9.12 from conda miniforge
> * PyTorch Docker container on a standard Linux machine
>            Reporter: Satoshi Nakamoto
>            Priority: Critical
>
> Hi!
> Loading a table created from a DataFrame with `pyarrow.parquet.read_table()` takes about 3x as much time as loading it in batches with:
>  
> {code:python}
> pyarrow.Table.from_batches(
>     list(pyarrow.parquet.ParquetFile("file.parquet").iter_batches())
> ){code}
>  
> h4. Minimal example
>  
> {code:python}
> import pandas as pd
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> # Two columns of 10**9 random doubles (~16 GB in memory)
> df = pd.DataFrame(
>     {
>         "a": np.random.random(10**9),
>         "b": np.random.random(10**9)
>     }
> )
> df.to_parquet("file.parquet")
>
> # 1) Read the whole file at once
> table_of_whole_file = pq.read_table("file.parquet")
>
> # 2) Read as record batches and reassemble into a table
> table_of_batches = pa.Table.from_batches(
>     list(
>         pq.ParquetFile("file.parquet").iter_batches()
>     )
> )
>
> # 3) Read the whole file as one big batch
> table_of_one_batch = pa.Table.from_batches(
>     [
>         next(pq.ParquetFile("file.parquet")
>              .iter_batches(batch_size=10**9))
>     ]
> ){code}
>  
> Reading _table_of_batches_ takes 11.5 seconds, while _table_of_whole_file_ takes 33.2 seconds. Loading the whole table as a single batch (_table_of_one_batch_) is slightly faster still: 9.8 seconds.
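> The report does not state how these numbers were measured; a minimal sketch using `time.perf_counter()` (an assumption, not the reporter's exact method) would be:
>
> {code:python}
> import time
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> # Hypothetical timing harness (assumption); times the same two read paths as above
> start = time.perf_counter()
> table_of_whole_file = pq.read_table("file.parquet")
> print(f"read_table: {time.perf_counter() - start:.1f}s")
>
> start = time.perf_counter()
> table_of_batches = pa.Table.from_batches(
>     list(pq.ParquetFile("file.parquet").iter_batches())
> )
> print(f"from_batches(iter_batches): {time.perf_counter() - start:.1f}s")
> {code}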
> h4. Parquet file metadata
>  
> {code}
> <pyarrow._parquet.FileMetaData object at 0x129ab83b0>
>   created_by: parquet-cpp-arrow version 8.0.0
>   num_columns: 2
>   num_rows: 1000000000
>   num_row_groups: 15
>   format_version: 1.0
>   serialized_size: 5680 {code}
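>
> For reference, the output above can be reproduced by printing the `metadata` attribute of a `ParquetFile` (a short sketch, assuming the same file as above):
>
> {code:python}
> import pyarrow.parquet as pq
>
> # Print the file-level Parquet metadata (row groups, format version, etc.)
> print(pq.ParquetFile("file.parquet").metadata)
> {code}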
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)