Posted to jira@arrow.apache.org by "Satoshi Nakamoto (Jira)" <ji...@apache.org> on 2022/06/07 13:35:00 UTC

[jira] [Created] (ARROW-16775) pyarrow's read_table is way slower than iter_batches

Satoshi Nakamoto created ARROW-16775:
----------------------------------------

             Summary: pyarrow's read_table is way slower than iter_batches
                 Key: ARROW-16775
                 URL: https://issues.apache.org/jira/browse/ARROW-16775
             Project: Apache Arrow
          Issue Type: Bug
          Components: Parquet, Python
    Affects Versions: 8.0.0
         Environment: pyarrow 8.0.0
pandas 1.4.2
numpy 1.22.4
python 3.9

I reproduced this behaviour on two machines:
* MacBook Pro (M1 Max, 32 GB) with CPython 3.9.12 from conda miniforge
* PyTorch Docker container on a standard Linux machine
            Reporter: Satoshi Nakamoto


Hi!

Loading a Parquet file (written from a DataFrame) with `pyarrow.parquet.read_table()` takes about 3x as much time as loading it as batches with

 
{code:python}
pyarrow.Table.from_batches(
    list(pyarrow.parquet.ParquetFile("file.parquet").iter_batches())
)
{code}
 
h4. Minimal example

 
{code:python}
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Two float64 columns of 10**9 random values each (~8 GB per column in memory).
df = pd.DataFrame(
    {
        "a": np.random.random(10**9),
        "b": np.random.random(10**9)
    }
)

df.to_parquet("file.parquet")

# 1) Read the whole file in a single call.
table_of_whole_file = pq.read_table("file.parquet")

# 2) Read the file as record batches and reassemble them into a table.
table_of_batches = pa.Table.from_batches(
    list(
        pq.ParquetFile("file.parquet").iter_batches()
    )
)

# 3) Read the file as one large batch.
table_of_one_batch = pa.Table.from_batches(
    [
        next(
            pq.ParquetFile("file.parquet").iter_batches(batch_size=10**9)
        )
    ]
)
{code}
 

Reading _table_of_batches_ takes 11.5 s, while reading _table_of_whole_file_ takes 33.2 s.

Loading the table as a single batch (_table_of_one_batch_) is slightly faster still: 9.8 s.
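
Timings of this kind can be measured with a simple wall-clock comparison along these lines (a minimal sketch, assuming the file.parquet generated in the example above; the _timed_ helper is only illustrative, not part of pyarrow):
{code:python}
import time

import pyarrow as pa
import pyarrow.parquet as pq

def timed(label, fn):
    # Run fn once and report its wall-clock time (illustrative helper).
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.1f}s")
    return result

timed("read_table", lambda: pq.read_table("file.parquet"))
timed(
    "from_batches(iter_batches)",
    lambda: pa.Table.from_batches(
        list(pq.ParquetFile("file.parquet").iter_batches())
    ),
)
{code}
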
h4. Parquet file metadata

 
{code}
<pyarrow._parquet.FileMetaData object at 0x129ab83b0>
  created_by: parquet-cpp-arrow version 8.0.0
  num_columns: 2
  num_rows: 1000000000
  num_row_groups: 15
  format_version: 1.0
  serialized_size: 5680 {code}
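
For completeness, metadata of this form can be printed from the Parquet footer without reading any row groups, e.g.:
{code:python}
import pyarrow.parquet as pq

# Only the file footer is read here, not the data pages.
print(pq.ParquetFile("file.parquet").metadata)
{code}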
 
--
This message was sent by Atlassian Jira
(v8.20.7#820007)