Posted to jira@arrow.apache.org by "Satoshi Nakamoto (Jira)" <ji...@apache.org> on 2022/06/07 13:35:00 UTC
[jira] [Created] (ARROW-16775) pyarrow's read_table is way slower than iter_batches
Satoshi Nakamoto created ARROW-16775:
----------------------------------------
Summary: pyarrow's read_table is way slower than iter_batches
Key: ARROW-16775
URL: https://issues.apache.org/jira/browse/ARROW-16775
Project: Apache Arrow
Issue Type: Bug
Components: Parquet, Python
Affects Versions: 8.0.0
Environment: pyarrow 8.0.0
pandas 1.4.2
numpy 1.22.4
python 3.9
I reproduced this behaviour on two machines:
* macbook pro with m1 max 32 gb and cpython 3.9.12 from conda miniforge
* pytorch docker container on standard linux machine
Reporter: Satoshi Nakamoto
Hi!
Loading a Parquet file written from a DataFrame with `pyarrow.parquet.read_table()` takes 3x as much time as loading it as batches via
{code:python}
pyarrow.Table.from_batches(
    list(pyarrow.parquet.ParquetFile("file.parquet").iter_batches())
){code}
h4. Minimal example
{code:python}
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(
    {
        "a": np.random.random(10**9),
        "b": np.random.random(10**9),
    }
)
df.to_parquet("file.parquet")

table_of_whole_file = pq.read_table("file.parquet")

table_of_batches = pa.Table.from_batches(
    list(
        pq.ParquetFile("file.parquet").iter_batches()
    )
)

table_of_one_batch = pa.Table.from_batches(
    [
        next(
            pq.ParquetFile("file.parquet")
            .iter_batches(batch_size=10**9)
        )
    ]
){code}
Reading into _table_of_batches_ takes 11.5 seconds, while _table_of_whole_file_ takes 33.2 seconds.
Loading the file as a single batch (_table_of_one_batch_) is slightly faster still: 9.8 seconds.
h4. Parquet file metadata
{code:none}
<pyarrow._parquet.FileMetaData object at 0x129ab83b0>
created_by: parquet-cpp-arrow version 8.0.0
num_columns: 2
num_rows: 1000000000
num_row_groups: 15
format_version: 1.0
serialized_size: 5680 {code}
--
This message was sent by Atlassian Jira
(v8.20.7#820007)