Posted to jira@arrow.apache.org by "Alessandro Molina (Jira)" <ji...@apache.org> on 2022/06/08 13:12:00 UTC
[jira] [Updated] (ARROW-16775) pyarrow's read_table is way slower than iter_batches
[ https://issues.apache.org/jira/browse/ARROW-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alessandro Molina updated ARROW-16775:
--------------------------------------
Priority: Critical (was: Major)
> pyarrow's read_table is way slower than iter_batches
> ----------------------------------------------------
>
> Key: ARROW-16775
> URL: https://issues.apache.org/jira/browse/ARROW-16775
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet, Python
> Affects Versions: 8.0.0
> Environment: pyarrow 8.0.0
> pandas 1.4.2
> numpy 1.22.4
> python 3.9
> I reproduced this behaviour on two machines:
> * macbook pro with m1 max 32 gb and cpython 3.9.12 from conda miniforge
> * pytorch docker container on standard linux machine
> Reporter: Satoshi Nakamoto
> Priority: Critical
>
> Hi!
> Loading a Parquet file written from a DataFrame with `pyarrow.parquet.read_table()` takes 3x as much time as loading the same file as batches via
>
> {code:python}
> pyarrow.Table.from_batches(
>     list(pyarrow.parquet.ParquetFile("file.parquet").iter_batches())
> ){code}
>
> h4. Minimal example
>
> {code:python}
> import pandas as pd
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> df = pd.DataFrame(
>     {
>         "a": np.random.random(10**9),
>         "b": np.random.random(10**9),
>     }
> )
> df.to_parquet("file.parquet")
>
> table_of_whole_file = pq.read_table("file.parquet")
>
> table_of_batches = pa.Table.from_batches(
>     list(pq.ParquetFile("file.parquet").iter_batches())
> )
>
> table_of_one_batch = pa.Table.from_batches(
>     [
>         next(
>             pq.ParquetFile("file.parquet").iter_batches(batch_size=10**9)
>         )
>     ]
> ){code}
>
> Reading _table_of_batches_ takes 11.5 s, while _table_of_whole_file_ takes 33.2 s.
> Loading the table as a single batch (_table_of_one_batch_) is slightly faster still: 9.8 s.
> h4. Parquet file metadata
>
> {code:none}
> <pyarrow._parquet.FileMetaData object at 0x129ab83b0>
> created_by: parquet-cpp-arrow version 8.0.0
> num_columns: 2
> num_rows: 1000000000
> num_row_groups: 15
> format_version: 1.0
> serialized_size: 5680 {code}
>
>
--
This message was sent by Atlassian Jira
(v8.20.7#820007)