Posted to jira@arrow.apache.org by "Satoshi Nakamoto (Jira)" <ji...@apache.org> on 2022/07/08 12:02:00 UTC
[jira] [Commented] (ARROW-16775) [Python] pyarrow's read_table is way slower than iter_batches
[ https://issues.apache.org/jira/browse/ARROW-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564252#comment-17564252 ]
Satoshi Nakamoto commented on ARROW-16775:
------------------------------------------
[~westonpace] here are results on my linux server machine with python3.9
{code:java}
Multiplier: 0.5
table_of_whole_file: 0.48801302909851074
table_of_batches: 0.2905139923095703
difference: -0.19749903678894043
svmem(total=34359738368, available=15275065344, percent=55.5, used=16666951680, free=1070137344, active=14219853824, inactive=14177533952, wired=2447097856)
Multiplier: 1
table_of_whole_file: 0.7520689964294434
table_of_batches: 0.6367530822753906
difference: -0.11531591415405273
svmem(total=34359738368, available=15246065664, percent=55.6, used=15617949696, free=2043740160, active=13216202752, inactive=13194248192, wired=2401746944)
Multiplier: 1.5
table_of_whole_file: 0.7303309440612793
table_of_batches: 0.7717642784118652
difference: 0.04143333435058594
svmem(total=34359738368, available=15102951424, percent=56.0, used=14017642496, free=3168894976, active=11958730752, inactive=11807539200, wired=2058911744)
Multiplier: 2
table_of_whole_file: 1.0269582271575928
table_of_batches: 1.0449450016021729
difference: 0.017986774444580078
svmem(total=34359738368, available=14586134528, percent=57.5, used=11820531712, free=4926554112, active=9668657152, inactive=9565945856, wired=2151874560)
Multiplier: 2.5
table_of_whole_file: 2.2550718784332275
table_of_batches: 1.280627965927124
difference: -0.9744439125061035
svmem(total=34359738368, available=13351124992, percent=61.1, used=9824157696, free=5161664512, active=7733215232, inactive=8085258240, wired=2090942464)
Multiplier: 3
table_of_whole_file: 3.1672561168670654
table_of_batches: 1.7320072650909424
difference: -1.435248851776123
svmem(total=34359738368, available=13742899200, percent=60.0, used=9232990208, free=5884755968, active=7187152896, inactive=7759429632, wired=2045837312) {code}
As you can see, {{table_of_batches}} is consistently faster than {{table_of_whole_file}}, especially at higher multipliers: at a multiplier of 3 the gap is about x1.5, and as I said earlier, at multipliers of 10+ the difference is x3+.
Which env were you testing on?
> [Python] pyarrow's read_table is way slower than iter_batches
> -------------------------------------------------------------
>
> Key: ARROW-16775
> URL: https://issues.apache.org/jira/browse/ARROW-16775
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet, Python
> Affects Versions: 8.0.0
> Environment: pyarrow 8.0.0
> pandas 1.4.2
> numpy 1.22.4
> python 3.9
> I reproduced this behaviour on two machines:
> * macbook pro with m1 max 32 gb and cpython 3.9.12 from conda miniforge
> * pytorch docker container on standard linux machine
> Reporter: Satoshi Nakamoto
> Priority: Major
> Attachments: image-2022-06-16-03-04-25-158.png
>
>
> Hi!
> Loading a table created from a DataFrame with `pyarrow.parquet.read_table()` takes about 3x as much time as loading it as batches:
>
> {code:python}
> pyarrow.Table.from_batches(
>     list(pyarrow.parquet.ParquetFile.iter_batches())
> ){code}
>
> h4. Minimal example
>
> {code:python}
> import pandas as pd
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> df = pd.DataFrame(
>     {
>         "a": np.random.random(10**9),
>         "b": np.random.random(10**9),
>     }
> )
> df.to_parquet("file.parquet")
>
> table_of_whole_file = pq.read_table("file.parquet")
>
> table_of_batches = pa.Table.from_batches(
>     list(
>         pq.ParquetFile("file.parquet").iter_batches()
>     )
> )
>
> table_of_one_batch = pa.Table.from_batches(
>     [
>         next(
>             pq.ParquetFile("file.parquet")
>             .iter_batches(batch_size=10**9)
>         )
>     ]
> ){code}
>
> _table_of_batches_ reads in 11.5 s, while _table_of_whole_file_ takes 33.2 s.
> Loading the table as a single batch (_table_of_one_batch_) is slightly faster still: 9.8 s.
> h4. Parquet file metadata
>
> {code:java}
> <pyarrow._parquet.FileMetaData object at 0x129ab83b0>
> created_by: parquet-cpp-arrow version 8.0.0
> num_columns: 2
> num_rows: 1000000000
> num_row_groups: 15
> format_version: 1.0
> serialized_size: 5680 {code}
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)