Posted to jira@arrow.apache.org by "Satoshi Nakamoto (Jira)" <ji...@apache.org> on 2022/07/08 12:02:00 UTC

[jira] [Commented] (ARROW-16775) [Python] pyarrow's read_table is way slower than iter_batches

    [ https://issues.apache.org/jira/browse/ARROW-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564252#comment-17564252 ] 

Satoshi Nakamoto commented on ARROW-16775:
------------------------------------------

[~westonpace] here are the results on my Linux server machine with Python 3.9
{code:java}
Multiplier: 0.5
  table_of_whole_file: 0.48801302909851074
  table_of_batches: 0.2905139923095703
  difference: -0.19749903678894043
  svmem(total=34359738368, available=15275065344, percent=55.5, used=16666951680, free=1070137344, active=14219853824, inactive=14177533952, wired=2447097856)
Multiplier: 1
  table_of_whole_file: 0.7520689964294434
  table_of_batches: 0.6367530822753906
  difference: -0.11531591415405273
  svmem(total=34359738368, available=15246065664, percent=55.6, used=15617949696, free=2043740160, active=13216202752, inactive=13194248192, wired=2401746944)
Multiplier: 1.5
  table_of_whole_file: 0.7303309440612793
  table_of_batches: 0.7717642784118652
  difference: 0.04143333435058594
  svmem(total=34359738368, available=15102951424, percent=56.0, used=14017642496, free=3168894976, active=11958730752, inactive=11807539200, wired=2058911744)
Multiplier: 2
  table_of_whole_file: 1.0269582271575928
  table_of_batches: 1.0449450016021729
  difference: 0.017986774444580078
  svmem(total=34359738368, available=14586134528, percent=57.5, used=11820531712, free=4926554112, active=9668657152, inactive=9565945856, wired=2151874560)
Multiplier: 2.5
  table_of_whole_file: 2.2550718784332275
  table_of_batches: 1.280627965927124
  difference: -0.9744439125061035
  svmem(total=34359738368, available=13351124992, percent=61.1, used=9824157696, free=5161664512, active=7733215232, inactive=8085258240, wired=2090942464)
Multiplier: 3
  table_of_whole_file: 3.1672561168670654
  table_of_batches: 1.7320072650909424
  difference: -1.435248851776123
  svmem(total=34359738368, available=13742899200, percent=60.0, used=9232990208, free=5884755968, active=7187152896, inactive=7759429632, wired=2045837312) {code}
As you can see, {{batches}} is generally faster than {{whole_file}}, especially at higher multipliers: a multiplier of 3 gives roughly a 1.8x difference. As I said earlier, at multipliers of 10+ the difference is 3x+.
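
A harness along these lines can produce this kind of output (a minimal sketch: I'm assuming the multiplier scales the number of rows of the test file relative to a base size, and that the memory line comes from psutil; the actual script may differ in details):
{code:java}
import time

import numpy as np
import pandas as pd
import psutil
import pyarrow as pa
import pyarrow.parquet as pq

BASE_ROWS = 10**8  # assumed base size, chosen for illustration only

for multiplier in (0.5, 1, 1.5, 2, 2.5, 3):
    n = int(BASE_ROWS * multiplier)
    pd.DataFrame({"a": np.random.random(n), "b": np.random.random(n)}).to_parquet("file.parquet")

    # read the whole file at once
    t0 = time.perf_counter()
    pq.read_table("file.parquet")
    t_whole = time.perf_counter() - t0

    # read it batch by batch and reassemble a Table
    t0 = time.perf_counter()
    pa.Table.from_batches(list(pq.ParquetFile("file.parquet").iter_batches()))
    t_batches = time.perf_counter() - t0

    print(f"Multiplier: {multiplier}")
    print(f"  table_of_whole_file: {t_whole}")
    print(f"  table_of_batches: {t_batches}")
    print(f"  difference: {t_batches - t_whole}")
    print(f"  {psutil.virtual_memory()}"){code}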

Which env were you testing on?

> [Python] pyarrow's read_table is way slower than iter_batches
> -------------------------------------------------------------
>
>                 Key: ARROW-16775
>                 URL: https://issues.apache.org/jira/browse/ARROW-16775
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet, Python
>    Affects Versions: 8.0.0
>         Environment: pyarrow 8.0.0
> pandas 1.4.2
> numpy 1.22.4
> python 3.9
> I reproduced this behaviour on two machines: 
> * MacBook Pro with M1 Max, 32 GB RAM, and CPython 3.9.12 from conda miniforge
> * PyTorch Docker container on a standard Linux machine
>            Reporter: Satoshi Nakamoto
>            Priority: Major
>         Attachments: image-2022-06-16-03-04-25-158.png
>
>
> Hi!
> Loading a table (written from a pandas DataFrame) with `pyarrow.parquet.read_table()` takes about 3x as much time as loading it as batches via:
>  
> {code:java}
> pyarrow.Table.from_batches(
>     list(pyarrow.parquet.ParquetFile("file.parquet").iter_batches())
> ){code}
>  
> h4. Minimal example
>  
> {code:java}
> import pandas as pd
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame(
>     {
>         "a": np.random.random(10**9), 
>         "b": np.random.random(10**9)
>     }
> )
> df.to_parquet("file.parquet")
> table_of_whole_file = pq.read_table("file.parquet")
> table_of_batches = pa.Table.from_batches(
>     list(
>         pq.ParquetFile("file.parquet").iter_batches()
>     )
> )
> table_of_one_batch = pa.Table.from_batches(
>     [
>         next(pq.ParquetFile("file.parquet")
>         .iter_batches(batch_size=10**9))
>     ]
> ){code}
>  
> Reading _table_of_batches_ takes 11.5 seconds, while reading _table_of_whole_file_ takes 33.2 seconds.
> Loading the table as a single batch (_table_of_one_batch_) is slightly faster still: 9.8 seconds.
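> The times can be measured with a simple wall-clock wrapper along these lines (a minimal sketch reusing the imports and {{file.parquet}} from the example above; the timings reported here may have been taken differently):
> {code:java}
> import time
> 
> t0 = time.perf_counter()
> table_of_whole_file = pq.read_table("file.parquet")
> print("whole_file:", time.perf_counter() - t0)
> 
> t0 = time.perf_counter()
> table_of_batches = pa.Table.from_batches(
>     list(pq.ParquetFile("file.parquet").iter_batches())
> )
> print("batches:", time.perf_counter() - t0){code}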
> h4. Parquet file metadata
>  
> {code:java}
> <pyarrow._parquet.FileMetaData object at 0x129ab83b0>
>   created_by: parquet-cpp-arrow version 8.0.0
>   num_columns: 2
>   num_rows: 1000000000
>   num_row_groups: 15
>   format_version: 1.0
>   serialized_size: 5680 {code}
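>  
> This metadata can be printed straight from the file (a short sketch; {{ParquetFile}} exposes it via its {{metadata}} attribute):
> {code:java}
> import pyarrow.parquet as pq
> 
> # file-level metadata: created_by, num_rows, num_row_groups, format_version, ...
> print(pq.ParquetFile("file.parquet").metadata){code}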
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)