Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/06/29 17:46:00 UTC

[jira] [Commented] (ARROW-16775) [Python] pyarrow's read_table is way slower than iter_batches

    [ https://issues.apache.org/jira/browse/ARROW-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17560600#comment-17560600 ] 

Weston Pace commented on ARROW-16775:
-------------------------------------

I'm still not having any luck reproducing this.  read_table is consistently faster than iter_batches for me.  Here is the script I'm running:

{noformat}
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import time
import math
import psutil

for multiplier in [0.5, 1, 1.5, 2, 2.5, 3]:
  batch_size = multiplier * (10**8)
  df = pd.DataFrame(
      {
          "a": np.random.random(math.ceil(batch_size)),
          "b": np.random.random(math.ceil(batch_size))
      }
  )

  df.to_parquet("file.parquet")

  print(f'Multiplier: {multiplier}')
  start = time.time()
  table_of_whole_file = pq.read_table("file.parquet")
  table_of_whole_file_time = time.time() - start
  print(f'  table_of_whole_file: {table_of_whole_file_time}')

  start = time.time()
  table_of_batches = pa.Table.from_batches(
      list(
          pq.ParquetFile("file.parquet").iter_batches(batch_size=10**9)
      )
  )
  table_of_batches_time = time.time() - start
  print(f'  table_of_batches: {table_of_batches_time}')

  print(f'  difference: {table_of_batches_time - table_of_whole_file_time}')
  virt_mem = psutil.virtual_memory()
  print(f'  {virt_mem}')
{noformat}

...and sample results...

{noformat}
Multiplier: 0.5
  table_of_whole_file: 0.36187195777893066
  table_of_batches: 0.48351168632507324
  difference: 0.12163972854614258
  svmem(total=33526784000, available=22198300672, percent=33.8, used=10641412096, free=20502384640, active=1426874368, inactive=10684268544, buffers=184336384, cached=2198650880, shared=202821632, slab=481918976)
Multiplier: 1
  table_of_whole_file: 0.6138112545013428
  table_of_batches: 0.8580827713012695
  difference: 0.24427151679992676
  svmem(total=33526784000, available=18544037888, percent=44.7, used=14295646208, free=16026877952, active=2184822784, inactive=14370017280, buffers=184406016, cached=3019853824, shared=202821632, slab=502353920)
Multiplier: 1.5
  table_of_whole_file: 0.84134840965271
  table_of_batches: 1.1889660358428955
  difference: 0.34761762619018555
  svmem(total=33526784000, available=14790352896, percent=55.9, used=19936378880, free=9563262976, active=2951610368, inactive=20027539456, buffers=184483840, cached=3842658304, shared=202821632, slab=523534336)
Multiplier: 2
  table_of_whole_file: 1.1325435638427734
  table_of_batches: 1.5530588626861572
  difference: 0.4205152988433838
  svmem(total=33526784000, available=11027169280, percent=67.1, used=21811699712, free=6840262656, active=3731607552, inactive=21917872128, buffers=185122816, cached=4689698816, shared=203624448, slab=549113856)
Multiplier: 2.5
  table_of_whole_file: 1.3614778518676758
  table_of_batches: 1.9443469047546387
  difference: 0.5828690528869629
  svmem(total=33526784000, available=7724249088, percent=77.0, used=25151610880, free=2676121600, active=4467277824, inactive=25309102080, buffers=185188352, cached=5513863168, shared=203624448, slab=574017536)
Multiplier: 3
  table_of_whole_file: 1.8207118511199951
  table_of_batches: 27.461324453353882
  difference: 25.640612602233887
  svmem(total=33526784000, available=5755801600, percent=82.8, used=27084144640, free=615559168, active=2758205440, inactive=29104529408, buffers=62664704, cached=5764415488, shared=203624448, slab=517799936)
{noformat}

I will try to find someone with an M1 to test this for me.  I'll also try timeit to see whether running in a loop matters.  (One observation: the large jump at multiplier 3 above coincides with memory nearing capacity in the svmem output, so it may just be swap pressure on my machine rather than a pyarrow difference.)

> [Python] pyarrow's read_table is way slower than iter_batches
> -------------------------------------------------------------
>
>                 Key: ARROW-16775
>                 URL: https://issues.apache.org/jira/browse/ARROW-16775
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet, Python
>    Affects Versions: 8.0.0
>         Environment: pyarrow 8.0.0
> pandas 1.4.2
> numpy 1.22.4
> python 3.9
> I reproduced this behaviour on two machines: 
> * macbook pro with m1 max 32 gb and cpython 3.9.12 from conda miniforge
> * pytorch docker container on standard linux machine
>            Reporter: Satoshi Nakamoto
>            Priority: Major
>         Attachments: image-2022-06-16-03-04-25-158.png
>
>
> Hi!
> Loading a table written from a DataFrame with `pyarrow.parquet.read_table()` takes 3x as much time as loading it in batches with
>  
> {code:python}
> pyarrow.Table.from_batches(
>     list(pyarrow.parquet.ParquetFile("file.parquet").iter_batches())
> )
> {code}
>  
> h4. Minimal example
>  
> {code:python}
> import pandas as pd
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame(
>     {
>         "a": np.random.random(10**9), 
>         "b": np.random.random(10**9)
>     }
> )
> df.to_parquet("file.parquet")
> table_of_whole_file = pq.read_table("file.parquet")
> table_of_batches = pa.Table.from_batches(
>     list(
>         pq.ParquetFile("file.parquet").iter_batches()
>     )
> )
> table_of_one_batch = pa.Table.from_batches(
>     [
>         next(pq.ParquetFile("file.parquet")
>              .iter_batches(batch_size=10**9))
>     ]
> )
> {code}
>  
> _table_of_batches_ reading time is 11.5 seconds, while _table_of_whole_file_ takes 33.2 seconds.
> Loading the table as a single batch (_table_of_one_batch_) is slightly faster still: 9.8 seconds.
> h4. Parquet file metadata
>  
> {noformat}
> <pyarrow._parquet.FileMetaData object at 0x129ab83b0>
>   created_by: parquet-cpp-arrow version 8.0.0
>   num_columns: 2
>   num_rows: 1000000000
>   num_row_groups: 15
>   format_version: 1.0
>   serialized_size: 5680
> {noformat}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)