Posted to jira@arrow.apache.org by "David Li (Jira)" <ji...@apache.org> on 2021/04/16 19:36:00 UTC

[jira] [Commented] (ARROW-12428) [Python] pyarrow.parquet.read_* should use pre_buffer=True

    [ https://issues.apache.org/jira/browse/ARROW-12428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324041#comment-17324041 ] 

David Li commented on ARROW-12428:
----------------------------------

Here's a quick comparison between Pandas/S3FS and PyArrow, with a pre_buffer option implemented:

{noformat}
Python: 3.9.2
Pandas: 1.2.3
PyArrow: 5.0.0 master (9c1e5bd19347635ea9f373bcf93f2cea0231d50a)

Pandas/S3FS: 107.31099020410329 seconds
Pandas/S3FS (no readahead): 676.9701101030223 seconds
PyArrow: 213.81073790509254 seconds
PyArrow (pre-buffer): 29.330630503827706 seconds
Pandas/S3FS (pre-buffer): 54.61801828909665 seconds
Pandas/S3FS (pre-buffer, no readahead): 46.7531590978615 seconds {noformat}

{code:python}
import time

import pandas as pd
import pyarrow.parquet as pq

PATH = "s3://ursa-labs-taxi-data/2012/01/data.parquet"
# A minimal block size plus no caching effectively disables s3fs readahead.
NO_READAHEAD = {
    'default_block_size': 1,  # 0 is ignored
    'default_fill_cache': False,
}

start = time.monotonic()
df = pd.read_parquet(PATH)
print("Pandas/S3FS:", time.monotonic() - start, "seconds")

start = time.monotonic()
df = pd.read_parquet(PATH, storage_options=NO_READAHEAD)
print("Pandas/S3FS (no readahead):", time.monotonic() - start, "seconds")

start = time.monotonic()
# Note: pq.read_pandas returns a pyarrow.Table, not a DataFrame.
table = pq.read_pandas(PATH)
print("PyArrow:", time.monotonic() - start, "seconds")

start = time.monotonic()
table = pq.read_pandas(PATH, pre_buffer=True)
print("PyArrow (pre-buffer):", time.monotonic() - start, "seconds")

start = time.monotonic()
# Extra keyword arguments are forwarded to the pyarrow engine.
df = pd.read_parquet(PATH, pre_buffer=True)
print("Pandas/S3FS (pre-buffer):", time.monotonic() - start, "seconds")

start = time.monotonic()
df = pd.read_parquet(PATH, storage_options=NO_READAHEAD, pre_buffer=True)
print("Pandas/S3FS (pre-buffer, no readahead):", time.monotonic() - start, "seconds")
{code}
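For reference, the pre_buffer flag is also exposed directly on pyarrow.parquet.read_table. Here's a minimal, self-contained sketch of the API shape against a local file (the latency benefit only materializes on high-latency filesystems such as S3; the file name and column contents here are just illustrative):

{code:python}
import os
import tempfile

import pyarrow as pa
import pyarrow.parquet as pq

# Write a small local Parquet file so the example is runnable anywhere.
table = pa.table({"x": list(range(1000)), "y": [float(i) for i in range(1000)]})
path = os.path.join(tempfile.mkdtemp(), "data.parquet")
pq.write_table(table, path)

# pre_buffer=True coalesces column-chunk reads into fewer, larger
# buffered I/O requests before decoding.
result = pq.read_table(path, pre_buffer=True)
assert result.num_rows == 1000
assert result.column_names == ["x", "y"]
{code}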

> [Python] pyarrow.parquet.read_* should use pre_buffer=True
> ----------------------------------------------------------
>
>                 Key: ARROW-12428
>                 URL: https://issues.apache.org/jira/browse/ARROW-12428
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: David Li
>            Assignee: David Li
>            Priority: Major
>             Fix For: 5.0.0
>
>
> If the user is synchronously reading a single file, we should try to read it as fast as possible. The one open question is whether it's beneficial to enable this regardless of the filesystem, or whether we should enable it only on high-latency filesystems.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)