Posted to jira@arrow.apache.org by "Volker Lorrmann (Jira)" <ji...@apache.org> on 2022/10/07 09:10:00 UTC

[jira] [Created] (ARROW-17961) Add read/write optimization for pyarrow.fs.S3FileSystem

Volker Lorrmann created ARROW-17961:
---------------------------------------

             Summary: Add read/write optimization for pyarrow.fs.S3FileSystem
                 Key: ARROW-17961
                 URL: https://issues.apache.org/jira/browse/ARROW-17961
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Volker Lorrmann


I found large differences in loading time when loading data from AWS S3 using {{pyarrow.fs.S3FileSystem}} compared to {{s3fs.S3FileSystem}}. See the example below.

The difference comes from optimizations in {{s3fs}} that {{pyarrow.fs}} is not (yet) using.
{code:python}
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import pyarrow.fs as pafs
import s3fs
from load_credentials import load_credentials  # local helper that returns AWS credentials as a dict

credentials = load_credentials()
path = "path/to/data"  # folder with about 300 small (~10 kB) files

fs1 = s3fs.S3FileSystem(
    anon=False,
    key=credentials["accessKeyId"],
    secret=credentials["secretAccessKey"],
    token=credentials["sessionToken"],
)

fs2 = pafs.S3FileSystem(
    access_key=credentials["accessKeyId"],
    secret_key=credentials["secretAccessKey"],
    session_token=credentials["sessionToken"],
)

_ = ds.dataset(path, filesystem=fs1).to_table() # takes about 5 seconds

_ = ds.dataset(path, filesystem=fs2).to_table() # takes about 25 seconds

_ = pq.read_table(path, filesystem=fs1) # takes about 5 seconds

_ = pq.read_table(path, filesystem=fs2) # takes about 10 seconds
{code}
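As a possible workaround while {{pyarrow.fs.S3FileSystem}} lacks these optimizations, enabling Parquet pre-buffering lets the dataset scanner coalesce many small column reads into fewer large requests, which tends to help most against high-latency stores such as S3. The sketch below demonstrates the option on local files only, since the S3 paths and credentials from the report are not reproducible here; the timing claims above are from the report, not re-measured.
{code:python}
import os
import tempfile

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

tmp = tempfile.mkdtemp()

# Write a handful of small Parquet files, mimicking the ~300-file layout
# described in the report (scaled down for the example).
for i in range(5):
    pq.write_table(
        pa.table({"x": [i] * 10}),
        os.path.join(tmp, f"part-{i}.parquet"),
    )

# pre_buffer=True turns on read coalescing in the Parquet scanner.
fmt = ds.ParquetFileFormat(
    default_fragment_scan_options=ds.ParquetFragmentScanOptions(pre_buffer=True)
)
table = ds.dataset(tmp, format=fmt).to_table()
print(table.num_rows)  # 50
{code}
Another option is to keep using {{s3fs}} for I/O but expose it to Arrow through {{pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(fs1))}}, which lets {{ds.dataset}} benefit from the s3fs-side optimizations.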
--
This message was sent by Atlassian Jira
(v8.20.10#820010)