Posted to jira@arrow.apache.org by "Volker Lorrmann (Jira)" <ji...@apache.org> on 2022/10/07 09:10:00 UTC
[jira] [Created] (ARROW-17961) Add read/write optimization for pyarrow.fs.S3FileSystem
Volker Lorrmann created ARROW-17961:
---------------------------------------
Summary: Add read/write optimization for pyarrow.fs.S3FileSystem
Key: ARROW-17961
URL: https://issues.apache.org/jira/browse/ARROW-17961
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Volker Lorrmann
I found large differences in loading times when loading data from AWS S3 using {{pyarrow.fs.S3FileSystem}} compared to {{s3fs.S3FileSystem}}; see the example below.
The difference comes from optimizations in {{s3fs}} that {{pyarrow.fs}} is not (yet) using.
{code:python}
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import pyarrow.fs as pafs
import s3fs

from load_credentials import load_credentials  # local helper of the reporter

credentials = load_credentials()
path = "path/to/data"  # folder with about 300 small (~10 kB) files

# s3fs (fsspec-based) filesystem
fs1 = s3fs.S3FileSystem(
    anon=False,
    key=credentials["accessKeyId"],
    secret=credentials["secretAccessKey"],
    token=credentials["sessionToken"],
)
# native pyarrow filesystem
fs2 = pafs.S3FileSystem(
    access_key=credentials["accessKeyId"],
    secret_key=credentials["secretAccessKey"],
    session_token=credentials["sessionToken"],
)

_ = ds.dataset(path, filesystem=fs1).to_table()  # takes about 5 seconds
_ = ds.dataset(path, filesystem=fs2).to_table()  # takes about 25 seconds
_ = pq.read_table(path, filesystem=fs1)  # takes about 5 seconds
_ = pq.read_table(path, filesystem=fs2)  # takes about 10 seconds
{code}
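As an aside, until {{pyarrow.fs.S3FileSystem}} gains such optimizations, any fsspec filesystem (including {{s3fs}}) can be handed to pyarrow through {{pyarrow.fs.PyFileSystem}} and {{pyarrow.fs.FSSpecHandler}}, so dataset reads go through the fsspec implementation. The sketch below is illustrative only: it substitutes fsspec's in-memory filesystem for {{s3fs.S3FileSystem}} so it runs without AWS credentials; with S3 you would wrap the {{s3fs}} instance instead.

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
from pyarrow.fs import PyFileSystem, FSSpecHandler
from fsspec.implementations.memory import MemoryFileSystem

# MemoryFileSystem stands in for s3fs.S3FileSystem so this sketch
# runs locally; with S3, wrap the s3fs instance the same way.
mem_fs = MemoryFileSystem()
mem_fs.makedirs("bucket", exist_ok=True)
wrapped = PyFileSystem(FSSpecHandler(mem_fs))

# Write a small Parquet file through the wrapped filesystem ...
table = pa.table({"x": [1, 2, 3]})
with wrapped.open_output_stream("bucket/data.parquet") as f:
    pq.write_table(table, f)

# ... and read it back as a dataset, routed through the fsspec layer.
result = ds.dataset("bucket/data.parquet", filesystem=wrapped).to_table()
print(result.num_rows)  # 3
{code}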
--
This message was sent by Atlassian Jira
(v8.20.10#820010)