Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2022/10/07 09:16:00 UTC
[jira] [Commented] (ARROW-17961) Add read/write optimization for pyarrow.fs.S3FileSystem
[ https://issues.apache.org/jira/browse/ARROW-17961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17613978#comment-17613978 ]
Antoine Pitrou commented on ARROW-17961:
----------------------------------------
Which "optimization" is that?
> Add read/write optimization for pyarrow.fs.S3FileSystem
> -------------------------------------------------------
>
> Key: ARROW-17961
> URL: https://issues.apache.org/jira/browse/ARROW-17961
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Volker Lorrmann
> Priority: Minor
>
> I found large differences in loading time when reading data from AWS S3 with {{pyarrow.fs.S3FileSystem}} compared to {{s3fs.S3FileSystem}}. See the example below.
> The difference comes from optimizations in {{s3fs}} that {{pyarrow.fs}} does not (yet) use.
> {code:python}
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
> import pyarrow.fs as pafs
> import s3fs
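> from load_credentials import load_credentials  # reporter's own helper (not shown); assumed to expose a function of the same name
> credentials = load_credentials()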
> path = "path/to/data" # folder with about 300 small (~10kb) files
> fs1 = s3fs.S3FileSystem(
>     anon=False,
>     key=credentials["accessKeyId"],
>     secret=credentials["secretAccessKey"],
>     token=credentials["sessionToken"],
> )
> fs2 = pafs.S3FileSystem(
>     access_key=credentials["accessKeyId"],
>     secret_key=credentials["secretAccessKey"],
>     session_token=credentials["sessionToken"],
> )
> _ = ds.dataset(path, filesystem=fs1).to_table() # takes about 5 seconds
> _ = ds.dataset(path, filesystem=fs2).to_table() # takes about 25 seconds
> _ = pq.read_table(path, filesystem=fs1) # takes about 5 seconds
> _ = pq.read_table(path, filesystem=fs2) # takes about 10 seconds
> {code}
>
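The timings quoted above are single runs, so they may include one-time costs and network noise. A minimal sketch of a repeatable comparison follows. The helper names (time_call, compare, build_readers) are hypothetical and not part of pyarrow or s3fs; build_readers assumes the path, fs1, and fs2 objects from the quoted example.

```python
import time
from statistics import median


def time_call(fn, repeat=3):
    """Run fn() `repeat` times; return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def compare(readers, repeat=3):
    """Time each named zero-argument callable; return {name: median seconds}."""
    return {name: median(time_call(fn, repeat)) for name, fn in readers.items()}


def build_readers(path, fs1, fs2):
    """Wrap the four reads from the report as zero-argument callables.

    pyarrow is imported lazily so the timing helpers above stay stdlib-only.
    fs1 is assumed to be the s3fs filesystem, fs2 the pyarrow.fs one.
    """
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    return {
        "dataset + s3fs": lambda: ds.dataset(path, filesystem=fs1).to_table(),
        "dataset + pyarrow.fs": lambda: ds.dataset(path, filesystem=fs2).to_table(),
        "read_table + s3fs": lambda: pq.read_table(path, filesystem=fs1),
        "read_table + pyarrow.fs": lambda: pq.read_table(path, filesystem=fs2),
    }
```

With the objects from the report in scope, compare(build_readers(path, fs1, fs2)) would return one median duration per combination, which makes it easier to see whether the gap persists across repeated runs.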