
Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2022/10/07 09:16:00 UTC

[jira] [Commented] (ARROW-17961) Add read/write optimization for pyarrow.fs.S3FileSystem

    [ https://issues.apache.org/jira/browse/ARROW-17961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17613978#comment-17613978 ] 

Antoine Pitrou commented on ARROW-17961:
----------------------------------------

Which "optimization" is that?

> Add read/write optimization for pyarrow.fs.S3FileSystem
> -------------------------------------------------------
>
>                 Key: ARROW-17961
>                 URL: https://issues.apache.org/jira/browse/ARROW-17961
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Volker Lorrmann
>            Priority: Minor
>
> I found large differences in loading time when reading data from AWS S3 using {{pyarrow.fs.S3FileSystem}} compared to {{s3fs.S3FileSystem}}. See the example below.
> The difference comes from an {{s3fs}} optimization that {{pyarrow.fs}} is not (yet) using.
> {code:python}
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
> import pyarrow.fs as pafs
> import s3fs
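> # load_credentials is the reporter's own helper module (not shown); assumed to expose a function of the same name
> from load_credentials import load_credentials
> credentials = load_credentials()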
> path = "path/to/data"  # folder with about 300 small (~10 kB) files
> fs1 = s3fs.S3FileSystem(
>     anon=False,
>     key=credentials["accessKeyId"],
>     secret=credentials["secretAccessKey"],
>     token=credentials["sessionToken"],
> )
> fs2 = pafs.S3FileSystem(
>     access_key=credentials["accessKeyId"],
>     secret_key=credentials["secretAccessKey"],
>     session_token=credentials["sessionToken"],
> )
> _ = ds.dataset(path, filesystem=fs1).to_table()  # takes about 5 seconds
> _ = ds.dataset(path, filesystem=fs2).to_table()  # takes about 25 seconds
> _ = pq.read_table(path, filesystem=fs1)  # takes about 5 seconds
> _ = pq.read_table(path, filesystem=fs2)  # takes about 10 seconds
> {code}
>  
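
The timings quoted above are given as comments rather than measured in the script. A minimal sketch of how they could be measured consistently follows; `time_call` and `compare` are hypothetical helper names that do not come from the issue, and only the standard library is used.

```python
import time
from typing import Any, Callable, Dict, Tuple


def time_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[float, Any]:
    """Run fn once and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return time.perf_counter() - start, result


def compare(readers: Dict[str, Callable[[], Any]], repeat: int = 3) -> Dict[str, float]:
    """Return the best (minimum) elapsed time for each labelled zero-argument callable."""
    return {
        label: min(time_call(reader)[0] for _ in range(repeat))
        for label, reader in readers.items()
    }
```

With the names from the quoted example, this could be called as `compare({"s3fs": lambda: ds.dataset(path, filesystem=fs1).to_table(), "pyarrow.fs": lambda: ds.dataset(path, filesystem=fs2).to_table()})`. Repeated reads may be affected by caching on either side, so the minimum is only a rough indicator.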


