Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2022/04/21 19:36:00 UTC
[jira] [Assigned] (ARROW-16272) [C++][Python] Poor read performance of S3FileSystem.open_input_file when used with `pd.read_csv`
[ https://issues.apache.org/jira/browse/ARROW-16272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoine Pitrou reassigned ARROW-16272:
--------------------------------------
Assignee: Antoine Pitrou
> [C++][Python] Poor read performance of S3FileSystem.open_input_file when used with `pd.read_csv`
> ------------------------------------------------------------------------------------------------
>
> Key: ARROW-16272
> URL: https://issues.apache.org/jira/browse/ARROW-16272
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 4.0.1, 5.0.0, 7.0.0
> Environment: MacOS 12.1
> MacBook Pro
> Intel x86
> Reporter: Sahil Gupta
> Assignee: Antoine Pitrou
> Priority: Major
> Labels: S3FileSystem, csv, pandas, s3
> Fix For: 9.0.0
>
>
> `pyarrow.fs.S3FileSystem.open_input_file` and `pyarrow.fs.S3FileSystem.open_input_stream` perform very poorly when used with Pandas' `read_csv`.
> {code:python}
> import pandas as pd
> import time
> from pyarrow.fs import S3FileSystem
>
> def load_parking_tickets():
>     print("Running...")
>     t0 = time.time()
>     fs = S3FileSystem(
>         anonymous=True,
>         region="us-east-2",
>         endpoint_override=None,
>         proxy_options=None,
>     )
>     print("Time to create fs: ", time.time() - t0)
>     t0 = time.time()
>     # fhandler = fs.open_input_stream(
>     #     "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
>     # )
>     fhandler = fs.open_input_file(
>         "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
>     )
>     print("Time to create fhandler: ", time.time() - t0)
>     t0 = time.time()
>     year_2016_df = pd.read_csv(
>         fhandler,
>         nrows=100,
>     )
>     print("read time:", time.time() - t0)
>     return year_2016_df
>
> t0 = time.time()
> load_parking_tickets()
> print("total time:", time.time() - t0)
> {code}
> Output:
> {code}
> Running...
> Time to create fs: 0.0003612041473388672
> Time to create fhandler: 0.22461509704589844
> read time: 105.76488208770752
> total time: 105.99135684967041
> {code}
> This is with `pandas==1.4.2`.
> I get similarly poor performance with `fs.open_input_stream` (commented out in the code above):
> {code}
> Running...
> Time to create fs: 0.0002570152282714844
> Time to create fhandler: 0.18540692329406738
> read time: 186.8419930934906
> total time: 187.03169012069702
> {code}
> When running it with just pandas (which uses `s3fs` under the hood), it's much faster:
> {code:python}
> import pandas as pd
> import time
>
> def load_parking_tickets():
>     print("Running...")
>     t0 = time.time()
>     year_2016_df = pd.read_csv(
>         "s3://bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
>         nrows=100,
>     )
>     print("read time:", time.time() - t0)
>     return year_2016_df
>
> t0 = time.time()
> load_parking_tickets()
> print("total time:", time.time() - t0)
> {code}
> Output:
> {code}
> Running...
> read time: 1.1012001037597656
> total time: 1.101264238357544
> {code}
> Surprisingly, when we use `fsspec`'s `ArrowFSWrapper`, it matches `s3fs` performance:
> {code:python}
> import pandas as pd
> import time
> from pyarrow.fs import S3FileSystem
> from fsspec.implementations.arrow import ArrowFSWrapper
>
> def load_parking_tickets():
>     print("Running...")
>     t0 = time.time()
>     fs = ArrowFSWrapper(
>         S3FileSystem(
>             anonymous=True,
>             region="us-east-2",
>             endpoint_override=None,
>             proxy_options=None,
>         )
>     )
>     print("Time to create fs: ", time.time() - t0)
>     t0 = time.time()
>     fhandler = fs._open(
>         "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
>     )
>     print("Time to create fhandler: ", time.time() - t0)
>     t0 = time.time()
>     year_2016_df = pd.read_csv(
>         fhandler,
>         nrows=100,
>     )
>     print("read time:", time.time() - t0)
>     return year_2016_df
>
> t0 = time.time()
> load_parking_tickets()
> print("total time:", time.time() - t0)
> {code}
> Output:
> {code}
> Running...
> Time to create fs: 0.0002467632293701172
> Time to create fhandler: 0.1858382225036621
> read time: 0.13701486587524414
> total time: 0.3232450485229492
> {code}
> Packages:
> {code}
> pyarrow=7.0.0
> pandas : 1.4.2
> numpy : 1.20.3
> {code}
> I tested it with 4.0.1, 5.0.0 as well and saw similar results.
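
Editorial note (not part of the original report): the report does not establish why the same S3 object reads quickly through one file object and slowly through another. One way to narrow it down might be to count how many reads, and how many bytes, a consumer actually requests from the underlying file. The sketch below is a minimal illustration of that idea using only the standard library. `CountingRaw` and `read_first_lines` are hypothetical names, an in-memory `io.BytesIO` stands in for the S3 file, and line-by-line reading stands in for `pd.read_csv(..., nrows=100)`. It says nothing about what pyarrow or pandas do internally.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical helper: wrap a binary file-like object and record each read."""

    def __init__(self, inner):
        super().__init__()
        self._inner = inner
        self.calls = 0        # number of read requests that reached the inner file
        self.bytes_read = 0   # total bytes actually pulled from the inner file

    def readable(self):
        return True

    def readinto(self, buf):
        # Serve the request from the wrapped object and keep simple statistics.
        data = self._inner.read(len(buf))
        n = len(data)
        buf[:n] = data
        self.calls += 1
        self.bytes_read += n
        return n


def read_first_lines(raw, nlines, buffer_size):
    """Read `nlines` text lines through a buffered text layer on top of `raw`."""
    buffered = io.BufferedReader(raw, buffer_size=buffer_size)
    text = io.TextIOWrapper(buffered, encoding="utf-8")
    return [text.readline() for _ in range(nlines)]


# Stand-in for a large remote CSV: 100,000 short rows held in memory.
payload = b"".join(b"%d,ticket\n" % i for i in range(100_000))
raw = CountingRaw(io.BytesIO(payload))
lines = read_first_lines(raw, 100, buffer_size=64 * 1024)

print("lines read:", len(lines))
print("read calls:", raw.calls)
print("bytes pulled:", raw.bytes_read, "of", len(payload))
```

If a similar counter were placed around the real file handle, a byte count close to the full object size for an `nrows=100` read would suggest the whole file is being fetched, while a very large number of tiny reads would suggest per-request latency dominates. Either reading is an assumption to be tested, not a conclusion drawn from this report.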
--
This message was sent by Atlassian Jira
(v8.20.7#820007)