Posted to issues@arrow.apache.org by "Sahil Gupta (Jira)" <ji...@apache.org> on 2022/04/21 18:35:00 UTC
[jira] [Created] (ARROW-16272) Poor read performance of S3FileSystem.open_input_file when used with `pd.read_csv`
Sahil Gupta created ARROW-16272:
-----------------------------------
Summary: Poor read performance of S3FileSystem.open_input_file when used with `pd.read_csv`
Key: ARROW-16272
URL: https://issues.apache.org/jira/browse/ARROW-16272
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 7.0.0, 5.0.0, 4.0.1
Environment: MacOS 12.1
MacBook Pro
Intel x86
Reporter: Sahil Gupta
`pyarrow.fs.S3FileSystem.open_input_file` and `pyarrow.fs.S3FileSystem.open_input_stream` perform very poorly when used with pandas' `read_csv`.
```python
import pandas as pd
import time
from pyarrow.fs import S3FileSystem


def load_parking_tickets():
    print("Running...")
    t0 = time.time()
    fs = S3FileSystem(
        anonymous=True,
        region="us-east-2",
        endpoint_override=None,
        proxy_options=None,
    )
    print("Time to create fs: ", time.time() - t0)

    t0 = time.time()
    # fhandler = fs.open_input_stream(
    #     "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    # )
    fhandler = fs.open_input_file(
        "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    )
    print("Time to create fhandler: ", time.time() - t0)

    t0 = time.time()
    year_2016_df = pd.read_csv(
        fhandler,
        nrows=100,
    )
    print("read time:", time.time() - t0)
    return year_2016_df


t0 = time.time()
load_parking_tickets()
print("total time:", time.time() - t0)
```
Output:
```shell
Running...
Time to create fs: 0.0003612041473388672
Time to create fhandler: 0.22461509704589844
read time: 105.76488208770752
total time: 105.99135684967041
```
This is with `pandas==1.4.2`.
Performance is similar with `fs.open_input_stream` as well (commented out in the code above):
```shell
Running...
Time to create fs: 0.0002570152282714844
Time to create fhandler: 0.18540692329406738
read time: 186.8419930934906
total time: 187.03169012069702
```
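One possible mitigation I could try is enabling read buffering on the stream. Below is a minimal, untested sketch assuming `open_input_stream`'s `buffer_size` argument batches the small reads into larger S3 requests as documented (the 4 MiB value is an arbitrary choice); I have not benchmarked it as part of this report:
```python
# Untested sketch: same reproduction, but with a buffered input stream.
# Assumption: buffer_size (bytes) makes pyarrow batch small reads into
# larger S3 range requests; 4 MiB is an arbitrary choice.
import pandas as pd
from pyarrow.fs import S3FileSystem

fs = S3FileSystem(anonymous=True, region="us-east-2")
fhandler = fs.open_input_stream(
    "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    buffer_size=4 * 1024 * 1024,
)
year_2016_df = pd.read_csv(fhandler, nrows=100)
```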
When running it with just pandas (which uses `s3fs` under the hood), it's much faster:
```python
import pandas as pd
import time


def load_parking_tickets():
    print("Running...")
    t0 = time.time()
    year_2016_df = pd.read_csv(
        "s3://bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
        nrows=100,
    )
    print("read time:", time.time() - t0)
    return year_2016_df


t0 = time.time()
load_parking_tickets()
print("total time:", time.time() - t0)
```
Output:
```shell
Running...
read time: 1.1012001037597656
total time: 1.101264238357544
```
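For completeness, this is roughly what pandas does under the hood in the fast case, written out explicitly with `s3fs` (a sketch; I did not time this variant separately):
```python
# Sketch of the explicit s3fs equivalent of the fast pandas path above.
import pandas as pd
import s3fs

fs = s3fs.S3FileSystem(anon=True)
with fs.open(
    "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    "rb",
) as f:
    year_2016_df = pd.read_csv(f, nrows=100)
```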
Surprisingly, when we use `fsspec`'s `ArrowFSWrapper`, it matches `s3fs` performance:
```python
import pandas as pd
import time
from pyarrow.fs import S3FileSystem
from fsspec.implementations.arrow import ArrowFSWrapper


def load_parking_tickets():
    print("Running...")
    t0 = time.time()
    fs = ArrowFSWrapper(
        S3FileSystem(
            anonymous=True,
            region="us-east-2",
            endpoint_override=None,
            proxy_options=None,
        )
    )
    print("Time to create fs: ", time.time() - t0)

    t0 = time.time()
    fhandler = fs._open(
        "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    )
    print("Time to create fhandler: ", time.time() - t0)

    t0 = time.time()
    year_2016_df = pd.read_csv(
        fhandler,
        nrows=100,
    )
    print("read time:", time.time() - t0)
    return year_2016_df


t0 = time.time()
load_parking_tickets()
print("total time:", time.time() - t0)
```
Output:
```shell
Running...
Time to create fs: 0.0002467632293701172
Time to create fhandler: 0.1858382225036621
read time: 0.13701486587524414
total time: 0.3232450485229492
```
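My guess is that the difference comes down to how many separate read calls end up hitting S3, but I have not confirmed that. If it helps with triage, a rough micro-benchmark along these lines should show whether per-read overhead on the raw handle is the culprit (the read sizes are arbitrary assumptions, not part of the timings above):
```python
# Rough sketch: compare many small reads vs. one large read on the raw
# pyarrow handle (sizes are arbitrary; not part of the original timings).
import time
from pyarrow.fs import S3FileSystem

path = "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv"
fs = S3FileSystem(anonymous=True, region="us-east-2")

t0 = time.time()
with fs.open_input_file(path) as f:
    for _ in range(100):
        f.read(64 * 1024)  # 100 reads of 64 KiB each
print("100 small reads:", time.time() - t0)

t0 = time.time()
with fs.open_input_file(path) as f:
    f.read(100 * 64 * 1024)  # one ~6.4 MiB read
print("one large read:", time.time() - t0)
```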
Packages:
```
pyarrow: 7.0.0
pandas : 1.4.2
numpy  : 1.20.3
```