Posted to issues@arrow.apache.org by "Sahil Gupta (Jira)" <ji...@apache.org> on 2022/04/21 18:35:00 UTC
[jira] [Created] (ARROW-16272) Poor read performance of S3FileSystem.open_input_file when used with `pd.read_csv`
Sahil Gupta created ARROW-16272:
-----------------------------------
Summary: Poor read performance of S3FileSystem.open_input_file when used with `pd.read_csv`
Key: ARROW-16272
URL: https://issues.apache.org/jira/browse/ARROW-16272
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 7.0.0, 5.0.0, 4.0.1
Environment: MacOS 12.1
MacBook Pro
Intel x86
Reporter: Sahil Gupta
`pyarrow.fs.S3FileSystem.open_input_file` and `pyarrow.fs.S3FileSystem.open_input_stream` perform very poorly when used with pandas' `read_csv`.
```python
import pandas as pd
import time
from pyarrow.fs import S3FileSystem


def load_parking_tickets():
    print("Running...")
    t0 = time.time()
    fs = S3FileSystem(
        anonymous=True,
        region="us-east-2",
        endpoint_override=None,
        proxy_options=None,
    )
    print("Time to create fs: ", time.time() - t0)

    t0 = time.time()
    # fhandler = fs.open_input_stream(
    #     "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    # )
    fhandler = fs.open_input_file(
        "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    )
    print("Time to create fhandler: ", time.time() - t0)

    t0 = time.time()
    year_2016_df = pd.read_csv(
        fhandler,
        nrows=100,
    )
    print("read time:", time.time() - t0)
    return year_2016_df


t0 = time.time()
load_parking_tickets()
print("total time:", time.time() - t0)
```
Output:
```shell
Running...
Time to create fs: 0.0003612041473388672
Time to create fhandler: 0.22461509704589844
read time: 105.76488208770752
total time: 105.99135684967041
```
This is with `pandas==1.4.2`.
Performance is similar with `fs.open_input_stream` as well (commented out in the code above):
```shell
Running...
Time to create fs: 0.0002570152282714844
Time to create fhandler: 0.18540692329406738
read time: 186.8419930934906
total time: 187.03169012069702
```
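One possible mitigation I could try is enabling read buffering on the stream. Below is a minimal, untested sketch assuming `open_input_stream`'s `buffer_size` argument batches the small reads into larger S3 requests as documented (the 4 MiB value is an arbitrary choice); I have not benchmarked it as part of this report:
```python
# Untested sketch: same reproduction, but with a buffered input stream.
# Assumption: buffer_size (bytes) makes pyarrow batch small reads into
# larger S3 range requests; 4 MiB is an arbitrary choice.
import pandas as pd
from pyarrow.fs import S3FileSystem

fs = S3FileSystem(anonymous=True, region="us-east-2")
fhandler = fs.open_input_stream(
    "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    buffer_size=4 * 1024 * 1024,
)
year_2016_df = pd.read_csv(fhandler, nrows=100)
```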
When running it with just pandas (which uses `s3fs` under the hood), it's much faster:
```python
import pandas as pd
import time


def load_parking_tickets():
    print("Running...")
    t0 = time.time()
    year_2016_df = pd.read_csv(
        "s3://bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
        nrows=100,
    )
    print("read time:", time.time() - t0)
    return year_2016_df


t0 = time.time()
load_parking_tickets()
print("total time:", time.time() - t0)
```
Output:
```shell
Running...
read time: 1.1012001037597656
total time: 1.101264238357544
```
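For completeness, this is roughly what pandas does under the hood in the fast case, written out explicitly with `s3fs` (a sketch; I did not time this variant separately):
```python
# Sketch of the explicit s3fs equivalent of the fast pandas path above.
import pandas as pd
import s3fs

fs = s3fs.S3FileSystem(anon=True)
with fs.open(
    "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    "rb",
) as f:
    year_2016_df = pd.read_csv(f, nrows=100)
```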
Surprisingly, when we use `fsspec`'s `ArrowFSWrapper`, it matches `s3fs` performance:
```python
import pandas as pd
import time
from pyarrow.fs import S3FileSystem
from fsspec.implementations.arrow import ArrowFSWrapper


def load_parking_tickets():
    print("Running...")
    t0 = time.time()
    fs = ArrowFSWrapper(
        S3FileSystem(
            anonymous=True,
            region="us-east-2",
            endpoint_override=None,
            proxy_options=None,
        )
    )
    print("Time to create fs: ", time.time() - t0)

    t0 = time.time()
    fhandler = fs._open(
        "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    )
    print("Time to create fhandler: ", time.time() - t0)

    t0 = time.time()
    year_2016_df = pd.read_csv(
        fhandler,
        nrows=100,
    )
    print("read time:", time.time() - t0)
    return year_2016_df


t0 = time.time()
load_parking_tickets()
print("total time:", time.time() - t0)
```
Output:
```shell
Running...
Time to create fs: 0.0002467632293701172
Time to create fhandler: 0.1858382225036621
read time: 0.13701486587524414
total time: 0.3232450485229492
```
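My guess is that the difference comes down to how many separate read calls end up hitting S3, but I have not confirmed that. If it helps with triage, a rough micro-benchmark along these lines should show whether per-read overhead on the raw handle is the culprit (the read sizes are arbitrary assumptions, not part of the timings above):
```python
# Rough sketch: compare many small reads vs. one large read on the raw
# pyarrow handle (sizes are arbitrary; not part of the original timings).
import time
from pyarrow.fs import S3FileSystem

path = "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv"
fs = S3FileSystem(anonymous=True, region="us-east-2")

t0 = time.time()
with fs.open_input_file(path) as f:
    for _ in range(100):
        f.read(64 * 1024)  # 100 reads of 64 KiB each
print("100 small reads:", time.time() - t0)

t0 = time.time()
with fs.open_input_file(path) as f:
    f.read(100 * 64 * 1024)  # one ~6.4 MiB read
print("one large read:", time.time() - t0)
```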
Packages:
```
pyarrow: 7.0.0
pandas : 1.4.2
numpy  : 1.20.3
```