Posted to jira@arrow.apache.org by "David Li (Jira)" <ji...@apache.org> on 2022/04/21 19:12:00 UTC
[jira] [Commented] (ARROW-16272) [C++][Python] Poor read performance of S3FileSystem.open_input_file when used with `pd.read_csv`

    [ https://issues.apache.org/jira/browse/ARROW-16272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526000#comment-17526000 ] 

David Li commented on ARROW-16272:
----------------------------------

Off the top of my head, this is possibly because s3fs adds some readahead by default, which helps CSV a lot, and PyArrow's filesystem does not. PyArrow's CSV reader doesn't really need readahead since it's multithreaded (which effectively gives readahead), but pandas' CSV reader may not read ahead on its own.
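
To illustrate why readahead could matter this much, here is a minimal sketch using only the standard library. It is not s3fs or PyArrow code: `CountingRaw` and `count_raw_reads` are hypothetical helpers, and the in-memory byte string stands in for a remote object where every raw read would be a network round trip. It compares how many raw reads a consumer issuing small reads triggers with and without a buffering layer.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical stand-in for a high-latency remote file; counts raw reads."""

    def __init__(self, data: bytes):
        self._data = data
        self._pos = 0
        self.raw_reads = 0  # each of these would be a round trip on S3

    def readable(self):
        return True

    def readinto(self, b):
        self.raw_reads += 1
        chunk = self._data[self._pos:self._pos + len(b)]
        b[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


def count_raw_reads(stream, raw, request_size=64):
    """Drain `stream` with small reads and report how many raw reads resulted."""
    while stream.read(request_size):
        pass
    return raw.raw_reads


data = b"a,b,c\n" * 10_000  # 60,000 bytes of CSV-like content

# No buffering: every small read goes straight to the "remote" file.
raw_unbuffered = CountingRaw(data)
unbuffered = count_raw_reads(raw_unbuffered, raw_unbuffered)

# With a 64 KiB buffer: the first read pulls a large block, later reads hit memory.
raw_buffered = CountingRaw(data)
buffered = count_raw_reads(
    io.BufferedReader(raw_buffered, buffer_size=1 << 16), raw_buffered
)

print("raw reads without buffering:", unbuffered)
print("raw reads with buffering:", buffered)
```

If something similar is wanted with PyArrow directly, `open_input_stream` accepts a `buffer_size` argument; whether that closes the gap for this particular case is untested here.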

> [C++][Python] Poor read performance of S3FileSystem.open_input_file when used with `pd.read_csv`
> ------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-16272
>                 URL: https://issues.apache.org/jira/browse/ARROW-16272
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 4.0.1, 5.0.0, 7.0.0
>         Environment: MacOS 12.1
> MacBook Pro
> Intel x86
>            Reporter: Sahil Gupta
>            Priority: Major
>              Labels: S3FileSystem, csv, pandas, s3
>
> `pyarrow.fs.S3FileSystem.open_input_file` and `pyarrow.fs.S3FileSystem.open_input_stream` perform very poorly when used with Pandas' `read_csv`.
> {code:python}
> import pandas as pd
> import time
> from pyarrow.fs import S3FileSystem
> def load_parking_tickets():
>     print("Running...")
>     t0 = time.time()
>     fs = S3FileSystem(
>         anonymous=True,
>         region="us-east-2",
>         endpoint_override=None,
>         proxy_options=None,
>     )
>     print("Time to create fs: ", time.time() - t0)
>     t0 = time.time()
>     # fhandler = fs.open_input_stream(
>     #     "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
>     # )
>     fhandler = fs.open_input_file(
>         "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
>     )
>     print("Time to create fhandler: ", time.time() - t0)
>     t0 = time.time()
>     year_2016_df = pd.read_csv(
>         fhandler,
>         nrows=100,
>     )
>     print("read time:", time.time() - t0)
>     return year_2016_df
> t0 = time.time()
> load_parking_tickets()
> print("total time:", time.time() - t0)
> {code}
> Output:
> {code}
> Running...
> Time to create fs:  0.0003612041473388672
> Time to create fhandler:  0.22461509704589844
> read time: 105.76488208770752
> total time: 105.99135684967041
> {code}
> This is with `pandas==1.4.2`.
> Getting similar performance with `fs.open_input_stream` as well (commented out in the code).
> {code}
> Running...
> Time to create fs:  0.0002570152282714844
> Time to create fhandler:  0.18540692329406738
> read time: 186.8419930934906
> total time: 187.03169012069702
> {code}
> When running it with just pandas (which uses `s3fs` under the hood), it's much faster:
> {code:python}
> import pandas as pd
> import time
> def load_parking_tickets():
>     print("Running...")
>     t0 = time.time()
>     year_2016_df = pd.read_csv(
>         "s3://bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
>         nrows=100,
>     )
>     print("read time:", time.time() - t0)
>     return year_2016_df
> t0 = time.time()
> load_parking_tickets()
> print("total time:", time.time() - t0)
> {code}
> Output:
> {code}
> Running...
> read time: 1.1012001037597656
> total time: 1.101264238357544
> {code}
> Surprisingly, when we use `fsspec`'s `ArrowFSWrapper`, it matches s3fs performance:
> {code:python}
> import pandas as pd
> import time
> from pyarrow.fs import S3FileSystem
> from fsspec.implementations.arrow import ArrowFSWrapper
> def load_parking_tickets():
>     print("Running...")
>     t0 = time.time()
>     fs = ArrowFSWrapper(
>         S3FileSystem(
>             anonymous=True,
>             region="us-east-2",
>             endpoint_override=None,
>             proxy_options=None,
>         )
>     )
>     print("Time to create fs: ", time.time() - t0)
>     t0 = time.time()
>     fhandler = fs._open(
>         "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
>     )
>     print("Time to create fhandler: ", time.time() - t0)
>     t0 = time.time()
>     year_2016_df = pd.read_csv(
>         fhandler,
>         nrows=100,
>     )
>     print("read time:", time.time() - t0)
>     return year_2016_df
> t0 = time.time()
> load_parking_tickets()
> print("total time:", time.time() - t0)
> {code}
> Output:
> {code}
> Running...
> Time to create fs:  0.0002467632293701172
> Time to create fhandler:  0.1858382225036621
> read time: 0.13701486587524414
> total time: 0.3232450485229492
> {code}
> Packages:
> {code}
> pyarrow=7.0.0
> pandas : 1.4.2
> numpy : 1.20.3
> {code}
> I tested it with 4.0.1, 5.0.0 as well and saw similar results.


