Posted to issues@arrow.apache.org by "JonnyWaffles (via GitHub)" <gi...@apache.org> on 2023/03/02 15:46:12 UTC
[GitHub] [arrow] JonnyWaffles opened a new issue, #34414: slow s3 parquet reads when using fsspec S3FileSystem
JonnyWaffles opened a new issue, #34414:
URL: https://github.com/apache/arrow/issues/34414
### Describe the bug, including details regarding any error messages, version, and platform.
Hi team, I hope you are well. Apologies if this is documented somewhere, I searched around but could not find anything related.
I am using PyArrow 10.0.1 inside an Ubuntu 20.04 container. For various reasons I need to use the third-party [fsspec S3FileSystem](https://github.com/fsspec/s3fs) in lieu of the native `pyarrow.fs.S3FileSystem`. When trying to read a single 220 MB Parquet file from S3 using the fsspec s3fs, the PyArrow Dataset API takes north of 3 minutes!
Using s3fs to read the bytes without PyArrow, or using the native PyArrow S3FileSystem, finishes within 16 seconds as expected. My understanding is that using `fsspec.S3FileSystem` should not degrade performance in PyArrow.
Test example
```python
import time

import pyarrow.dataset as ds
import pyarrow.fs as fs
import s3fs as s3fs_module

path = key = "mybucket/mypath/part-0.parquet"

# Native PyArrow S3 filesystem
pa_s3fs = fs.S3FileSystem()
time_start = time.perf_counter()
ds.dataset(path, format="parquet", filesystem=pa_s3fs).to_table()
print("pa S3Filesystem!: ", time.perf_counter() - time_start)

# fsspec S3 filesystem, reading the raw bytes without PyArrow
s3fs = s3fs_module.S3FileSystem()
time_start = time.perf_counter()
s3fs.read_bytes(key)
print("fsspec raw_bytes:!", time.perf_counter() - time_start)

# fsspec S3 filesystem passed to the PyArrow Dataset API
time_start = time.perf_counter()
ds.dataset(path, format="parquet", filesystem=s3fs).to_table()
print("PA w/ fsspec S3Filesystem!: ", time.perf_counter() - time_start)
```
Output
```
airflow-worker | Wrote key='mybucket/mypath/part-0.parquet' size=211661451
airflow-worker | Landed total 201 MB to s3!
airflow-worker | pa S3Filesystem!: 14.232619059999706
airflow-worker | fsspec raw_bytes:! 15.852554664001218
airflow-worker | PA w/ fsspec S3Filesystem!: 194.75488279599995
```
I found this passage in the PyArrow docs:
> pre_buffer (bool, default True)
> Coalesce and issue file reads in parallel to improve performance on high-latency filesystems (e.g. S3). If True, Arrow will use a background I/O thread pool. This option is only supported for use_legacy_dataset=False. If using a filesystem layer that itself performs readahead (e.g. fsspec's S3FS), disable readahead for best results.
So I tried setting `pre_buffer=False` but the performance issue remains.
Any idea what could be causing the performance issue when leveraging fsspec inside of Pyarrow?
### Component(s)
Python
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] assignUser commented on issue #34414: slow s3 parquet reads when using fsspec S3FileSystem
Posted by "assignUser (via GitHub)" <gi...@apache.org>.
assignUser commented on issue #34414:
URL: https://github.com/apache/arrow/issues/34414#issuecomment-1452590646
Hmm, is this expected behaviour @jorisvandenbossche?
[GitHub] [arrow] JonnyWaffles commented on issue #34414: slow s3 parquet reads when using fsspec S3FileSystem
Posted by "JonnyWaffles (via GitHub)" <gi...@apache.org>.
JonnyWaffles commented on issue #34414:
URL: https://github.com/apache/arrow/issues/34414#issuecomment-1452405066
Oh, I just wanted to add one more note: when using the dataset API with the above `fsspec` filesystem, it's much slower than `pq.read_table`
Example
```python
import time

import pyarrow.dataset as ds
import pyarrow.parquet as pq

def read_table_pq(s3fs):
    time_start = time.perf_counter()
    num_bytes = pq.read_table(s3_key, filesystem=s3fs).nbytes
    print(f"{num_bytes} Took {time.perf_counter() - time_start}")

def read_table_ds(s3fs):
    time_start = time.perf_counter()
    num_bytes = ds.dataset(s3_key, filesystem=s3fs, format="parquet").to_table().nbytes
    print(f"{num_bytes} Took {time.perf_counter() - time_start}")
```
Result
```
read_table_pq(s3fs1)
301340480 Took 25.171526359998097
read_table_ds(s3fs1)
301340480 Took 46.74551891900046
```
[GitHub] [arrow] xu2011 commented on issue #34414: [Python] slow s3 parquet reads when using fsspec S3FileSystem
Posted by "xu2011 (via GitHub)" <gi...@apache.org>.
xu2011 commented on issue #34414:
URL: https://github.com/apache/arrow/issues/34414#issuecomment-1455341709
I bumped into a similar issue using pd.read_parquet to read from S3. Read performance is slow compared to pq.read_table().to_pandas(), even though pandas.read_parquet makes the same function call underneath.
I found the issue could be caused by [_ensure_filesystem](https://github.com/apache/arrow/blob/d5b3b4737838774db658d3c488fcd3e72bc13f7e/python/pyarrow/parquet/core.py#L2431) in the _ParquetDataset class, which [checks the object type of the given filesystem](https://github.com/apache/arrow/blob/d5b3b4737838774db658d3c488fcd3e72bc13f7e/python/pyarrow/fs.py#L99). When the check fails, it reconstructs the filesystem with [PyFileSystem](https://github.com/apache/arrow/blob/d5b3b4737838774db658d3c488fcd3e72bc13f7e/python/pyarrow/fs.py#L119), which slows down the read.
With pq.read_table().to_pandas(), it will [parse the filesystem from the S3 path](https://github.com/apache/arrow/blob/d5b3b4737838774db658d3c488fcd3e72bc13f7e/python/pyarrow/parquet/core.py#L2459) and get `pyarrow._s3fs.S3FileSystem`
```python
import s3fs
from pyarrow.fs import FileSystem, _ensure_filesystem

def _ensure_filesystem_checkinstance():
    s3 = s3fs.S3FileSystem()
    print(isinstance(s3, FileSystem))

def fs_pandas():
    s3 = s3fs.S3FileSystem()
    fs = _ensure_filesystem(s3)
    print(fs)

def fs_pq():
    filesystem, path_or_paths = FileSystem.from_uri(s3_path)
    print(filesystem)

_ensure_filesystem_checkinstance()
fs_pandas()
fs_pq()
```
Result
```
_ensure_filesystem_checkinstance()
False
fs_pandas()
pyarrow._fs.FileSystem
fs_pq()
pyarrow._s3.S3FileSystem
```
I don't think this is expected behavior, and I suggest reopening the issue.
[GitHub] [arrow] JonnyWaffles commented on issue #34414: [Python] slow s3 parquet reads when using fsspec S3FileSystem
Posted by "JonnyWaffles (via GitHub)" <gi...@apache.org>.
JonnyWaffles commented on issue #34414:
URL: https://github.com/apache/arrow/issues/34414#issuecomment-1464416272
@jorisvandenbossche technically a "directory" but there was only a single ~220 MB parquet file inside it, so I guess a single file would be the correct answer.
[GitHub] [arrow] assignUser closed issue #34414: slow s3 parquet reads when using fsspec S3FileSystem
Posted by "assignUser (via GitHub)" <gi...@apache.org>.
assignUser closed issue #34414: slow s3 parquet reads when using fsspec S3FileSystem
URL: https://github.com/apache/arrow/issues/34414
[GitHub] [arrow] assignUser commented on issue #34414: slow s3 parquet reads when using fsspec S3FileSystem
Posted by "assignUser (via GitHub)" <gi...@apache.org>.
assignUser commented on issue #34414:
URL: https://github.com/apache/arrow/issues/34414#issuecomment-1452146538
Thanks for posting the solution!
[GitHub] [arrow] JonnyWaffles commented on issue #34414: slow s3 parquet reads when using fsspec S3FileSystem
Posted by "JonnyWaffles (via GitHub)" <gi...@apache.org>.
JonnyWaffles commented on issue #34414:
URL: https://github.com/apache/arrow/issues/34414#issuecomment-1452102013
Hi friends, just a quick update: I fixed it like so
```python
tbl = pq.read_table("mybucket/mypath/part-0.parquet", filesystem=s3fs.S3FileSystem(default_cache_type="none"))
```
I don't understand what's going on, but this works! I think we can close this out.
[GitHub] [arrow] jorisvandenbossche commented on issue #34414: [Python] slow s3 parquet reads when using fsspec S3FileSystem
Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #34414:
URL: https://github.com/apache/arrow/issues/34414#issuecomment-1464097361
@xu2011 what you see is essentially behaviour controlled by pandas (passing an fsspec filesystem to pyarrow). It suffers from the same issue, namely that using an fsspec-wrapped filesystem in pyarrow can apparently be a lot slower, but avoiding that is something that needs to happen in pandas (pyarrow only sees the filesystem being passed to it).
Now, there is actually an open pandas PR to avoid using fsspec when reading with pyarrow in pandas' read_parquet (https://github.com/pandas-dev/pandas/pull/51609)
> Oh, I just wanted to add one more note: when using the dataset API with the above `fsspec` filesystem, it's much slower than `pq.read_table`

@JonnyWaffles hmm, that's strange, as those should essentially do the same thing these days. Is the `s3_key` pointing to a single file or to a directory?