Posted to issues@arrow.apache.org by "JonnyWaffles (via GitHub)" <gi...@apache.org> on 2023/03/02 15:46:12 UTC

[GitHub] [arrow] JonnyWaffles opened a new issue, #34414: slow s3 parquet reads when using fsspec S3FileSystem

JonnyWaffles opened a new issue, #34414:
URL: https://github.com/apache/arrow/issues/34414

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Hi team, I hope you are well. Apologies if this is documented somewhere, I searched around but could not find anything related.
   
   I am using PyArrow 10.0.1 inside an Ubuntu 20.04 container. For reasons beyond my control, I need to use the third-party [fsspec S3FileSystem](https://github.com/fsspec/s3fs) in lieu of the native `pyarrow.fs.S3FileSystem`. When trying to read a single 220 MB Parquet file from S3 using the fsspec s3fs, the PyArrow Dataset API takes north of 3 minutes!
   
   Using s3fs to read the bytes without PyArrow, or using the PyArrow-native S3FileSystem, completes within 16 seconds as expected. My understanding is that using `fsspec.S3FileSystem` should not degrade performance in PyArrow.
   
   Test example
   
   ```python
   import time

   import pyarrow.dataset as ds
   from pyarrow import fs

   # `s3fs` is an fsspec S3FileSystem instance; `path`/`key` point at the parquet file
   pa_s3fs = fs.S3FileSystem()
   time_start = time.perf_counter()
   ds.dataset(path, format="parquet", filesystem=pa_s3fs).to_table()
   print("pa S3Filesystem!: ", time.perf_counter() - time_start)

   time_start = time.perf_counter()
   s3fs.read_bytes(key)
   print("fsspec raw_bytes:!", time.perf_counter() - time_start)

   time_start = time.perf_counter()
   ds.dataset(path, format="parquet", filesystem=s3fs).to_table()
   print("PA w/ fsspec S3Filesystem!: ", time.perf_counter() - time_start)
   ```
   Output
   ```
   airflow-worker  | Wrote key='mybucket/mypath/part-0.parquet' size=211661451
   airflow-worker  | Landed total 201 MB to s3!
   airflow-worker  | pa S3Filesystem!:  14.232619059999706
   airflow-worker  | fsspec raw_bytes:! 15.852554664001218
   airflow-worker  | PA w/ fsspec S3Filesystem!:  194.75488279599995
   ```
   
   I found this phrase in the Pyarrow docs
   
   > `pre_buffer` ([bool](https://docs.python.org/3/library/stdtypes.html#bltin-boolean-values), default [True](https://docs.python.org/3/library/constants.html#True))
   > Coalesce and issue file reads in parallel to improve performance on high-latency filesystems (e.g. S3). If True, Arrow will use a background I/O thread pool. This option is only supported for use_legacy_dataset=False. If using a filesystem layer that itself performs readahead (e.g. fsspec's S3FS), disable readahead for best results.
   
   So I tried setting `pre_buffer=False` but the performance issue remains.
   
   Any idea what could be causing the performance issue when leveraging fsspec inside of Pyarrow?
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] assignUser commented on issue #34414: slow s3 parquet reads when using fsspec S3FileSystem

Posted by "assignUser (via GitHub)" <gi...@apache.org>.
assignUser commented on issue #34414:
URL: https://github.com/apache/arrow/issues/34414#issuecomment-1452590646

   Hmm, is this expected behaviour @jorisvandenbossche?




[GitHub] [arrow] JonnyWaffles commented on issue #34414: slow s3 parquet reads when using fsspec S3FileSystem

Posted by "JonnyWaffles (via GitHub)" <gi...@apache.org>.
JonnyWaffles commented on issue #34414:
URL: https://github.com/apache/arrow/issues/34414#issuecomment-1452405066

   Oh, I just wanted to add one more note: when using the above `fsspec` filesystem, the dataset API is much slower than `pq.read_table`.
   
   Example
   ```python
   import time

   import pyarrow.dataset as ds
   import pyarrow.parquet as pq

   # `s3_key` points at the parquet file; `s3fs` is an fsspec S3FileSystem instance

   def read_table_pq(s3fs):
       time_start = time.perf_counter()
       num_bytes = pq.read_table(s3_key, filesystem=s3fs).nbytes
       print(f"{num_bytes} Took {time.perf_counter() - time_start}")


   def read_table_ds(s3fs):
       time_start = time.perf_counter()
       num_bytes = ds.dataset(s3_key, filesystem=s3fs, format="parquet").to_table().nbytes
       print(f"{num_bytes} Took {time.perf_counter() - time_start}")
   ```
   Result
   ```
   read_table_pq(s3fs1)
   301340480 Took 25.171526359998097
   read_table_ds(s3fs1)
   301340480 Took 46.74551891900046
   ```
   




[GitHub] [arrow] xu2011 commented on issue #34414: [Python] slow s3 parquet reads when using fsspec S3FileSystem

Posted by "xu2011 (via GitHub)" <gi...@apache.org>.
xu2011 commented on issue #34414:
URL: https://github.com/apache/arrow/issues/34414#issuecomment-1455341709

   I bumped into a similar issue using pd.read_parquet to read from S3. Read performance is slow compared to pq.read_table().to_pandas(), even though pandas.read_parquet uses the same function call underneath.
   
   I found the issue could be caused by [_ensure_filesystem](https://github.com/apache/arrow/blob/d5b3b4737838774db658d3c488fcd3e72bc13f7e/python/pyarrow/parquet/core.py#L2431) in the _ParquetDataset class, which [checks the object type of the given filesystem](https://github.com/apache/arrow/blob/d5b3b4737838774db658d3c488fcd3e72bc13f7e/python/pyarrow/fs.py#L99). When the check fails, it reconstructs the filesystem with [PyFileSystem](https://github.com/apache/arrow/blob/d5b3b4737838774db658d3c488fcd3e72bc13f7e/python/pyarrow/fs.py#L119), which slows down the read.
   
   With pq.read_table().to_pandas(), it [parses the filesystem from the S3 path](https://github.com/apache/arrow/blob/d5b3b4737838774db658d3c488fcd3e72bc13f7e/python/pyarrow/parquet/core.py#L2459) and gets `pyarrow._s3fs.S3FileSystem`.
   
   
   
   ```python
   import s3fs
   from pyarrow._fs import FileSystem, _ensure_filesystem

   # `s3_path` is an s3:// URI pointing at the parquet data

   def _ensure_filesystem_checkinstance():
       s3 = s3fs.S3FileSystem()
       print(isinstance(s3, FileSystem))


   def fs_pandas():
       s3 = s3fs.S3FileSystem()
       fs = _ensure_filesystem(s3)
       print(fs)


   def fs_pq():
       filesystem, path_or_paths = FileSystem.from_uri(s3_path)
       print(filesystem)


   _ensure_filesystem_checkinstance()
   fs_pandas()
   fs_pq()
   ```
   Result
   ```
   _ensure_filesystem_checkinstance()
   False
   
   fs_pandas()
   pyarrow._fs.FileSystem
   
   fs_pq()
   pyarrow._s3.S3FileSystem
   ```
   
   I don't think this is expected behavior, and I suggest reopening the issue.
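The type-check-and-wrap dispatch described above can be sketched with a stdlib-only mock. These class names are illustrative stand-ins for pyarrow's FileSystem / PyFileSystem, not the actual implementation:

```python
# Illustrative mock (not pyarrow code): a filesystem that isn't already a
# native FileSystem instance gets wrapped in a Python-level adapter, so every
# subsequent I/O call crosses the Python boundary.

class NativeFileSystem:
    """Stands in for a pyarrow C++-backed filesystem."""
    def open_input_file(self, path):
        return f"native read of {path}"

class WrappedPyFileSystem(NativeFileSystem):
    """Stands in for PyFileSystem: adapts an arbitrary Python object."""
    def __init__(self, handler):
        self.handler = handler
    def open_input_file(self, path):
        # every call is forwarded through Python, adding per-call overhead
        return self.handler.open(path)

class FsspecLikeFS:
    """Stands in for an fsspec filesystem (not a NativeFileSystem subclass)."""
    def open(self, path):
        return f"fsspec read of {path}"

def ensure_filesystem(fs):
    # mirrors the isinstance check described in the comment above
    if isinstance(fs, NativeFileSystem):
        return fs
    return WrappedPyFileSystem(fs)

print(type(ensure_filesystem(NativeFileSystem())).__name__)  # NativeFileSystem
print(type(ensure_filesystem(FsspecLikeFS())).__name__)      # WrappedPyFileSystem
```

Under this model, a native filesystem is returned untouched, while an fsspec-style object is routed through the wrapper on every call, which is consistent with the slowdown described above.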




[GitHub] [arrow] JonnyWaffles commented on issue #34414: [Python] slow s3 parquet reads when using fsspec S3FileSystem

Posted by "JonnyWaffles (via GitHub)" <gi...@apache.org>.
JonnyWaffles commented on issue #34414:
URL: https://github.com/apache/arrow/issues/34414#issuecomment-1464416272

   @jorisvandenbossche technically a "directory" but there was only a single ~220 MB parquet file inside it, so I guess a single file would be the correct answer.




[GitHub] [arrow] assignUser closed issue #34414: slow s3 parquet reads when using fsspec S3FileSystem

Posted by "assignUser (via GitHub)" <gi...@apache.org>.
assignUser closed issue #34414: slow s3 parquet reads when using fsspec S3FileSystem
URL: https://github.com/apache/arrow/issues/34414




[GitHub] [arrow] assignUser commented on issue #34414: slow s3 parquet reads when using fsspec S3FileSystem

Posted by "assignUser (via GitHub)" <gi...@apache.org>.
assignUser commented on issue #34414:
URL: https://github.com/apache/arrow/issues/34414#issuecomment-1452146538

   Thanks for posting the solution!




[GitHub] [arrow] JonnyWaffles commented on issue #34414: slow s3 parquet reads when using fsspec S3FileSystem

Posted by "JonnyWaffles (via GitHub)" <gi...@apache.org>.
JonnyWaffles commented on issue #34414:
URL: https://github.com/apache/arrow/issues/34414#issuecomment-1452102013

   Hi friends, just a quick update: I fixed it like so
   ```python
   import s3fs
   import pyarrow.parquet as pq

   tbl = pq.read_table("mybucket/mypath/part-0.parquet", filesystem=s3fs.S3FileSystem(default_cache_type="none"))
   ```
   I don't understand what's going on, but this works! I think we can close this out.
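A plausible explanation for why `default_cache_type="none"` helps: fsspec file objects keep a readahead-style block cache by default, so a reader that seeks around the file (as Arrow's parquet reader does, jumping between the footer and individual column chunks) can trigger a large block fetch for every small read. A stdlib-only simulation of that effect (the block size and access pattern are made up for illustration; this is not fsspec's actual code):

```python
BLOCK = 5 * 2**20  # illustrative readahead block size (5 MB)

def bytes_fetched(read_offsets, read_len, readahead):
    """Total bytes pulled from 'remote' storage for a sequence of small reads."""
    fetched = 0
    cache_start, cache_end = 0, 0  # currently cached window [start, end)
    for off in read_offsets:
        if readahead:
            if not (cache_start <= off and off + read_len <= cache_end):
                # cache miss: fetch a whole readahead block at this offset
                cache_start, cache_end = off, off + BLOCK
                fetched += BLOCK
        else:
            fetched += read_len  # no cache: fetch exactly what was asked for
    return fetched

# 50 small reads scattered 10 MB apart, so each one misses the 5 MB cache
offsets = [i * 10 * 2**20 for i in range(50)]
print(bytes_fetched(offsets, 4096, readahead=False))  # 204800 bytes (~200 KB)
print(bytes_fetched(offsets, 4096, readahead=True))   # 262144000 bytes (~250 MB)
```

Under this model the cached reader pulls over a thousand times more data than the direct one, which is consistent with the slowdown reported above and with the docs' advice to disable readahead.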
   




[GitHub] [arrow] jorisvandenbossche commented on issue #34414: [Python] slow s3 parquet reads when using fsspec S3FileSystem

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #34414:
URL: https://github.com/apache/arrow/issues/34414#issuecomment-1464097361

   @xu2011 what you see is essentially behaviour controlled by pandas (passing an fsspec filesystem to pyarrow). It suffers from the same underlying issue, namely that using an fsspec-wrapped filesystem in pyarrow can apparently be a lot slower, but avoiding that is something that needs to be done in pandas (pyarrow only sees the filesystem that is passed to it).
   Now, there is actually an open pandas PR to avoid using fsspec when using pyarrow in pandas' read_parquet (https://github.com/pandas-dev/pandas/pull/51609)
   
   > Oh I just wanted to add one more note, when using the dataset API with the above `fsspec` I noticed it's much slower to use the dataset API versus `pq.read_table`
   
   @JonnyWaffles hmm, that's strange, as those should essentially do the same thing these days. Is `s3_key` pointing to a single file or to a directory?

