Posted to issues@arrow.apache.org by "gforsyth (via GitHub)" <gi...@apache.org> on 2023/03/22 13:29:29 UTC

[GitHub] [arrow] gforsyth opened a new issue, #34683: ArrowInvalid error when trying to read list of s3 paths in pyarrow.dataset

gforsyth opened a new issue, #34683:
URL: https://github.com/apache/arrow/issues/34683

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Using a recent nightly on Ubuntu 22.04.2.
   
   `pyarrow.dataset.dataset` can load a parquet directory from a single `s3://` URI, but it throws an error if you pass it a list of `s3://` URIs pointing to individual parquet files that share the same schema.
   
   ```python
   [ins] In [1]: import pyarrow
   
   [ins] In [2]: pyarrow.__version__
   Out[2]: '12.0.0.dev266'
   
   [ins] In [3]: import pyarrow.dataset as ds
   /home/gil/mambaforge/envs/pyarrow_nightly/lib/python3.10/site-packages/pyarrow/compute.py:206: RuntimeWarning: Python binding for RunEndEncodeOptions not exposed
     warnings.warn("Python binding for {} not exposed"
   
   [ins] In [4]: files = [
            ...:     f"s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/00000{i}"
            ...:     for i in range(3)
            ...: ]
   
   [ins] In [5]: ds.dataset(files)
   ---------------------------------------------------------------------------
   ArrowInvalid                              Traceback (most recent call last)
   Cell In[5], line 1
   ----> 1 ds.dataset(files)
   
   File ~/mambaforge/envs/pyarrow_nightly/lib/python3.10/site-packages/pyarrow/dataset.py:765, in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
       763 elif isinstance(source, (tuple, list)):
       764     if all(_is_path_like(elem) for elem in source):
   --> 765         return _filesystem_dataset(source, **kwargs)
       766     elif all(isinstance(elem, Dataset) for elem in source):
       767         return _union_dataset(source, **kwargs)
   
   File ~/mambaforge/envs/pyarrow_nightly/lib/python3.10/site-packages/pyarrow/dataset.py:443, in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
       440 partitioning = _ensure_partitioning(partitioning)
       442 if isinstance(source, (list, tuple)):
   --> 443     fs, paths_or_selector = _ensure_multiple_sources(source, filesystem)
       444 else:
       445     fs, paths_or_selector = _ensure_single_source(source, filesystem)
   
   File ~/mambaforge/envs/pyarrow_nightly/lib/python3.10/site-packages/pyarrow/dataset.py:351, in _ensure_multiple_sources(paths, filesystem)
       344 is_local = (
       345     isinstance(filesystem, (LocalFileSystem, _MockFileSystem)) or
       346     (isinstance(filesystem, SubTreeFileSystem) and
       347      isinstance(filesystem.base_fs, LocalFileSystem))
       348 )
       350 # allow normalizing irregular paths such as Windows local paths
   --> 351 paths = [filesystem.normalize_path(_stringify_path(p)) for p in paths]
       353 # validate that all of the paths are pointing to existing *files*
       354 # possible improvement is to group the file_infos by type and raise for
       355 # multiple paths per error category
       356 if is_local:
   
   File ~/mambaforge/envs/pyarrow_nightly/lib/python3.10/site-packages/pyarrow/dataset.py:351, in <listcomp>(.0)
       344 is_local = (
       345     isinstance(filesystem, (LocalFileSystem, _MockFileSystem)) or
       346     (isinstance(filesystem, SubTreeFileSystem) and
       347      isinstance(filesystem.base_fs, LocalFileSystem))
       348 )
       350 # allow normalizing irregular paths such as Windows local paths
   --> 351 paths = [filesystem.normalize_path(_stringify_path(p)) for p in paths]
       353 # validate that all of the paths are pointing to existing *files*
       354 # possible improvement is to group the file_infos by type and raise for
       355 # multiple paths per error category
       356 if is_local:
   
   File ~/mambaforge/envs/pyarrow_nightly/lib/python3.10/site-packages/pyarrow/_fs.pyx:967, in pyarrow._fs.FileSystem.normalize_path()
   
   File ~/mambaforge/envs/pyarrow_nightly/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
   
   File ~/mambaforge/envs/pyarrow_nightly/lib/python3.10/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()
   
   ArrowInvalid: Expected a local filesystem path, got a URI: 's3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/000000'
   ```
   
   You can work around this by wrapping each path in its own dataset and unioning them, but it seems like passing the list directly should work?
   
   Workaround:
   ```python
   [ins] In [6]: ds.dataset(list(map(ds.dataset, files)))
   ```
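
   (This works because each URI gets resolved into its own dataset first, and `ds.dataset` handles a list of `Dataset` objects through the `_union_dataset` branch visible in the traceback above.)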
   
   The performance of `ds.dataset` loading files from S3 is _much_ improved compared with Arrow 11 -- for the files above I was seeing a pretty consistent ~4s per file, and that's dropped by at least 5x on this nightly!
   
   ### Component(s)
   
   Python


[GitHub] [arrow] gforsyth commented on issue #34683: [Python] ArrowInvalid error when trying to read list of s3 paths in pyarrow.dataset

Posted by "gforsyth (via GitHub)" <gi...@apache.org>.
gforsyth commented on issue #34683:
URL: https://github.com/apache/arrow/issues/34683#issuecomment-1479945863

   Ahh, my bad -- I saw the first line of the argument documentation, misread it, and missed the clarification below:
   ```
       source : path, list of paths, dataset, list of datasets, (list of) \
   RecordBatch or Table, iterable of RecordBatch, RecordBatchReader, or URI
   ```


[GitHub] [arrow] jorisvandenbossche commented on issue #34683: [Python] ArrowInvalid error when trying to read list of s3 paths in pyarrow.dataset

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #34683:
URL: https://github.com/apache/arrow/issues/34683#issuecomment-1479953699

   Yeah, for those docs, we should probably move URI to the front, so it reads something like "path or URI, list of paths, ..."


[GitHub] [arrow] jorisvandenbossche commented on issue #34683: [Python] ArrowInvalid error when trying to read list of s3 paths in pyarrow.dataset

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #34683:
URL: https://github.com/apache/arrow/issues/34683#issuecomment-1479937703

   We actually explicitly document this limitation, I see now:
   
   https://github.com/apache/arrow/blob/532b9a57cab4fa1f88438fd0bd79cb7eb8aa2df3/python/pyarrow/dataset.py#L581-L586
   
   And if you pass a list of strings (without an explicit filesystem), the current code assumes a LocalFileSystem and tries to interpret the values in the list as local file paths, which explains the error you see.
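
   To make the list-of-strings call work today, you can construct the S3 filesystem yourself and pass plain bucket paths. A minimal sketch (`anonymous=True` and the region are assumptions, based on this being a public `us-east-1` bucket):
   ```python
   import pyarrow.dataset as ds
   from pyarrow import fs

   # With an explicit filesystem, the strings below are interpreted as S3
   # paths rather than local paths. anonymous=True is an assumption for the
   # public GBIF bucket; credentialed buckets would omit it.
   s3 = fs.S3FileSystem(anonymous=True, region="us-east-1")
   files = [
       f"gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/00000{i}"
       for i in range(3)
   ]
   dataset = ds.dataset(files, filesystem=s3)
   ```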


[GitHub] [arrow] jorisvandenbossche commented on issue #34683: [Python] ArrowInvalid error when trying to read list of s3 paths in pyarrow.dataset

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #34683:
URL: https://github.com/apache/arrow/issues/34683#issuecomment-1479930917

   I _think_ supporting URIs passed as a list was a remaining todo when this was initially implemented. What certainly already works is passing a list of _paths_ (which, in the case of S3, then needs to be accompanied by a filesystem object). Using your example:
   
   ```python
   In [4]: from pyarrow.fs import FileSystem

   In [5]: fs, _ = FileSystem.from_uri(files[0])

   In [6]: ds.dataset([f.removeprefix("s3://") for f in files], filesystem=fs)  # lstrip would strip a char set, not a prefix
   Out[6]: <pyarrow._dataset.FileSystemDataset at 0x7f658f8a4280>
   ```
   
   I think the reason we initially punted on that is that, with a list of URIs, in principle every item in the list could point to a different filesystem (and so that would also need to be validated).
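
   As an illustration, URI support could resolve a filesystem from each item and require the results to be consistent before delegating to the existing path-based code. A sketch under those assumptions (the helper name and the type-based consistency check are hypothetical, not current pyarrow behavior):
   ```python
   import pyarrow.dataset as ds
   from pyarrow.fs import FileSystem

   def dataset_from_uris(uris):
       # Hypothetical helper: resolve each URI to a (filesystem, path) pair.
       pairs = [FileSystem.from_uri(uri) for uri in uris]
       filesystems = [fs for fs, _ in pairs]
       paths = [path for _, path in pairs]
       # Naive consistency check: require a single filesystem type. A real
       # implementation would need a finer-grained notion of equality
       # (e.g. same region/credentials), which is part of why this is hard.
       if len({type(fs) for fs in filesystems}) != 1:
           raise ValueError("URIs in the list point to different filesystems")
       return ds.dataset(paths, filesystem=filesystems[0])
   ```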

