Posted to issues@arrow.apache.org by "gforsyth (via GitHub)" <gi...@apache.org> on 2023/03/22 13:29:29 UTC
[GitHub] [arrow] gforsyth opened a new issue, #34683: ArrowInvalid error when trying to read list of s3 paths in pyarrow.dataset
gforsyth opened a new issue, #34683:
URL: https://github.com/apache/arrow/issues/34683
### Describe the bug, including details regarding any error messages, version, and platform.
Using a recent nightly on Ubuntu 22.04.2.
`pyarrow.dataset.dataset` can load a parquet directory from an `s3`-prefixed URI but throws an error if you pass it a list of individual parquet files that share the same schema.
```python
[ins] In [1]: import pyarrow
[ins] In [2]: pyarrow.__version__
Out[2]: '12.0.0.dev266'
[ins] In [3]: import pyarrow.dataset as ds
/home/gil/mambaforge/envs/pyarrow_nightly/lib/python3.10/site-packages/pyarrow/compute.py:206: RuntimeWarning: Python binding for RunEndEncodeOptions not exposed
warnings.warn("Python binding for {} not exposed"
[ins] In [4]: files = [
...: f"s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/00000{i}"
...: for i in range(3)
...: ]
[ins] In [5]: ds.dataset(files)
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
Cell In[5], line 1
----> 1 ds.dataset(files)
File ~/mambaforge/envs/pyarrow_nightly/lib/python3.10/site-packages/pyarrow/dataset.py:765, in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
763 elif isinstance(source, (tuple, list)):
764 if all(_is_path_like(elem) for elem in source):
--> 765 return _filesystem_dataset(source, **kwargs)
766 elif all(isinstance(elem, Dataset) for elem in source):
767 return _union_dataset(source, **kwargs)
File ~/mambaforge/envs/pyarrow_nightly/lib/python3.10/site-packages/pyarrow/dataset.py:443, in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
440 partitioning = _ensure_partitioning(partitioning)
442 if isinstance(source, (list, tuple)):
--> 443 fs, paths_or_selector = _ensure_multiple_sources(source, filesystem)
444 else:
445 fs, paths_or_selector = _ensure_single_source(source, filesystem)
File ~/mambaforge/envs/pyarrow_nightly/lib/python3.10/site-packages/pyarrow/dataset.py:351, in _ensure_multiple_sources(paths, filesystem)
344 is_local = (
345 isinstance(filesystem, (LocalFileSystem, _MockFileSystem)) or
346 (isinstance(filesystem, SubTreeFileSystem) and
347 isinstance(filesystem.base_fs, LocalFileSystem))
348 )
350 # allow normalizing irregular paths such as Windows local paths
--> 351 paths = [filesystem.normalize_path(_stringify_path(p)) for p in paths]
353 # validate that all of the paths are pointing to existing *files*
354 # possible improvement is to group the file_infos by type and raise for
355 # multiple paths per error category
356 if is_local:
File ~/mambaforge/envs/pyarrow_nightly/lib/python3.10/site-packages/pyarrow/dataset.py:351, in <listcomp>(.0)
344 is_local = (
345 isinstance(filesystem, (LocalFileSystem, _MockFileSystem)) or
346 (isinstance(filesystem, SubTreeFileSystem) and
347 isinstance(filesystem.base_fs, LocalFileSystem))
348 )
350 # allow normalizing irregular paths such as Windows local paths
--> 351 paths = [filesystem.normalize_path(_stringify_path(p)) for p in paths]
353 # validate that all of the paths are pointing to existing *files*
354 # possible improvement is to group the file_infos by type and raise for
355 # multiple paths per error category
356 if is_local:
File ~/mambaforge/envs/pyarrow_nightly/lib/python3.10/site-packages/pyarrow/_fs.pyx:967, in pyarrow._fs.FileSystem.normalize_path()
File ~/mambaforge/envs/pyarrow_nightly/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
File ~/mambaforge/envs/pyarrow_nightly/lib/python3.10/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()
ArrowInvalid: Expected a local filesystem path, got a URI: 's3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/000000'
```
You can work around this with a union dataset, but passing the list of URIs directly seems like it should work?
Workaround:
```python
[ins] In [6]: ds.dataset(list(map(ds.dataset, files)))
```
The performance of `ds.dataset` loading files from S3 is _much_ improved compared with Arrow 11 -- for the files above I was seeing a pretty consistent ~4s per file, and that's dropped by at least 5x on this nightly!
### Component(s)
Python
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] gforsyth commented on issue #34683: [Python] ArrowInvalid error when trying to read list of s3 paths in pyarrow.dataset
Posted by "gforsyth (via GitHub)" <gi...@apache.org>.
gforsyth commented on issue #34683:
URL: https://github.com/apache/arrow/issues/34683#issuecomment-1479945863
Ahh, my bad -- I saw the first line of the argument documentation, misread it, and missed the clarification below:
```
source : path, list of paths, dataset, list of datasets, (list of) \
RecordBatch or Table, iterable of RecordBatch, RecordBatchReader, or URI
```
[GitHub] [arrow] jorisvandenbossche commented on issue #34683: [Python] ArrowInvalid error when trying to read list of s3 paths in pyarrow.dataset
Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #34683:
URL: https://github.com/apache/arrow/issues/34683#issuecomment-1479953699
Yeah, for those docs, we should probably move URI to the front, so it reads something like "path or URI, list of paths, ..."
[GitHub] [arrow] jorisvandenbossche commented on issue #34683: [Python] ArrowInvalid error when trying to read list of s3 paths in pyarrow.dataset
Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #34683:
URL: https://github.com/apache/arrow/issues/34683#issuecomment-1479937703
We actually explicitly document this limitation, I see now:
https://github.com/apache/arrow/blob/532b9a57cab4fa1f88438fd0bd79cb7eb8aa2df3/python/pyarrow/dataset.py#L581-L586
And if you pass a list of strings (without explicit filesystem), the current code also assumes a LocalFileSystem and thus tries to interpret the values in the list as local file paths, so that explains the error you see.
[GitHub] [arrow] jorisvandenbossche commented on issue #34683: [Python] ArrowInvalid error when trying to read list of s3 paths in pyarrow.dataset
Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #34683:
URL: https://github.com/apache/arrow/issues/34683#issuecomment-1479930917
I _think_ supporting URIs passed as a list was a remaining todo when this was initially implemented. What certainly already works is passing a list of _paths_ (which, in the case of S3, then needs to be accompanied by a filesystem object). Using your example:
```python
In [4]: from pyarrow.fs import FileSystem
In [5]: fs, _ = FileSystem.from_uri(files[0])
In [6]: ds.dataset([f.removeprefix("s3://") for f in files], filesystem=fs)
Out[6]: <pyarrow._dataset.FileSystemDataset at 0x7f658f8a4280>
```
(Note: `str.removeprefix` rather than `str.lstrip` -- `lstrip("s3://")` strips any of the characters `s`, `3`, `:`, `/` from the left, which would mangle bucket names starting with those characters.)
I think the reason we initially punted on that is that, with a list of URIs, in principle every item in the list could point to a different filesystem (so that would also need to be validated).