Posted to issues@arrow.apache.org by "Prem Sagar Gali (Jira)" <ji...@apache.org> on 2022/05/02 17:59:00 UTC
[jira] [Created] (ARROW-16438) pyarrow dataset API fails to read s3 directory
Prem Sagar Gali created ARROW-16438:
---------------------------------------
Summary: pyarrow dataset API fails to read s3 directory
Key: ARROW-16438
URL: https://issues.apache.org/jira/browse/ARROW-16438
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 7.0.0
Reporter: Prem Sagar Gali
When an fsspec S3 filesystem (`s3fs.core.S3FileSystem`) is passed as the `filesystem` argument to the [pyarrow.dataset.dataset|https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset] API and `source` is a directory path that includes the bucket name, the call raises `FileNotFoundError`:
```python
In [5]: from fsspec.core import get_fs_token_paths
In [6]: fs, _, path = get_fs_token_paths("s3://prem-rapids-test/folder/", mode="rb")
In [7]: fs
Out[7]: <s3fs.core.S3FileSystem at 0x7f3d02cc1460>
In [8]: path
Out[8]: ['prem-rapids-test/folder']
In [10]: pa.dataset.dataset(path, filesystem=fs, format="parquet")
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
Input In [10], in <cell line: 1>()
----> 1 pa.dataset.dataset(path, filesystem=fs, format="parquet")
File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pyarrow/dataset.py:670, in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
668 elif isinstance(source, (tuple, list)):
669 if all(_is_path_like(elem) for elem in source):
--> 670 return _filesystem_dataset(source, **kwargs)
671 elif all(isinstance(elem, Dataset) for elem in source):
672 return _union_dataset(source, **kwargs)
File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pyarrow/dataset.py:422, in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
414 options = FileSystemFactoryOptions(
415 partitioning=partitioning,
416 partition_base_dir=partition_base_dir,
417 exclude_invalid_files=exclude_invalid_files,
418 selector_ignore_prefixes=selector_ignore_prefixes
419 )
420 factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
--> 422 return factory.finish(schema)
File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pyarrow/_dataset.pyx:1680, in pyarrow._dataset.DatasetFactory.finish()
File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pyarrow/error.pxi:143, in pyarrow.lib.pyarrow_internal_check_status()
File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pyarrow/_fs.pyx:1179, in pyarrow._fs._cb_open_input_file()
File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pyarrow/fs.py:394, in FSSpecHandler.open_input_file(self, path)
391 from pyarrow import PythonFile
393 if not self.fs.isfile(path):
--> 394 raise FileNotFoundError(path)
396 return PythonFile(self.fs.open(path, mode="rb"), mode="r")
FileNotFoundError: prem-rapids-test/folder
```
It works only when the folder is passed as a full `s3://` URI string, with no `filesystem` argument:
```python
In [3]: import pyarrow.dataset
In [4]: pa.dataset.dataset("s3://prem-rapids-test/folder/", format="parquet")
Out[4]: <pyarrow._dataset.FileSystemDataset at 0x7f3ce502d870>
```
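One detail in the failing session may be relevant: `path` is a one-element list (`Out[8]`), and the traceback shows `dataset()` taking its `isinstance(source, (tuple, list))` branch and then calling `open_input_file` on the directory. A minimal sketch of a possible workaround is to unwrap the list before the call, on the assumption (not confirmed in this report) that a plain string source is handled as a directory. `unwrap_single_path` is a hypothetical helper, not part of pyarrow or fsspec.

```python
from typing import List, Sequence, Union

PathLike = Union[str, Sequence[str]]


def unwrap_single_path(paths: PathLike) -> Union[str, List[str]]:
    """Return the lone element of a one-item sequence, else the input as-is.

    Hypothetical helper for illustration; not part of pyarrow or fsspec.
    """
    if isinstance(paths, str):
        # Already a single path string: nothing to unwrap.
        return paths
    paths = list(paths)
    return paths[0] if len(paths) == 1 else paths


# Mirrors Out[8] from the failing session.
path = ["prem-rapids-test/folder"]
source = unwrap_single_path(path)
print(source)

# Assumed usage (requires pyarrow, s3fs, and access to the bucket; untested here):
# import pyarrow.dataset as ds
# ds.dataset(source, filesystem=fs, format="parquet")
```

Lists with more than one entry are returned unchanged, so real multi-file sources are not affected by the helper.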