Posted to issues@arrow.apache.org by GitBox <gi...@apache.org> on 2023/01/16 16:40:59 UTC

[GitHub] [arrow] damart93 opened a new issue, #33707: [Python] Creating dataset with an S3 Path using fsspec filesystem does not work

damart93 opened a new issue, #33707:
URL: https://github.com/apache/arrow/issues/33707

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I was testing loading datasets with an fsspec filesystem instead of Arrow's native ones, just to try it out. Here is what I found:
   
   Given an S3 server, working keys, and a dataset_path that exists and is otherwise readable:
   
   If I try to create a dataset from dataset_path (a directory) given as a path relative to the root directory, it fails.
   
   The code below uses a fake path (I cannot share the original) so the issue is easier to see:
   
   ```
   import pyarrow.dataset as pd
   import fsspec
   
   dataset_path = "fake_directory/fake_dataset"
   
   app_keys = {
       'client_kwargs':{"endpoint_url": s3_host},
       'key': access_key,
       'secret': secret_key,
   }
   s3_fs = fsspec.filesystem("s3", **app_keys)
   dataset = pd.dataset(dataset_path, filesystem=s3_fs, format="parquet", partitioning="hive")
   #Traceback (most recent call last):
   # ...
   #    dataset = pd.dataset(f"{dataset_path}", filesystem=s3_fs, format="parquet", partitioning="hive")
   #  File "C:\Repos\project\venv\lib\site-packages\pyarrow\dataset.py", line 752, in dataset
   #    return _filesystem_dataset(source, **kwargs)
   #  File "C:\Repos\project\venv\lib\site-packages\pyarrow\dataset.py", line 454, in _filesystem_dataset
   #    return factory.finish(schema)
   #  File "pyarrow\_dataset.pyx", line 1940, in pyarrow._dataset.DatasetFactory.finish
   #  File "pyarrow\error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
   #  File "pyarrow\_fs.pyx", line 1551, in pyarrow._fs._cb_open_input_file
   #  File "C:\Repos\project\venv\lib\site-packages\pyarrow\fs.py", line 419, in open_input_file
   #    raise FileNotFoundError(path)
   #FileNotFoundError: fake_directory/fake_dataset
   ```
   If dataset_path is given as absolute (`s3://{dataset_path}` or simply `/{dataset_path}`), the same code throws a different error:
   
   ```
   import pyarrow.dataset as pd
   import fsspec
   
   dataset_path = "/fake_directory/fake_dataset"
   
   app_keys = {
       'client_kwargs':{"endpoint_url": s3_host},
       'key': access_key,
       'secret': secret_key,
   }
   s3_fs = fsspec.filesystem("s3", **app_keys)
   dataset = pd.dataset(dataset_path, filesystem=s3_fs, format="parquet", partitioning="hive")
   #Traceback (most recent call last):
   # ...
   #    dataset = pd.dataset(f"/{dataset_path}", filesystem=s3_fs, format="parquet", partitioning="hive")
   #  File "C:\Repos\project\venv\lib\site-packages\pyarrow\dataset.py", line 752, in dataset
   #    return _filesystem_dataset(source, **kwargs)
   #  File "C:\Repos\project\venv\lib\site-packages\pyarrow\dataset.py", line 452, in _filesystem_dataset
   #    factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
   #  File "pyarrow\_dataset.pyx", line 2114, in pyarrow._dataset.FileSystemDatasetFactory.__init__
   #  File "pyarrow\error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
   #  File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
   #pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 'fake_directory/fake_dataset/{rest_of_path_to_a_file}', which is outside base dir '/fake_directory/fake_dataset'
   ```
   
   My first thought was that this might simply not be an intended feature. However, if we pass a file or a list of files instead of a directory, the dataset is created just fine. Seeing that fsspec filesystems are supported in that case led me to open this issue.
   
   Also, the same dataset_path works fine through a SubTreeFileSystem, so this is not any kind of access problem, and the fsspec filesystem object itself can list the files just fine.
   
   To me, it looks like some kind of issue in the interaction between fsspec and Arrow during dataset discovery, which leads Arrow to confuse a directory with a file.
   
   Using Python 3.10.0 and fsspec 2022.11.0; tested on both pyarrow 9.0.0 and 10.0.1.
   
   
   
   
   ### Component(s)
   
   Python

