Posted to issues@arrow.apache.org by "maubarsom (via GitHub)" <gi...@apache.org> on 2023/09/26 19:41:31 UTC
[GitHub] [arrow] maubarsom opened a new issue, #37888: pyarrow fails to read from AWS S3 when credentials come from ~/.aws/credentials
maubarsom opened a new issue, #37888:
URL: https://github.com/apache/arrow/issues/37888
### Describe the bug, including details regarding any error messages, version, and platform.
Bug seen in pyarrow version 12.0.0 on macOS Ventura 13.6, Apple M1 Pro.
# Description
The error was detected in `pandas` originally, but traced down to `pyarrow`, as shown in the screenshot. In short, when I try to read an existing file from S3 with my credentials stored in ~/.aws/credentials (and ~/.aws/config), pyarrow returns the following error:
```
OSError: When getting information for key 'XXX/YYY.parquet' in bucket 'ZZZZZ': AWS Error ACCESS_DENIED during HeadObject operation: No response body.
```
**Expected result**: The file is successfully read.
**Note:** This error DOES NOT occur if the credentials are set as environment variables (instead of being read from ~/.aws/credentials). With the env variables set, pyarrow successfully reads the parquet file.
**Note 2:** As shown in the screenshot, I managed to circumvent the issue in **pandas** by passing `storage_options={"anon": False}` explicitly. However, the analogous approach in `pyarrow`, setting `filesystem=S3FileSystem(anonymous=False)` explicitly, did not succeed and resulted in the same error.
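One further workaround to try (an editorial sketch, not from the report): since the shared credentials file is plain INI, it can be parsed with the standard library and the values passed to `S3FileSystem` explicitly, bypassing pyarrow's own credential resolution. `load_aws_credentials` is a hypothetical helper; the path and profile name are the AWS CLI defaults.

```python
import configparser
import os

def load_aws_credentials(profile="default", path="~/.aws/credentials"):
    """Read one profile from the shared AWS credentials file (INI format)."""
    cfg = configparser.ConfigParser()
    cfg.read(os.path.expanduser(path))
    section = cfg[profile]
    return {
        "access_key": section["aws_access_key_id"],
        "secret_key": section["aws_secret_access_key"],
        "session_token": section.get("aws_session_token"),  # None if absent
    }

# The dict maps directly onto pyarrow's S3FileSystem keyword arguments:
#   from pyarrow.fs import S3FileSystem
#   fs = S3FileSystem(**load_aws_credentials())
#   table = pyarrow.parquet.read_table("ZZZZZ/XXX/YYY.parquet", filesystem=fs)
```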
# Screenshot
![pyarrow_bug_report](https://github.com/apache/arrow/assets/5690589/af2d00de-ce4e-4214-9e89-b9fa11e10043)
The traceback:
```
File ~/.mambaforge/envs/datasci/lib/python3.11/site-packages/pyarrow/parquet/core.py:2939, in read_table(source, columns, use_threads, metadata, schema, use_pandas_metadata, read_dictionary, memory_map, buffer_size, partitioning, filesystem, filters, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit)
   2932     raise ValueError(
   2933         "The 'metadata' keyword is no longer supported with the new "
   2934         "datasets-based implementation. Specify "
   2935         "'use_legacy_dataset=True' to temporarily recover the old "
   2936         "behaviour."
   2937     )
   2938 try:
-> 2939     dataset = _ParquetDatasetV2(
   2940         source,
   2941         schema=schema,
   2942         filesystem=filesystem,
   2943         partitioning=partitioning,
   2944         memory_map=memory_map,
   2945         read_dictionary=read_dictionary,
   2946         buffer_size=buffer_size,
   2947         filters=filters,
   2948         ignore_prefixes=ignore_prefixes,
   2949         pre_buffer=pre_buffer,
   2950         coerce_int96_timestamp_unit=coerce_int96_timestamp_unit,
   2951         thrift_string_size_limit=thrift_string_size_limit,
   2952         thrift_container_size_limit=thrift_container_size_limit,
   2953     )
   2954 except ImportError:
   2955     # fall back on ParquetFile for simple cases when pyarrow.dataset
   2956     # module is not available
   2957     if filters is not None:

File ~/.mambaforge/envs/datasci/lib/python3.11/site-packages/pyarrow/parquet/core.py:2465, in _ParquetDatasetV2.__init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, schema, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, **kwargs)
   2463 except ValueError:
   2464     filesystem = LocalFileSystem(use_mmap=memory_map)
-> 2465 finfo = filesystem.get_file_info(path_or_paths)
   2466 if finfo.is_file:
   2467     single_file = path_or_paths

File ~/.mambaforge/envs/datasci/lib/python3.11/site-packages/pyarrow/_fs.pyx:571, in pyarrow._fs.FileSystem.get_file_info()
File ~/.mambaforge/envs/datasci/lib/python3.11/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
File ~/.mambaforge/envs/datasci/lib/python3.11/site-packages/pyarrow/error.pxi:115, in pyarrow.lib.check_status()
```
### Component(s)
Parquet, Python
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
Re: [I] pyarrow fails to read from AWS S3 when credentials come from ~/.aws/credentials [arrow]
Posted by "maubarsom (via GitHub)" <gi...@apache.org>.
maubarsom commented on issue #37888:
URL: https://github.com/apache/arrow/issues/37888#issuecomment-1764317740
Hi! Thanks for the reply. Maybe it wasn't so clear from my description above, but for `pandas` I did find a workaround, which is to pass `storage_options={ "anon": False }` to the `pandas.read_parquet()` call (taken from the `s3fs` documentation, btw). I'm guessing this workaround performs the same as the call without the parameter.
Above, I was mostly trying to help get to the bottom of the issue, and that's as far as I managed. Maybe other wrappers of pyarrow are affected as well? I don't know.
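Spelled out as a minimal sketch (editorial addition; the S3 path is a placeholder, and `s3fs` must be installed alongside pandas for `s3://` URLs):

```python
import pandas as pd

def read_parquet_not_anon(path):
    # storage_options is forwarded to s3fs; anon=False makes it resolve
    # credentials from the usual chain, ~/.aws/credentials included.
    return pd.read_parquet(path, storage_options={"anon": False})

# e.g. df = read_parquet_not_anon("s3://ZZZZZ/XXX/YYY.parquet")
```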
Re: [I] pyarrow fails to read from AWS S3 when credentials come from ~/.aws/credentials [arrow]
Posted by "rdbisme (via GitHub)" <gi...@apache.org>.
rdbisme commented on issue #37888:
URL: https://github.com/apache/arrow/issues/37888#issuecomment-1764202414
As a workaround, wrapping the `read_parquet` call with `fsspec` works:
```
import fsspec
import pandas as pd
from io import BytesIO

_native_read_parquet = pd.read_parquet

def read_parquet(f, *args, **kwargs):
    if isinstance(f, BytesIO):  # in-memory buffers need no filesystem
        return _native_read_parquet(f, *args, **kwargs)
    kwargs.pop("filesystem", None)
    fs = fsspec.open(f).fs  # let fsspec resolve the URL and credentials
    return _native_read_parquet(f, *args, filesystem=fs, **kwargs)

pd.read_parquet = read_parquet
```
but it's probably slower and more memory hungry.
[GitHub] [arrow] maubarsom commented on issue #37888: pyarrow fails to read from AWS S3 when credentials come from ~/.aws/credentials
Posted by "maubarsom (via GitHub)" <gi...@apache.org>.
maubarsom commented on issue #37888:
URL: https://github.com/apache/arrow/issues/37888#issuecomment-1737541579
Update: I asked a colleague to run this on Linux with pyarrow `13.0.0`; the same error occurs under the same conditions.
Re: [I] pyarrow fails to read from AWS S3 when credentials come from ~/.aws/credentials [arrow]
Posted by "afonso-stuart (via GitHub)" <gi...@apache.org>.
afonso-stuart commented on issue #37888:
URL: https://github.com/apache/arrow/issues/37888#issuecomment-2059517889
I reproduced the same bug on pyarrow versions `12.0.0` and above, all the way to `15.0.2`. I'm on macOS Sonoma 14.4 with an Apple M1 Max chip. Rolling pyarrow back to version `11.0.0` fixes it for me, as does the workaround suggested by @maubarsom.