Posted to issues@arrow.apache.org by "maubarsom (via GitHub)" <gi...@apache.org> on 2023/09/26 19:41:31 UTC

[GitHub] [arrow] maubarsom opened a new issue, #37888: pyarrow fails to read from AWS S3 when credentials come from ~/.aws/credentials

maubarsom opened a new issue, #37888:
URL: https://github.com/apache/arrow/issues/37888

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Bug seen in pyarrow version 12.0.0 on macOS Ventura 13.6, Apple M1 Pro.
   
   # Description
   
   The error was originally detected in `pandas`, but traced to `pyarrow`, as described in the screenshot. In short, if I try to read an existing file from `S3` while my credentials are stored in `~/.aws/credentials` (and `~/.aws/config`), pyarrow returns the following error:
   
   ```
   OSError: When getting information for key 'XXX/YYY.parquet' in bucket 'ZZZZZ': AWS Error ACCESS_DENIED during HeadObject operation: No response body.
   ```
   
   **Expected result**: The file is successfully read.
   
   **Note:** This error DOES NOT occur if the credentials are set as environment variables (instead of being read from `~/.aws/credentials`). If they are set as env variables, pyarrow successfully reads the parquet file.
   
   **Note 2:** As shown in the screenshot, I managed to circumvent the issue in **pandas** by passing `storage_options={"anon": False}` explicitly. However, a similar approach in `pyarrow`, setting `filesystem=S3FileSystem(anonymous=False)` explicitly, did not succeed and resulted in the same error.
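   One hypothetical workaround (a sketch, not something from this thread) is to bypass the provider-chain lookup entirely: parse the shared credentials file yourself and hand the keys to `S3FileSystem` explicitly. This assumes a standard `~/.aws/credentials` layout and a `default` profile; the bucket path in the usage comment is a placeholder.

```
import configparser
import os


def load_aws_credentials(profile="default", path="~/.aws/credentials"):
    """Parse the shared AWS credentials file and return (key, secret)."""
    cp = configparser.ConfigParser()
    cp.read(os.path.expanduser(path))
    section = cp[profile]
    return section["aws_access_key_id"], section["aws_secret_access_key"]


# Usage sketch (requires pyarrow built with S3 support; path is a placeholder):
# import pyarrow.parquet as pq
# from pyarrow.fs import S3FileSystem
# key, secret = load_aws_credentials()
# fs = S3FileSystem(access_key=key, secret_key=secret)
# table = pq.read_table("my-bucket/data.parquet", filesystem=fs)
```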
   
   # Screenshot
   
   ![pyarrow_bug_report](https://github.com/apache/arrow/assets/5690589/af2d00de-ce4e-4214-9e89-b9fa11e10043)
   
   
   The traceback:
   
   ```
   File ~/.mambaforge/envs/datasci/lib/python3.11/site-packages/pyarrow/parquet/core.py:2939, in read_table(source, columns, use_threads, metadata, schema, use_pandas_metadata, read_dictionary, memory_map, buffer_size, partitioning, filesystem, filters, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit)
      2932     raise ValueError( 
      2933         "The 'metadata' keyword is no longer supported with the new "
      2934         "datasets-based implementation. Specify "
      2935         "'use_legacy_dataset=True' to temporarily recover the old "
      2936         "behaviour."
      2937     )
      2938 try:
   -> 2939     dataset = _ParquetDatasetV2(
      2940         source,
      2941         schema=schema,
      2942         filesystem=filesystem,
      2943         partitioning=partitioning,
      2944         memory_map=memory_map,
      2945         read_dictionary=read_dictionary,
      2946         buffer_size=buffer_size,
      2947         filters=filters,
      2948         ignore_prefixes=ignore_prefixes,
      2949         pre_buffer=pre_buffer,
      2950         coerce_int96_timestamp_unit=coerce_int96_timestamp_unit,
      2951         thrift_string_size_limit=thrift_string_size_limit,
      2952         thrift_container_size_limit=thrift_container_size_limit,
      2953     )
      2954 except ImportError:
      2955     # fall back on ParquetFile for simple cases when pyarrow.dataset
      2956     # module is not available
      2957     if filters is not None:
   
   File ~/.mambaforge/envs/datasci/lib/python3.11/site-packages/pyarrow/parquet/core.py:2465, in _ParquetDatasetV2.__init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, schema, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, **kwargs)
      2463     except ValueError:
      2464         filesystem = LocalFileSystem(use_mmap=memory_map)
   -> 2465 finfo = filesystem.get_file_info(path_or_paths)
      2466 if finfo.is_file:
      2467     single_file = path_or_paths
   
   File ~/.mambaforge/envs/datasci/lib/python3.11/site-packages/pyarrow/_fs.pyx:571, in pyarrow._fs.FileSystem.get_file_info()
   
   File ~/.mambaforge/envs/datasci/lib/python3.11/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
   
   File ~/.mambaforge/envs/datasci/lib/python3.11/site-packages/pyarrow/error.pxi:115, in pyarrow.lib.check_status()
   ```
   
   ### Component(s)
   
   Parquet, Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] pyarrow fails to read from AWS S3 when credentials come from ~/.aws/credentials [arrow]

Posted by "maubarsom (via GitHub)" <gi...@apache.org>.
maubarsom commented on issue #37888:
URL: https://github.com/apache/arrow/issues/37888#issuecomment-1764317740

   Hi! Thanks for the reply. Maybe it wasn't so clear from my description above, but for `pandas` I did find a workaround, which is to supply `storage_options={"anon": False}` to the `pandas.read_parquet()` call (which I took from the `s3fs` documentation, by the way). I'm guessing this workaround performs the same as a call without the parameter.
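   Spelled out, that pandas call looks like the following; the S3 path is a placeholder, and the actual read is commented out since it needs live credentials and a real bucket.

```
# "anon": False tells s3fs to authenticate (e.g. via ~/.aws/credentials)
# instead of attempting anonymous access.
storage_options = {"anon": False}

# import pandas as pd
# df = pd.read_parquet("s3://my-bucket/data.parquet",  # placeholder path
#                      storage_options=storage_options)
```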
   
   
   Above, I was mostly trying to help get to the bottom of the issue, and that's as far as I managed. Maybe other wrappers of pyarrow are affected as well? I don't know.
   
   




Re: [I] pyarrow fails to read from AWS S3 when credentials come from ~/.aws/credentials [arrow]

Posted by "rdbisme (via GitHub)" <gi...@apache.org>.
rdbisme commented on issue #37888:
URL: https://github.com/apache/arrow/issues/37888#issuecomment-1764202414

   As a workaround, wrapping the `read_parquet` call with `fsspec` works: 
   
   ```
   import fsspec
   import pandas as pd
   from io import BytesIO

   _native_read_parquet = pd.read_parquet


   def read_parquet(f, *args, **kwargs):
       # In-memory buffers need no filesystem; delegate directly.
       if isinstance(f, BytesIO):
           return _native_read_parquet(f, *args, **kwargs)

       # Let fsspec resolve the filesystem (and credentials) for the path.
       kwargs.pop("filesystem", None)
       fs = fsspec.open(f).fs
       return _native_read_parquet(f, *args, filesystem=fs, **kwargs)


   pd.read_parquet = read_parquet
   ```
   
   but it's probably slower and more memory-hungry.




[GitHub] [arrow] maubarsom commented on issue #37888: pyarrow fails to read from AWS S3 when credentials come from ~/.aws/credentials

Posted by "maubarsom (via GitHub)" <gi...@apache.org>.
maubarsom commented on issue #37888:
URL: https://github.com/apache/arrow/issues/37888#issuecomment-1737541579

   Update: I asked a colleague to run this on Linux with pyarrow `13.0.0`; the same error occurs under the same conditions.




Re: [I] pyarrow fails to read from AWS S3 when credentials come from ~/.aws/credentials [arrow]

Posted by "afonso-stuart (via GitHub)" <gi...@apache.org>.
afonso-stuart commented on issue #37888:
URL: https://github.com/apache/arrow/issues/37888#issuecomment-2059517889

   I reproduced the same bug on pyarrow versions from `12.0.0` all the way to `15.0.2`. I'm on macOS Sonoma 14.4 with an Apple M1 Max chip. Rolling pyarrow back to version `11.0.0` fixes it for me, as does the solution suggested by @maubarsom.
   

