Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/06/13 17:46:43 UTC

[GitHub] [arrow] westonpace commented on issue #36007: [R] read_parquet from s3 is slow and often flakey

westonpace commented on issue #36007:
URL: https://github.com/apache/arrow/issues/36007#issuecomment-1589766322

   Are you accessing a public file?  My guess is that you're running into the default AWS credentials provider chain.  By default, even if the file is public, the AWS SDK tries to work out which credentials to use before it makes any request.
   
   Unfortunately, if you don't have a credentials file stored where the SDK expects one (e.g. `~/.aws/credentials`), it starts trying some unpleasant alternatives.  The most problematic one is a request to `169.254.169.254`, a special link-local IP address that serves instance metadata on EC2 (e.g. in case your EC2 instance has credentials assigned to it).  Depending on your system's networking configuration, it may take some time for this request to fail.
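
   To check how long that metadata request takes to fail on a given machine, a rough probe along these lines may help. It is only a sketch: `probe_metadata_endpoint` is a hypothetical helper, not part of Arrow or the AWS SDK, and it only opens a TCP connection rather than reproducing the SDK's actual HTTP request or retry behaviour.

```python
import socket
import time

# Address of the EC2 instance metadata service, as named in the comment above.
METADATA_HOST = "169.254.169.254"


def probe_metadata_endpoint(host=METADATA_HOST, port=80, timeout=1.0):
    """Try to open a TCP connection and return (reachable, seconds_elapsed).

    Hypothetical helper for illustration only.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            reachable = True
    except OSError:
        # Covers timeouts, "network unreachable", and refused connections.
        reachable = False
    return reachable, time.monotonic() - start


if __name__ == "__main__":
    reachable, elapsed = probe_metadata_endpoint()
    print(f"reachable={reachable} elapsed={elapsed:.3f}s")
```

   If the probe reports an unreachable endpoint only after the full timeout, the credential lookup is probably paying a similar cost each time it reaches this step.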
   
   So, for example, if I delete my credentials file, turn on AWS trace logging, and run your test, then the first option takes 13 seconds and my logs are filled with messages like this:
   
   ```
   [INFO] 2023-06-13 17:28:26.917 ProcessCredentialsProvider [140191537006528] Failed to find credential process's profile: default
   [TRACE] 2023-06-13 17:28:26.917 FileSystemUtils [140191537006528] Checking HOME for the home directory.
   [DEBUG] 2023-06-13 17:28:26.917 FileSystemUtils [140191537006528] Environment value for variable HOME is /home/pace
   [DEBUG] 2023-06-13 17:28:26.917 FileSystemUtils [140191537006528] Home directory is missing the final / appending one to normalize
   [DEBUG] 2023-06-13 17:28:26.917 FileSystemUtils [140191537006528] Final Home Directory is /home/pace/
   [DEBUG] 2023-06-13 17:28:26.917 SSOCredentialsProvider [140191537006528] Loading token from: /home/pace/.aws/sso/cache/da39a3ee5e6b4b0d3255bfef95601890afd80709.json
   [DEBUG] 2023-06-13 17:28:26.917 SSOCredentialsProvider [140191537006528] Preparing to load token from: /home/pace/.aws/sso/cache/da39a3ee5e6b4b0d3255bfef95601890afd80709.json
   [INFO] 2023-06-13 17:28:26.917 SSOCredentialsProvider [140191537006528] Unable to open token file on path: /home/pace/.aws/sso/cache/da39a3ee5e6b4b0d3255bfef95601890afd80709.json
   [TRACE] 2023-06-13 17:28:26.917 SSOCredentialsProvider [140191537006528] Access token for SSO not available
   [DEBUG] 2023-06-13 17:28:26.917 InstanceProfileCredentialsProvider [140191537006528] Checking if latest credential pull has expired.
   [INFO] 2023-06-13 17:28:26.917 InstanceProfileCredentialsProvider [140191537006528] Credentials have expired attempting to re-pull from EC2 Metadata Service.
   [TRACE] 2023-06-13 17:28:26.917 EC2MetadataClient [140191537006528] Getting default credentials for ec2 instance from http://169.254.169.254
   [TRACE] 2023-06-13 17:28:26.917 EC2MetadataClient [140191537006528] Retrieving credentials from http://169.254.169.254/latest/meta-data/iam/security-credentials
   [TRACE] 2023-06-13 17:28:26.917 CurlHttpClient [140191537006528] Making request to http://169.254.169.254/latest/meta-data/iam/security-credentials
   [TRACE] 2023-06-13 17:28:26.917 CurlHttpClient [140191537006528] Including headers:
   [TRACE] 2023-06-13 17:28:26.917 CurlHttpClient [140191537006528] host: 169.254.169.254
   [TRACE] 2023-06-13 17:28:26.917 CurlHttpClient [140191537006528] user-agent: aws-sdk-cpp/1.10.13 Linux/5.19.0-43-generic x86_64 GCC/10.4.0
   [DEBUG] 2023-06-13 17:28:26.917 CurlHandleContainer [140191537006528] Attempting to acquire curl connection.
   [INFO] 2023-06-13 17:28:26.917 CurlHandleContainer [140191537006528] Connection has been released. Continuing.
   [DEBUG] 2023-06-13 17:28:26.917 CurlHandleContainer [140191537006528] Returning connection handle 0x563be150dc20
   [DEBUG] 2023-06-13 17:28:26.917 CurlHttpClient [140191537006528] Obtained connection handle 0x563be150dc20
   [DEBUG] 2023-06-13 17:28:26.918 CURL [140191537006528] (Text)   Trying 169.254.169.254:80...
   [DEBUG] 2023-06-13 17:28:27.919 CURL [140191537006528] (Text) After 1000ms connect time, move on!
   [DEBUG] 2023-06-13 17:28:27.919 CURL [140191537006528] (Text) connect to 169.254.169.254 port 80 failed: Connection timed out
   [DEBUG] 2023-06-13 17:28:27.919 CURL [140191537006528] (Text) Connection timeout after 1001 ms
   [DEBUG] 2023-06-13 17:28:27.919 CURL [140191537006528] (Text) Closing connection 0\
   [ERROR] 2023-06-13 17:28:27.919 CurlHttpClient [140191537006528] Curl returned error code 28 - Timeout was reached
   ```
   
   There is a way to specify anonymous credentials when creating an S3 filesystem (which should prevent the SDK from trying to resolve credentials on each request), but I don't know the R library well enough to know how to plumb that through to something like `read_parquet`.

