Posted to github@arrow.apache.org by "amoeba (via GitHub)" <gi...@apache.org> on 2023/06/09 20:48:33 UTC

[GitHub] [arrow] amoeba commented on issue #36007: [R] read_parquet from s3 is slow and often flakey

amoeba commented on issue #36007:
URL: https://github.com/apache/arrow/issues/36007#issuecomment-1585110561

   Thanks for the report. One difference between your `arrow` and `aws.s3` code is that the `arrow` code first has to find the region(s) the object exists in, so it will always be a bit slower. But 70sec is extreme. Can you add `?region=eu-west-1` to your S3 URI? I think that will skip the step where we try to find the region. Like:
   
   `read_parquet("s3://arrow-s3-testing-eu-west-1/starwars.parquet?region=eu-west-1")`
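   
   If you build the URI in a script, something like the following could add the region for you. This is only a rough sketch: `with_region` is a hypothetical helper I made up for illustration, not part of `arrow`, and the actual read is commented out because it needs network access.
   
   ```r
   # Hypothetical helper (not part of arrow): add a region query parameter
   # to an S3 URI unless one is already present.
   with_region <- function(uri, region) {
     if (grepl("[?&]region=", uri)) {
       return(uri)
     }
     # Use "&" if the URI already has a query string, "?" otherwise.
     sep <- if (grepl("?", uri, fixed = TRUE)) "&" else "?"
     paste0(uri, sep, "region=", region)
   }
   
   uri <- with_region("s3://arrow-s3-testing-eu-west-1/starwars.parquet", "eu-west-1")
   # arrow::read_parquet(uri)  # assumes network access to the bucket
   ```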
   
   That said, the below error seems like something else:
   
   > Error in url(file, open = "rb") : URL scheme unsupported by this method
   
   I first tested to make sure `read_parquet` was working for me, which it was. Then I noticed you're in eu-west-1, so I made a bucket in eu-west-1, tried again, and got the above error. I knew I hadn't set a bucket ACL and thought it might be related to that, so I stepped through it in the debugger and found the issue: `read_parquet` is eating an AWS authentication error in this code:
   
   https://github.com/apache/arrow/blob/8b6688acca6d14ea69c533c150b2cc05a1403f91/r/R/io.R#L253-L260
   
   ```r
   > fs_and_path$fs$OpenInputFile(fs_and_path$path)
   Error: IOError: When reading information for key 'starwars.parquet' in bucket 'arrow-s3-testing-eu-west-1': AWS Error ACCESS_DENIED during HeadObject operation: No response body.
   ```
   
   Once I set an ACL the error went away and reads were fast again (1-2sec). So I guess my question is: why might you be intermittently getting an error that `read_parquet` is eating, and what is that error? Unfortunately #35398 isn't merged just yet, but once it is it might help. No matter what we find here, it would be good to tweak the code so errors aren't eaten.
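   
   To illustrate what "not eating the error" could look like, here is a minimal sketch. It is not the code in `io.R`; `open_or_report` and `fake_opener` are hypothetical names, and `opener` just stands in for a call like `fs_and_path$fs$OpenInputFile(fs_and_path$path)`.
   
   ```r
   # Hypothetical wrapper: run `opener` and, if it fails, re-raise the
   # underlying error with the URI attached instead of silently falling back.
   open_or_report <- function(opener, uri) {
     tryCatch(
       opener(),
       error = function(e) {
         stop("Could not open '", uri, "': ", conditionMessage(e), call. = FALSE)
       }
     )
   }
   
   # Stand-in opener that simulates the ACCESS_DENIED failure shown above.
   fake_opener <- function() {
     stop("IOError: AWS Error ACCESS_DENIED during HeadObject operation")
   }
   
   # Capture the message the user would see.
   msg <- tryCatch(
     open_or_report(fake_opener, "s3://arrow-s3-testing-eu-west-1/starwars.parquet"),
     error = function(e) conditionMessage(e)
   )
   ```
   
   With something along these lines, the user would see the ACCESS_DENIED message rather than the unrelated "URL scheme unsupported" one.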

