Posted to common-issues@hadoop.apache.org by "Ahmar Suhail (Jira)" <ji...@apache.org> on 2022/03/31 15:42:00 UTC

[jira] [Comment Edited] (HADOOP-14837) Handle S3A "glacier" data

    [ https://issues.apache.org/jira/browse/HADOOP-14837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515405#comment-17515405 ] 

Ahmar Suhail edited comment on HADOOP-14837 at 3/31/22, 3:41 PM:
-----------------------------------------------------------------

[~stevel@apache.org] Object summaries include the storage class, which means we can filter without any additional HEAD calls. 
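That filtering step could be sketched roughly as below. This is a minimal illustration, not S3A code: the ListingEntry type and readableEntries() helper are hypothetical stand-ins for the storage class field that the SDK's object summaries expose (e.g. S3ObjectSummary#getStorageClass()).

```java
import java.util.List;
import java.util.stream.Collectors;

public class GlacierFilter {
    // Hypothetical stand-in for one entry of an S3 listing; the real SDK
    // carries the storage class on each object summary.
    record ListingEntry(String key, String storageClass) {}

    // Drop entries whose storage class marks them as archived (and hence
    // unreadable without a restore), keeping everything else.
    static List<ListingEntry> readableEntries(List<ListingEntry> entries) {
        return entries.stream()
            .filter(e -> !"GLACIER".equals(e.storageClass())
                      && !"DEEP_ARCHIVE".equals(e.storageClass()))
            .collect(Collectors.toList());
    }
}
```

Because the storage class is already in the listing response, this filter costs no extra requests.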

For getBlockLocations(), I was looking at how it's used in Spark, and found that it's called [here|https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala#L307]

If we implement getBlockLocations() in S3FS to return the storage type, we would have to make a HEAD call, which would slow down the Spark usage above; I'm not sure that's something we should do.

If we do want to implement getBlockLocations(), we could add a configuration option like `fs.s3a.get.file.locations` which, when enabled, would make the HEAD call, and otherwise just return the default location.
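A rough sketch of that gate, assuming a boolean config lookup in the style of Hadoop's Configuration#getBoolean(); everything here except the proposed fs.s3a.get.file.locations key is hypothetical.

```java
import java.util.function.Supplier;

public class BlockLocationsSketch {
    // Proposed key from the discussion above; default off to keep listings cheap.
    static final String KEY = "fs.s3a.get.file.locations";

    // Minimal stand-in for a Hadoop-style configuration lookup.
    interface Conf { boolean getBoolean(String key, boolean defVal); }

    // Only pay for the extra HEAD request when the option is enabled;
    // otherwise return a cheap default location with no S3 round trip.
    static String locate(Conf conf, Supplier<String> headCall) {
        if (conf.getBoolean(KEY, false)) {
            return headCall.get();   // one extra HEAD request per file
        }
        return "localhost";          // default location, no extra call
    }
}
```

Callers that never enable the key (such as the Spark listing path above) keep their current performance.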


> Handle S3A "glacier" data
> -------------------------
>
>                 Key: HADOOP-14837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14837
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.0.0-beta1
>            Reporter: Steve Loughran
>            Priority: Minor
>
> SPARK-21797 covers how, if you have AWS S3 set to copy some files to Glacier, they appear in the listing but GETs fail, and so does everything else.
> We should think about how best to handle this.
> # report better
> # if listings can identify files which are glaciated then maybe we could have an option to filter them out
> # test & see what happens



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org