Posted to common-issues@hadoop.apache.org by "Steve Loughran (Jira)" <ji...@apache.org> on 2022/03/22 14:04:00 UTC
[jira] [Commented] (HADOOP-14837) Handle S3A "glacier" data

    [ https://issues.apache.org/jira/browse/HADOOP-14837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510508#comment-17510508 ] 

Steve Loughran commented on HADOOP-14837:
-----------------------------------------

Good questions. I have no idea what the right answers are.

bq. For reporting better, do we want to add in a new statistic, something like `objects_in_glacier` which will have the count of objects currently in glacier?

why not?

bq. In listings, we can add in a new option to filter out glacier files by doing something like `!summary.getStorageClass().equals("GLACIER")` in the acceptor here? After we do this and call `getContentSummary()` it won't return glacier files in the fileCount. 

I'm not worried about that. Is the storage type returned in the list call, allowing it to be filtered there? I wouldn't want to do any HEAD requests here.
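
A rough sketch of what list-time filtering plus the proposed `objects_in_glacier` count could look like, assuming the storage class does come back with each listing entry so no HEAD is needed. `Entry`, `accept()` and `countGlaciated()` are hypothetical stand-ins, not the S3A acceptor or the AWS SDK summary class, and treating "DEEP_ARCHIVE" the same as "GLACIER" is my assumption:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class GlacierListingSketch {

    /** Hypothetical stand-in for one entry of a LIST response (not an SDK type). */
    static final class Entry {
        final String key;
        final String storageClass; // may be null if the listing omits it

        Entry(String key, String storageClass) {
            this.key = key;
            this.storageClass = storageClass;
        }
    }

    // Assumption: these storage class names are the ones to treat as glaciated.
    static final Set<String> GLACIER_CLASSES = Set.of("GLACIER", "DEEP_ARCHIVE");

    /** True if the entry's storage class marks it as glaciated; null means "not glaciated". */
    static boolean isGlaciated(Entry e) {
        return e.storageClass != null && GLACIER_CLASSES.contains(e.storageClass);
    }

    /** Keeps only entries that are not glaciated, using listing data alone. */
    static List<Entry> accept(List<Entry> listing) {
        return listing.stream()
            .filter(e -> !isGlaciated(e))
            .collect(Collectors.toList());
    }

    /** Value a statistic such as objects_in_glacier could report for this listing. */
    static long countGlaciated(List<Entry> listing) {
        return listing.stream().filter(GlacierListingSketch::isGlaciated).count();
    }
}
```

The null check matters: a bare `summary.getStorageClass().equals("GLACIER")` would throw if the storage class were ever missing from an entry.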

bq. getBlockLocations()

There's special handling in Spark for that location, which says "run your work anywhere". We don't want to break that.

I think the best tactic here is to work out what people want to do and provide the bare minimum. Looking at some of the JIRAs, there's no consensus as to what people want. Do they want glaciated files to be skipped in queries, or for recovery to be triggered (somehow)? Returning the storage type ARCHIVE would be enough for anyone who wants to identify these files (distcp?) and at least then know there's a cost in accessing them.
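
To make the "return ARCHIVE" idea concrete, a minimal sketch of the mapping, under assumptions: the local `StorageType` enum is a stand-in for `org.apache.hadoop.fs.StorageType` so the snippet compiles on its own, `toStorageType()` is a hypothetical helper, and which S3 storage classes count as archived is a guess that would need agreeing on:

```java
class StorageTypeSketch {

    /** Local stand-in for org.apache.hadoop.fs.StorageType; only the two values used here. */
    enum StorageType { DISK, ARCHIVE }

    /**
     * Maps an S3 storage class name from a listing to a storage type.
     * Assumption: GLACIER and DEEP_ARCHIVE are the archived classes;
     * everything else, including a missing value, is reported as DISK.
     */
    static StorageType toStorageType(String storageClass) {
        if ("GLACIER".equals(storageClass) || "DEEP_ARCHIVE".equals(storageClass)) {
            return StorageType.ARCHIVE;
        }
        return StorageType.DISK;
    }
}
```

A caller such as distcp could then check for ARCHIVE and decide for itself whether to skip the file or pay the access cost, without S3A committing to either policy.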

> Handle S3A "glacier" data
> -------------------------
>
>                 Key: HADOOP-14837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14837
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.0.0-beta1
>            Reporter: Steve Loughran
>            Priority: Minor
>
> SPARK-21797 covers how if you have AWS S3 set to copy some files to glacier, they appear in the listing but GETs fail, and so does everything else
> We should think about how best to handle this.
> # report better
> # if listings can identify files which are glaciated then maybe we could have an option to filter them out
> # test & see what happens



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
