Posted to common-issues@hadoop.apache.org by "Steve Loughran (Jira)" <ji...@apache.org> on 2022/11/22 11:36:00 UTC

[jira] [Resolved] (HADOOP-18523) Allow to retrieve an object from MinIO (S3 API) with a very restrictive policy

     [ https://issues.apache.org/jira/browse/HADOOP-18523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran resolved HADOOP-18523.
-------------------------------------
    Resolution: Won't Fix

> Allow to retrieve an object from MinIO (S3 API) with a very restrictive policy
> ------------------------------------------------------------------------------
>
>                 Key: HADOOP-18523
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18523
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>            Reporter: Sébastien Burton
>            Priority: Major
>
> Hello,
> We're using Spark ({{"org.apache.spark:spark-[catalyst|core|sql]_2.12:3.2.2"}}) and Hadoop ({{"org.apache.hadoop:hadoop-common:3.3.3"}}) and want to retrieve an object stored in a MinIO bucket (MinIO implements the S3 API). Spark relies on Hadoop for this operation.
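> For context, the read itself is nothing special; a minimal sketch of what we do (endpoint, credentials and object name below are placeholders, not our real values):
> {code:java}
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
>
> // Point S3A at MinIO (placeholder endpoint and credentials).
> SparkSession spark = SparkSession.builder()
>     .appName("minio-read")
>     .config("spark.hadoop.fs.s3a.endpoint", "https://minio.example.com")
>     .config("spark.hadoop.fs.s3a.path.style.access", "true")
>     .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
>     .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
>     .getOrCreate();
>
> // This is the call that ends with the 403 shown further down.
> Dataset<Row> dataset = spark.read().csv("s3a://minio-bucket/object");
> {code}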
> The MinIO bucket (that we don't manage) is configured with a very restrictive policy that only allows us to retrieve the object (and nothing else). Something like:
> {code:java}
> {
>   "statement": [
>     {
>       "effect": "Allow",
>       "action": [ "s3:GetObject" ],
>       "resource": [ "arn:aws:s3:::minio-bucket/object" ]
>     }
>   ]
> }{code}
> Using the AWS CLI, we can retrieve the object without any problem.
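> (A bare GetObject is all the CLI needs here, which in the AWS SDK for Java v1 -- the SDK the S3A connector also uses, as the stack trace below shows -- amounts to roughly the following; endpoint and credentials are placeholders:)
> {code:java}
> import com.amazonaws.auth.AWSStaticCredentialsProvider;
> import com.amazonaws.auth.BasicAWSCredentials;
> import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
> import com.amazonaws.services.s3.AmazonS3;
> import com.amazonaws.services.s3.AmazonS3ClientBuilder;
> import com.amazonaws.services.s3.model.S3Object;
>
> // Plain GetObject against MinIO (placeholder endpoint and credentials).
> AmazonS3 s3 = AmazonS3ClientBuilder.standard()
>     .withEndpointConfiguration(new EndpointConfiguration("https://minio.example.com", "us-east-1"))
>     .withPathStyleAccessEnabled(true)
>     .withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials("<access-key>", "<secret-key>")))
>     .build();
>
> S3Object object = s3.getObject("minio-bucket", "object"); // allowed by the policy above
> {code}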
> When we try with Spark's {{DataFrameReader}}, however, we receive an HTTP 403 response (access denied) from MinIO:
> {code:java}
> java.nio.file.AccessDeniedException: s3a://minio-bucket/object: getFileStatus on s3a://minio-bucket/object: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied. (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; ...
>     at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:255)
>     at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:175)
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3858)
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3688)
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$isDirectory$35(S3AFileSystem.java:4724)
>     at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499)
>     at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:444)
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2337)
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2356)
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4722)
>     at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54)
>     at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
>     at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
>     at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
>     at scala.Option.getOrElse(Option.scala:189)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
>     at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:571)
>     at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:481)
>     at com.soprabanking.dxp.pure.bf.dataaccess.S3Storage.loadDataset(S3Storage.java:55)
>     at com.soprabanking.dxp.pure.bf.business.step.DatasetLoader.lambda$doLoad$3(DatasetLoader.java:148)
>     at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:125)
>     at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1816)
>     at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:151)
>     at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1816)
>     at reactor.core.publisher.MonoFlatMap$FlatMapInner.onNext(MonoFlatMap.java:249)
>     at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1816)
>     at reactor.core.publisher.MonoZip$ZipCoordinator.signal(MonoZip.java:251)
>     at reactor.core.publisher.MonoZip$ZipInner.onNext(MonoZip.java:336)
>     at reactor.core.publisher.Operators$ScalarSubscription.request(Operators.java:2398)
>     at reactor.core.publisher.MonoZip$ZipInner.onSubscribe(MonoZip.java:325)
>     at reactor.core.publisher.MonoJust.subscribe(MonoJust.java:55)
>     at reactor.core.publisher.Mono.subscribe(Mono.java:4400)
>     at reactor.core.publisher.MonoZip.subscribe(MonoZip.java:128)
>     at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:157)
>     at reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber.onNext(FluxSwitchIfEmpty.java:74)
>     at reactor.core.publisher.FluxFilter$FilterSubscriber.onNext(FluxFilter.java:113)
>     at reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber.onNext(FluxSwitchIfEmpty.java:74)
>     at reactor.core.publisher.FluxFilterFuseable$FilterFuseableSubscriber.onNext(FluxFilterFuseable.java:118)
>     at reactor.core.publisher.MonoPeekTerminal$MonoTerminalPeekSubscriber.onNext(MonoPeekTerminal.java:180)
>     at reactor.core.publisher.FluxPeekFuseable$PeekFuseableConditionalSubscriber.onNext(FluxPeekFuseable.java:503)
>     at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onNext(FluxHide.java:137)
>     at reactor.core.publisher.Operators$MonoInnerProducerBase.complete(Operators.java:2664)
>     at reactor.core.publisher.MonoSingle$SingleSubscriber.onComplete(MonoSingle.java:180)
>     at com.jakewharton.retrofit2.adapter.reactor.BodyFlux$BodySubscriber.onComplete(BodyFlux.java:80)
>     at reactor.core.publisher.StrictSubscriber.onComplete(StrictSubscriber.java:123)
>     at reactor.core.publisher.FluxCreate$BaseSink.complete(FluxCreate.java:439)
>     at reactor.core.publisher.FluxCreate$LatestAsyncSink.drain(FluxCreate.java:945)
>     at reactor.core.publisher.FluxCreate$LatestAsyncSink.complete(FluxCreate.java:892)
>     at reactor.core.publisher.FluxCreate$SerializedFluxSink.drainLoop(FluxCreate.java:240)
>     at reactor.core.publisher.FluxCreate$SerializedFluxSink.drain(FluxCreate.java:206)
>     at reactor.core.publisher.FluxCreate$SerializedFluxSink.complete(FluxCreate.java:197)
>     at com.jakewharton.retrofit2.adapter.reactor.EnqueueSinkConsumer$DisposableCallback.onResponse(EnqueueSinkConsumer.java:52)
>     at retrofit2.OkHttpCall$1.onResponse(OkHttpCall.java:161)
>     at brave.okhttp3.TraceContextCall$TraceContextCallback.onResponse(TraceContextCall.java:95)
>     at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:519)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.base/java.lang.Thread.run(Unknown Source){code}
> The credentials are set correctly, but under the hood Hadoop asks MinIO whether the path is a directory (a check we don't need), and it is that check which fails.
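> Our reading of the stack trace (an assumption on our part, we have not traced the S3A code) is that this directory check is answered with a bucket listing rather than a GetObject, which would explain why the extra {{s3:ListBucket}} statement below is what makes the difference. Sketched with the plain Hadoop FileSystem API:
> {code:java}
> import java.net.URI;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> // Placeholder endpoint and credentials.
> Configuration conf = new Configuration();
> conf.set("fs.s3a.endpoint", "https://minio.example.com");
> conf.set("fs.s3a.access.key", "<access-key>");
> conf.set("fs.s3a.secret.key", "<secret-key>");
> FileSystem fs = FileSystem.get(URI.create("s3a://minio-bucket/"), conf);
>
> // Spark asks "is this a directory?" before reading. With our policy this
> // probe (apparently a LIST with prefix "object/") is denied, even though a
> // plain GetObject on the key itself would be allowed.
> boolean isDir = fs.isDirectory(new Path("s3a://minio-bucket/object")); // -> AccessDeniedException
> {code}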
> We can retrieve the object if MinIO's policy is changed to something like the following, but changing the policy isn't an option for us:
> {code:java}
> {
>   "statement": [
>     {
>       "effect": "Allow",
>       "action": [ "s3:GetObject" ],
>       "resource": [ "arn:aws:s3:::minio-bucket/object" ]
>     },
>     {
>       "effect": "Allow",
>       "action": [ "s3:ListBucket" ],
>       "resource": [ "arn:aws:s3:::minio-bucket/" ],
>       "condition": {
>         "StringLike": {
>           "s3:prefix": [ "object", "object/" ]
>         }
>       }
>     }
>   ]
> }{code}
> We couldn't find any way to configure Hadoop so that it just attempts to retrieve the object. Reading HADOOP-17454, it feels like it could be possible to provide options to fine-tune Hadoop's behaviour.
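> To make the ask concrete: what we are after is roughly a switch like the one sketched below. The property name is purely hypothetical, only to illustrate the intent; as far as we can tell nothing like it exists today.
> {code:java}
> // Hypothetical option, shown only to illustrate the kind of switch we are
> // asking about -- it does NOT exist in Hadoop today (as far as we know):
> // tell S3A to skip the directory probe and simply attempt the GetObject.
> // (Same imports as the first sketch above.)
> SparkSession spark = SparkSession.builder()
>     .config("spark.hadoop.fs.s3a.directory.probe.skip", "true") // hypothetical property name
>     .getOrCreate();
> Dataset<Row> dataset = spark.read().csv("s3a://minio-bucket/object");
> {code}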
> Do such options already exist? If not, would it be reasonable to add them?
> Regards,
> Sébastien
> Please note this is my first time here: I hope I picked the right project, issue type and priority (I tried my best looking around). If not, I'm very sorry about that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
