Posted to common-issues@hadoop.apache.org by "Steve Loughran (Jira)" <ji...@apache.org> on 2022/11/09 11:33:00 UTC

[jira] [Commented] (HADOOP-18523) Allow to retrieve an object from MinIO (S3 API) with a very restrictive policy

    [ https://issues.apache.org/jira/browse/HADOOP-18523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631009#comment-17631009 ] 

Steve Loughran commented on HADOOP-18523:
-----------------------------------------

Don't blame the s3a code here: Spark is calling fs.isDirectory(hdfsPath) before it reads the path.
Going to have to close this as a wontfix.

In the theoretical world of open source, anything is fixable. Here I'd recommend commenting out that bit of org.apache.spark.sql.execution.streaming.FileStreamSink.hasMetadata in the private fork of Spark you will have to maintain.
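
For context, the check in question looks roughly like this (a simplified sketch, not a verbatim copy of the 3.2 source); the fs.isDirectory call is the frame at FileStreamSink.scala:54 in your stack trace:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.internal.SQLConf

// Simplified sketch of org.apache.spark.sql.execution.streaming.FileStreamSink.hasMetadata
// (Spark 3.2.x); not a verbatim copy of the source.
def hasMetadata(path: Seq[String], hadoopConf: Configuration, sqlConf: SQLConf): Boolean =
  path match {
    case Seq(singlePath) =>
      val hdfsPath = new Path(singlePath)
      val fs = hdfsPath.getFileSystem(hadoopConf)
      // This directory probe is what triggers the S3 requests that a
      // GetObject-only policy rejects with 403. A private fork could
      // short-circuit it (e.g. return false straight away) when the path
      // is known to be a plain object rather than a streaming-sink directory.
      if (fs.isDirectory(hdfsPath)) {
        // only then look for the streaming sink's _spark_metadata log
        fs.exists(new Path(hdfsPath, "_spark_metadata"))
      } else {
        false
      }
    case _ => false
  }
{code}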

That, or hack around the s3a connector: it is written for AWS S3, where permission to call ListObjects against the bucket is expected.

Leaving it as your homework, I'm afraid.

> Allow to retrieve an object from MinIO (S3 API) with a very restrictive policy
> ------------------------------------------------------------------------------
>
>                 Key: HADOOP-18523
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18523
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>            Reporter: Sébastien Burton
>            Priority: Major
>
> Hello,
> We're using Spark ({{"org.apache.spark:spark-[catalyst|core|sql]_2.12:3.2.2"}}) and Hadoop ({{"org.apache.hadoop:hadoop-common:3.3.3"}}), and want to retrieve an object stored in a MinIO bucket (MinIO implements the S3 API). Spark relies on Hadoop for this operation.
> The MinIO bucket (that we don't manage) is configured with a very restrictive policy that only allows us to retrieve the object (and nothing else). Something like:
> {code:java}
> {
>   "statement": [
>     {
>       "effect": "Allow",
>       "action": [ "s3:GetObject" ],
>       "resource": [ "arn:aws:s3:::minio-bucket/object" ]
>     }
>   ]
> }{code}
> Using the AWS CLI, we can retrieve the object without any problem.
> When we try with Spark's {{DataFrameReader}}, however, we receive an HTTP 403 response (access denied) from MinIO:
> {code:java}
> java.nio.file.AccessDeniedException: s3a://minio-bucket/object: getFileStatus on s3a://minio-bucket/object: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied. (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; ...
>     at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:255)
>     at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:175)
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3858)
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3688)
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$isDirectory$35(S3AFileSystem.java:4724)
>     at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499)
>     at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:444)
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2337)
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2356)
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4722)
>     at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54)
>     at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
>     at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
>     at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
>     at scala.Option.getOrElse(Option.scala:189)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
>     at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:571)
>     at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:481)
>     at com.soprabanking.dxp.pure.bf.dataaccess.S3Storage.loadDataset(S3Storage.java:55)
>     at com.soprabanking.dxp.pure.bf.business.step.DatasetLoader.lambda$doLoad$3(DatasetLoader.java:148)
>     at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:125)
>     at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1816)
>     at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:151)
>     at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1816)
>     at reactor.core.publisher.MonoFlatMap$FlatMapInner.onNext(MonoFlatMap.java:249)
>     at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1816)
>     at reactor.core.publisher.MonoZip$ZipCoordinator.signal(MonoZip.java:251)
>     at reactor.core.publisher.MonoZip$ZipInner.onNext(MonoZip.java:336)
>     at reactor.core.publisher.Operators$ScalarSubscription.request(Operators.java:2398)
>     at reactor.core.publisher.MonoZip$ZipInner.onSubscribe(MonoZip.java:325)
>     at reactor.core.publisher.MonoJust.subscribe(MonoJust.java:55)
>     at reactor.core.publisher.Mono.subscribe(Mono.java:4400)
>     at reactor.core.publisher.MonoZip.subscribe(MonoZip.java:128)
>     at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:157)
>     at reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber.onNext(FluxSwitchIfEmpty.java:74)
>     at reactor.core.publisher.FluxFilter$FilterSubscriber.onNext(FluxFilter.java:113)
>     at reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber.onNext(FluxSwitchIfEmpty.java:74)
>     at reactor.core.publisher.FluxFilterFuseable$FilterFuseableSubscriber.onNext(FluxFilterFuseable.java:118)
>     at reactor.core.publisher.MonoPeekTerminal$MonoTerminalPeekSubscriber.onNext(MonoPeekTerminal.java:180)
>     at reactor.core.publisher.FluxPeekFuseable$PeekFuseableConditionalSubscriber.onNext(FluxPeekFuseable.java:503)
>     at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onNext(FluxHide.java:137)
>     at reactor.core.publisher.Operators$MonoInnerProducerBase.complete(Operators.java:2664)
>     at reactor.core.publisher.MonoSingle$SingleSubscriber.onComplete(MonoSingle.java:180)
>     at com.jakewharton.retrofit2.adapter.reactor.BodyFlux$BodySubscriber.onComplete(BodyFlux.java:80)
>     at reactor.core.publisher.StrictSubscriber.onComplete(StrictSubscriber.java:123)
>     at reactor.core.publisher.FluxCreate$BaseSink.complete(FluxCreate.java:439)
>     at reactor.core.publisher.FluxCreate$LatestAsyncSink.drain(FluxCreate.java:945)
>     at reactor.core.publisher.FluxCreate$LatestAsyncSink.complete(FluxCreate.java:892)
>     at reactor.core.publisher.FluxCreate$SerializedFluxSink.drainLoop(FluxCreate.java:240)
>     at reactor.core.publisher.FluxCreate$SerializedFluxSink.drain(FluxCreate.java:206)
>     at reactor.core.publisher.FluxCreate$SerializedFluxSink.complete(FluxCreate.java:197)
>     at com.jakewharton.retrofit2.adapter.reactor.EnqueueSinkConsumer$DisposableCallback.onResponse(EnqueueSinkConsumer.java:52)
>     at retrofit2.OkHttpCall$1.onResponse(OkHttpCall.java:161)
>     at brave.okhttp3.TraceContextCall$TraceContextCallback.onResponse(TraceContextCall.java:95)
>     at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:519)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.base/java.lang.Thread.run(Unknown Source){code}
> The credentials are set correctly, but under the hood Hadoop asks MinIO whether the object is a directory (which we don't want), and it is this check that fails.
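> For illustration, reading the object directly through the Hadoop FileSystem API (bypassing Spark's {{DataFrameReader}}) seems like it would avoid the directory check, since only the object itself is accessed - a minimal, untested sketch (bucket and key are the same placeholders as above):
> {code:scala}
> import java.nio.charset.StandardCharsets
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.Path
>
> // hypothetical bucket/key; fs.s3a.endpoint and credentials assumed configured as usual
> val path = new Path("s3a://minio-bucket/object")
> val conf = new Configuration()
> val fs = path.getFileSystem(conf)
>
> // open() reads the object itself and should only need s3:GetObject on that key
> val in = fs.open(path)
> val content =
>   try new String(in.readAllBytes(), StandardCharsets.UTF_8)
>   finally in.close()
>
> // the content could then be handed to Spark, e.g. spark.read.csv on a Dataset[String]
> {code}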
> We can retrieve the object if we change MinIO's policy (which isn't an option for us) to something like:
> {code:java}
> {
>   "statement": [
>     {
>       "effect": "Allow",
>       "action": [ "s3:GetObject" ],
>       "resource": [ "arn:aws:s3:::minio-bucket/object" ]
>     },
>     {
>       "effect": "Allow",
>       "action": [ "s3:ListBucket" ],
>       "resource": [ "arn:aws:s3:::minio-bucket/" ],
>       "condition": {
>         "StringLike": {
>           "s3:prefix": [ "object", "object/" ]
>         }
>       }
>     }
>   ]
> }{code}
> We couldn't find any way to configure Hadoop so that it just attempts to retrieve the object. Reading HADOOP-17454, it feels like it could be possible to provide options to fine-tune Hadoop's behaviour.
> Are there such options? If not, would such behaviour be reasonable to add?
> Regards,
> Sébastien
> Please note this is my first time here: I hope I picked the right project, issue type and priority (I tried my best looking around). If not, I'm very sorry about that.



