Posted to common-issues@hadoop.apache.org by "Steve Loughran (Jira)" <ji...@apache.org> on 2022/11/22 11:36:00 UTC
[jira] [Resolved] (HADOOP-18523) Allow to retrieve an object from MinIO (S3 API) with a very restrictive policy
[ https://issues.apache.org/jira/browse/HADOOP-18523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Loughran resolved HADOOP-18523.
-------------------------------------
Resolution: Won't Fix
> Allow to retrieve an object from MinIO (S3 API) with a very restrictive policy
> ------------------------------------------------------------------------------
>
> Key: HADOOP-18523
> URL: https://issues.apache.org/jira/browse/HADOOP-18523
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs/s3
> Reporter: Sébastien Burton
> Priority: Major
>
> Hello,
> We're using Spark ({{"org.apache.spark:spark-[catalyst|core|sql]_2.12:3.2.2"}}) and Hadoop ({{"org.apache.hadoop:hadoop-common:3.3.3"}}) and want to retrieve an object stored in a MinIO bucket (MinIO implements the S3 API). Spark relies on Hadoop for this operation.
> The MinIO bucket (that we don't manage) is configured with a very restrictive policy that only allows us to retrieve the object (and nothing else). Something like:
> {code:java}
> {
>   "statement": [
>     {
>       "effect": "Allow",
>       "action": [ "s3:GetObject" ],
>       "resource": [ "arn:aws:s3:::minio-bucket/object" ]
>     }
>   ]
> }{code}
> Using the AWS CLI, we can retrieve the object without any problem.
> When we try with Spark's {{DataFrameReader}}, we receive an HTTP 403 response (access denied) from MinIO:
> {code:java}
> java.nio.file.AccessDeniedException: s3a://minio-bucket/object: getFileStatus on s3a://minio-bucket/object: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied. (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; ...
> at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:255)
> at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:175)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3858)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3688)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$isDirectory$35(S3AFileSystem.java:4724)
> at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499)
> at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:444)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2337)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2356)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4722)
> at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54)
> at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
> at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
> at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
> at scala.Option.getOrElse(Option.scala:189)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:571)
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:481)
> at com.soprabanking.dxp.pure.bf.dataaccess.S3Storage.loadDataset(S3Storage.java:55)
> at com.soprabanking.dxp.pure.bf.business.step.DatasetLoader.lambda$doLoad$3(DatasetLoader.java:148)
> at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:125)
> at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1816)
> at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:151)
> at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1816)
> at reactor.core.publisher.MonoFlatMap$FlatMapInner.onNext(MonoFlatMap.java:249)
> at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1816)
> at reactor.core.publisher.MonoZip$ZipCoordinator.signal(MonoZip.java:251)
> at reactor.core.publisher.MonoZip$ZipInner.onNext(MonoZip.java:336)
> at reactor.core.publisher.Operators$ScalarSubscription.request(Operators.java:2398)
> at reactor.core.publisher.MonoZip$ZipInner.onSubscribe(MonoZip.java:325)
> at reactor.core.publisher.MonoJust.subscribe(MonoJust.java:55)
> at reactor.core.publisher.Mono.subscribe(Mono.java:4400)
> at reactor.core.publisher.MonoZip.subscribe(MonoZip.java:128)
> at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:157)
> at reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber.onNext(FluxSwitchIfEmpty.java:74)
> at reactor.core.publisher.FluxFilter$FilterSubscriber.onNext(FluxFilter.java:113)
> at reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber.onNext(FluxSwitchIfEmpty.java:74)
> at reactor.core.publisher.FluxFilterFuseable$FilterFuseableSubscriber.onNext(FluxFilterFuseable.java:118)
> at reactor.core.publisher.MonoPeekTerminal$MonoTerminalPeekSubscriber.onNext(MonoPeekTerminal.java:180)
> at reactor.core.publisher.FluxPeekFuseable$PeekFuseableConditionalSubscriber.onNext(FluxPeekFuseable.java:503)
> at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onNext(FluxHide.java:137)
> at reactor.core.publisher.Operators$MonoInnerProducerBase.complete(Operators.java:2664)
> at reactor.core.publisher.MonoSingle$SingleSubscriber.onComplete(MonoSingle.java:180)
> at com.jakewharton.retrofit2.adapter.reactor.BodyFlux$BodySubscriber.onComplete(BodyFlux.java:80)
> at reactor.core.publisher.StrictSubscriber.onComplete(StrictSubscriber.java:123)
> at reactor.core.publisher.FluxCreate$BaseSink.complete(FluxCreate.java:439)
> at reactor.core.publisher.FluxCreate$LatestAsyncSink.drain(FluxCreate.java:945)
> at reactor.core.publisher.FluxCreate$LatestAsyncSink.complete(FluxCreate.java:892)
> at reactor.core.publisher.FluxCreate$SerializedFluxSink.drainLoop(FluxCreate.java:240)
> at reactor.core.publisher.FluxCreate$SerializedFluxSink.drain(FluxCreate.java:206)
> at reactor.core.publisher.FluxCreate$SerializedFluxSink.complete(FluxCreate.java:197)
> at com.jakewharton.retrofit2.adapter.reactor.EnqueueSinkConsumer$DisposableCallback.onResponse(EnqueueSinkConsumer.java:52)
> at retrofit2.OkHttpCall$1.onResponse(OkHttpCall.java:161)
> at brave.okhttp3.TraceContextCall$TraceContextCallback.onResponse(TraceContextCall.java:95)
> at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:519)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.base/java.lang.Thread.run(Unknown Source){code}
> The credentials are set correctly, but under the hood Hadoop calls MinIO to check whether the object is a directory (which we don't want), and this check is what fails.
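> For context, the s3a connector is pointed at MinIO with configuration along these lines (the property names are standard s3a options; the endpoint and credential values below are placeholders, not our real setup):
> {code:xml}
> <!-- core-site.xml (placeholder values) -->
> <property>
>   <name>fs.s3a.endpoint</name>
>   <value>https://minio.example.com:9000</value>
> </property>
> <property>
>   <name>fs.s3a.path.style.access</name>
>   <value>true</value>
> </property>
> <property>
>   <name>fs.s3a.access.key</name>
>   <value>ACCESS_KEY</value>
> </property>
> <property>
>   <name>fs.s3a.secret.key</name>
>   <value>SECRET_KEY</value>
> </property>
> {code}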
> We can retrieve the object if we change MinIO's policy (which isn't an option for us) to something like:
> {code:java}
> {
>   "statement": [
>     {
>       "effect": "Allow",
>       "action": [ "s3:GetObject" ],
>       "resource": [ "arn:aws:s3:::minio-bucket/object" ]
>     },
>     {
>       "effect": "Allow",
>       "action": [ "s3:ListBucket" ],
>       "resource": [ "arn:aws:s3:::minio-bucket/" ],
>       "condition": {
>         "StringLike": {
>           "s3:prefix": [ "object", "object/" ]
>         }
>       }
>     }
>   ]
> }{code}
> We couldn't find any way to configure Hadoop so that it just attempts to retrieve the object. Reading HADOOP-17454, it feels like it could be possible to provide options to fine-tune Hadoop's behaviour.
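> For illustration, a direct read through Hadoop's {{FileSystem}} API only issues a HEAD on the exact key (which {{s3:GetObject}} covers) and might avoid the {{isDirectory}} probe that triggers the denied LIST call. A minimal sketch, bypassing {{DataFrameReader}} (endpoint and credentials are placeholders):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> 
> public class DirectObjectRead {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     // Placeholder endpoint and credentials for the MinIO deployment.
>     conf.set("fs.s3a.endpoint", "https://minio.example.com:9000");
>     conf.set("fs.s3a.path.style.access", "true");
>     conf.set("fs.s3a.access.key", "ACCESS_KEY");
>     conf.set("fs.s3a.secret.key", "SECRET_KEY");
> 
>     Path object = new Path("s3a://minio-bucket/object");
>     try (FileSystem fs = object.getFileSystem(conf);
>          FSDataInputStream in = fs.open(object)) {
>       // Read and print the first bytes of the object.
>       byte[] buffer = new byte[1024];
>       int read = in.read(buffer);
>       if (read > 0) {
>         System.out.println(new String(buffer, 0, read));
>       }
>     }
>   }
> }
> {code}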
> Are there such options? If not, is it a reasonable behaviour to put in place?
> Regards,
> Sébastien
> Please note this is my first time here: I hope I picked the right project, issue type and priority (I tried my best looking around). If not, I'm very sorry about that.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org