Posted to common-dev@hadoop.apache.org by "Steve Loughran (Jira)" <ji...@apache.org> on 2021/10/29 12:15:00 UTC
[jira] [Resolved] (HADOOP-17984) Hadoop-aws jar is unable to read file from S3 if used with third party like MINIO
[ https://issues.apache.org/jira/browse/HADOOP-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Loughran resolved HADOOP-17984.
-------------------------------------
Resolution: Invalid
> Hadoop-aws jar is unable to read file from S3 if used with third party like MINIO
> ---------------------------------------------------------------------------------
>
> Key: HADOOP-17984
> URL: https://issues.apache.org/jira/browse/HADOOP-17984
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs/s3
> Affects Versions: 3.2.0
> Reporter: Naresh
> Priority: Minor
>
> Unable to read a file from S3 via Spark when the endpoint URL points to MinIO inside an EKS Kubernetes cluster. We are able to read and write from other clients and from the MinIO console, but when we read through Spark the DataFrame comes back empty. Calling dataframe.show() displays the following:
>
> ++
> ||
> ++
> ++
>
> *Spark Config:*
> .config("spark.hadoop.fs.s3a.endpoint", "http://127.0.0.1:9000") // minio url or port-forward to local
> .config("spark.hadoop.fs.s3a.access.key",<myaccesskey>)
> .config("spark.hadoop.fs.s3a.secret.key",<mysecretkey>)
>
> .config("spark.hadoop.fs.s3a.path.style.access", "true")
> .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
> .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
> .config("fs.s3a.committer.staging.conflict-mode", "replace")
> .config("fs.s3a.committer.name", "file")
> .config("fs.s3a.committer.threads", "20")
> .config("fs.s3a.threads.max", "20")
> .config("fs.s3a.fast.upload.buffer", "bytebuffer")
> .config("fs.s3a.fast.upload.active.blocks", "8")
> .config("fs.s3a.block.size", "128M")
> .config("mapred.input.dir.recursive","true")
> .config("spark.sql.parquet.binaryAsString", "true")
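[Editor's note] The S3A settings that typically matter for a MinIO endpoint can be collected as below. This is a minimal sketch, not the reporter's exact setup: the endpoint and credential values are placeholders, and `fs.s3a.connection.ssl.enabled` is an addition (MinIO here is served over plain HTTP).

```python
# Minimal sketch: S3A settings relevant to a MinIO endpoint.
# Endpoint and credentials are placeholders; adapt them to your cluster.
minio_s3a_conf = {
    # Point the S3A connector at the MinIO service instead of AWS.
    "spark.hadoop.fs.s3a.endpoint": "http://127.0.0.1:9000",
    # MinIO buckets are addressed by path, not by virtual-host name.
    "spark.hadoop.fs.s3a.path.style.access": "true",
    # Plain-HTTP endpoint, so SSL must be off (assumption, not from the report).
    "spark.hadoop.fs.s3a.connection.ssl.enabled": "false",
    "spark.hadoop.fs.s3a.access.key": "<myaccesskey>",
    "spark.hadoop.fs.s3a.secret.key": "<mysecretkey>",
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
}

def apply_conf(builder, conf):
    """Apply each key/value pair to a SparkSession.Builder-like object
    via its .config(key, value) method, returning the final builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

In PySpark the dict would be applied as `apply_conf(SparkSession.builder.appName("minio-test"), minio_s3a_conf).getOrCreate()`.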
>
>
> *JAR files:*
> hadoop-aws:3.2.0
> aws-java-sdk:1.12.30
> spark-core_2.12:3.1.2
> spark-sql_2.12:3.1.2
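[Editor's note] The classpath above pairs hadoop-aws 3.2.0 with aws-java-sdk 1.12.30, but hadoop-aws is only supported with the exact aws-java-sdk-bundle version it was built against. A small hypothetical checker illustrates the idea; the version mapping below is an assumption (1.11.375 is my recollection for Hadoop 3.2.0), so always confirm it against the hadoop-project POM of your release.

```python
# Hypothetical helper: flag a hadoop-aws / aws-java-sdk version mismatch.
# The mapping is for illustration only; verify the real value in the
# hadoop-project pom.xml of your exact Hadoop release.
EXPECTED_SDK = {
    "3.2.0": "1.11.375",  # assumed aws-java-sdk-bundle version for Hadoop 3.2.0
}

def sdk_matches(hadoop_aws_version, sdk_version):
    """Return True only if the SDK version equals the one the given
    hadoop-aws release is known (in the table above) to be built with."""
    expected = EXPECTED_SDK.get(hadoop_aws_version)
    return expected is not None and expected == sdk_version

# The reported classpath would fail this check:
# sdk_matches("3.2.0", "1.12.30") -> False
```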
>
> *Logs:*
> DEBUG S3AFileSystem:2121: Getting path status for s3a://<mybucket>/<myfolder>/2021/test1_2021-03-23_15_21_31.592.csv (2021/test1_2021-03-23_15_21_31.592.csv)
> 21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: object_metadata_requests += 1 -> 1
> 21/10/28 16:52:34 DEBUG S3AFileSystem:2189: Found exact file: normal file
> 21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: op_exists += 1 -> 1
> 21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: op_get_file_status += 1 -> 2
> 21/10/28 16:52:34 DEBUG S3AFileSystem:2121: Getting path status for s3a://mybbucket/myfolder/test1_2021-03-23_15_21_31.592.csv (2021/test1_2021-03-23_15_21_31.592.csv)
> 21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: object_metadata_requests += 1 -> 2
> 21/10/28 16:52:34 DEBUG S3AFileSystem:2189: Found exact file: normal file
> 21/10/28 16:52:34 DEBUG S3AFileSystem:1899: List status for path: s3a://mybbucket/myfolder/test1_2021-03-23_15_21_31.592.csv
> 21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: op_list_status += 1 -> 1
> 21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: op_get_file_status += 1 -> 3
> 21/10/28 16:52:34 DEBUG S3AFileSystem:2121: Getting path status for s3a://mybbucket/myfolder//test1_2021-03-23_15_21_31.592.csv (2021/test1_2021-03-23_15_21_31.592.csv)
> 21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: object_metadata_requests += 1 -> 3
> 21/10/28 16:52:34 DEBUG S3AFileSystem:2189: Found exact file: normal file
> 21/10/28 16:52:34 DEBUG S3AFileSystem:1930: Adding: rd (not a dir): s3a://mybbucket/myfolder//test1_2021-03-23_15_21_31.592.csv
> 21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: op_is_directory += 1 -> 2
> 21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: op_get_file_status += 1 -> 4
> 21/10/28 16:52:34 DEBUG S3AFileSystem:2121: Getting path status for s3a://mybbucket/myfolder//test1_2021-03-23_15_21_31.592.csv (2021/test1_2021-03-23_15_21_31.592.csv)
> 21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: object_metadata_requests += 1 -> 4
> 21/10/28 16:52:34 DEBUG S3AFileSystem:2189: Found exact file: normal file
> 21/10/28 16:52:34 DEBUG S3AFileSystem:1899: List status for path: s3a://mybbucket/myfolder//test1_2021-03-23_15_21_31.592.csv
> 21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: op_list_status += 1 -> 2
> 21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: op_get_file_status += 1 -> 5
> 21/10/28 16:52:34 DEBUG S3AFileSystem:2121: Getting path status for s3a://mybbucket/myfolder/test1_2021-03-23_15_21_31.592.csv (2021/test1_2021-03-23_15_21_31.592.csv)
> 21/10/28 16:52:34 DEBUG S3AStorageStatistics:63: object_metadata_requests += 1 -> 5
> 21/10/28 16:52:34 DEBUG S3AFileSystem:2189: Found exact file: normal file
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-dev-help@hadoop.apache.org