Posted to issues@spark.apache.org by "Steve Loughran (Jira)" <ji...@apache.org> on 2020/09/02 14:43:00 UTC

[jira] [Commented] (SPARK-32766) s3a: bucket names with dots cannot be used

    [ https://issues.apache.org/jira/browse/SPARK-32766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17189279#comment-17189279 ] 

Steve Loughran commented on SPARK-32766:
----------------------------------------

Not going to be fixed in the s3a code, even if there were an easy way. By the end of the month it will be impossible to talk to any newly created S3 bucket with a "." in its name. Existing ones may keep working, but not in this case, where the mix of dots and digits in the hostname confuses the Java URI parser.

https://aws.amazon.com/blogs/aws/amazon-s3-path-deprecation-plan-the-rest-of-the-story/
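
For anyone wondering where exactly the parsing falls over, here is a minimal sketch in a plain Scala REPL (no Spark or AWS SDK needed; the bucket name is taken from the repro below). java.net.URI only accepts a server-based authority when the hostname's final label starts with a letter, so a name ending in ".0" is demoted to a registry-based authority and getHost() returns null, which is exactly what both stack traces below trip over:

{noformat}
// Plain Scala REPL session, no Spark required.
// "test-bucket-name-v1.0" is not a valid RFC 2396 hostname (the final
// label "0" starts with a digit), so java.net.URI falls back to a
// registry-based authority: getHost() is null while getAuthority()
// still holds the raw string. S3A takes the bucket name from the URI
// host, hence "null uri host" / "bucketName must be specified" below.
scala> val uri = new java.net.URI("s3a://test-bucket-name-v1.0/foo.csv")
scala> uri.getHost
res0: String = null
scala> uri.getAuthority
res1: String = test-bucket-name-v1.0
{noformat}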


> s3a: bucket names with dots cannot be used
> ------------------------------------------
>
>                 Key: SPARK-32766
>                 URL: https://issues.apache.org/jira/browse/SPARK-32766
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 3.0.0
>            Reporter: Ondrej Kokes
>            Priority: Minor
>
> Running vanilla spark with
> {noformat}
> --packages=org.apache.hadoop:hadoop-aws:x.y.z
> {noformat}
> I cannot read from S3 if the bucket name contains a dot (which is a valid bucket name).
> A minimal reproducible example looks like this:
> {noformat}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
>
> if __name__ == '__main__':
>     spark = (SparkSession
>         .builder
>         .appName('my_app')
>         .master("local[*]")
>         .getOrCreate()
>     )
>     spark.read.csv("s3a://test-bucket-name-v1.0/foo.csv")
> {noformat}
> Or just launch a spark-shell with `--packages=(...)hadoop-aws(...)` and read that CSV, as sketched below. I created the same bucket without the period and it worked fine.
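> For concreteness, one such invocation might look like this (using the hadoop-aws 3.2.0 version from the Hadoop 3 attempt below; substitute the version matching your Hadoop build):
> {noformat}
> $ spark-shell --packages org.apache.hadoop:hadoop-aws:3.2.0
> scala> spark.read.csv("s3a://test-bucket-name-v1.0/foo.csv").show()
> {noformat}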
> *Now I'm not sure whether this is a matter of prepping the path names before passing them to the aws-sdk, or whether the fault lies within the SDK itself. I am not Java-savvy enough to investigate the issue further, but I tried to make the repro as short as possible.*
> ----
> I get different errors depending on which Hadoop distribution I use. With the default PySpark distribution (which bundles Hadoop 2), I get the following (using hadoop-aws:2.7.4):
> {noformat}
> scala> spark.read.csv("s3a://okokes-test-v2.5/foo.csv").show()
> java.lang.IllegalArgumentException: The bucketName parameter must be specified.
>   at com.amazonaws.services.s3.AmazonS3Client.assertParameterNotNull(AmazonS3Client.java:2816)
>   at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1026)
>   at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
>   at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
>   at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
>   at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>   at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
>   at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
>   at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:535)
>   ... 47 elided
> {noformat}
> When I downloaded Spark 3.0.0 with Hadoop 3 and ran a spark-shell there, I got this error instead (with hadoop-aws:3.2.0):
> {noformat}
> java.lang.NullPointerException: null uri host.
>   at java.base/java.util.Objects.requireNonNull(Objects.java:246)
>   at org.apache.hadoop.fs.s3native.S3xLoginHelper.buildFSURI(S3xLoginHelper.java:71)
>   at org.apache.hadoop.fs.s3a.S3AFileSystem.setUri(S3AFileSystem.java:470)
>   at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:235)
>   at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
>   at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
>   at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
>   at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
>   at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:535)
>   ... 47 elided
> {noformat}



