Posted to issues@spark.apache.org by "Michel Lemay (JIRA)" <ji...@apache.org> on 2017/03/23 11:55:42 UTC

[jira] [Comment Edited] (SPARK-20061) Reading a file with colon (:) from S3 fails with URISyntaxException

    [ https://issues.apache.org/jira/browse/SPARK-20061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15938152#comment-15938152 ] 

Michel Lemay edited comment on SPARK-20061 at 3/23/17 11:54 AM:
----------------------------------------------------------------

I don't know about the HDFS API, but we've never had any issues accessing our files with the regular AWS S3 CLI and Python scripts.

The [S3 specifications|http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html] say that the colon falls into the "supported but may require special handling" category.

Spark supports it as well when not using wildcards in the path. For instance, sc.textFile("s3://mybucket/path/subfolder1/").count works. So what is the difference?
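For what it's worth, the stack trace suggests the difference lies in Hadoop's Path/URI handling rather than in S3 itself: Path treats everything before the first colon as a URI scheme when that colon precedes any slash, and java.net.URI then rejects a non-null scheme paired with a relative path. A minimal sketch of that failure mode in plain Scala (no Spark or Hadoop needed; the filename here is a shortened, hypothetical stand-in for the real one):

```scala
import java.net.{URI, URISyntaxException}

// Shortened, hypothetical stand-in for the real object name.
val name = "2017-01-06T20:33:45.255-sample.json"

// Hadoop's Path constructor takes everything before the first ':' as a
// URI scheme whenever that colon comes before the first '/'.
val colon  = name.indexOf(':')
val scheme = name.substring(0, colon)   // "2017-01-06T20"
val rest   = name.substring(colon + 1)  // "33:45.255-sample.json"

// Path.initialize then builds a java.net.URI from those pieces; a
// non-null scheme combined with a relative path is rejected.
val error =
  try { new URI(scheme, null, rest, null, null); None }
  catch { case e: URISyntaxException => Some(e.getMessage) }

println(error)
// Some(Relative path in absolute URI: 2017-01-06T20:33:45.255-sample.json)
```

This reproduces the exact message from the stack trace, which would explain why only the glob code path fails: it is the only place where the bare filename is re-parsed as a Path on its own.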




> Reading a file with colon (:) from S3 fails with URISyntaxException
> -------------------------------------------------------------------
>
>                 Key: SPARK-20061
>                 URL: https://issues.apache.org/jira/browse/SPARK-20061
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.1.0
>         Environment: EC2, AWS
>            Reporter: Michel Lemay
>
> When reading a bunch of files from s3 using wildcards, it fails with the following exception:
> {code}
> scala> val fn = "s3a://mybucket/path/*/"
> scala> val ds = spark.readStream.schema(schema).json(fn)
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2017-01-06T20:33:45.255-analyticsqa-49569270507599054034141623773442922465540524816321216514.json
>   at org.apache.hadoop.fs.Path.initialize(Path.java:205)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:171)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:93)
>   at org.apache.hadoop.fs.Globber.glob(Globber.java:241)
>   at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1657)
>   at org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:237)
>   at org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:243)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$2.apply(DataSource.scala:131)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$2.apply(DataSource.scala:127)
>   at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at org.apache.spark.sql.execution.datasources.DataSource.tempFileIndex$lzycompute$1(DataSource.scala:127)
>   at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$tempFileIndex$1(DataSource.scala:124)
>   at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:138)
>   at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:229)
>   at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:87)
>   at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:87)
>   at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
>   at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:124)
>   at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:133)
>   at org.apache.spark.sql.streaming.DataStreamReader.json(DataStreamReader.scala:181)
>   ... 50 elided
> Caused by: java.net.URISyntaxException: Relative path in absolute URI: 2017-01-06T20:33:45.255-analyticsqa-49569270507599054034141623773442922465540524816321216514.json
>   at java.net.URI.checkPath(URI.java:1823)
>   at java.net.URI.<init>(URI.java:745)
>   at org.apache.hadoop.fs.Path.initialize(Path.java:202)
>   ... 73 more
> {code}
> The file in question sits at the root of s3a://mybucket/path/
> {code}
> aws s3 ls s3://mybucket/path/
>                            PRE subfolder1/
>                            PRE subfolder2/
> ...
> 2017-01-06 20:33:46       1383 2017-01-06T20:33:45.255-analyticsqa-49569270507599054034141623773442922465540524816321216514.json
> ...
> {code}
> Removing the wildcard from the path makes it work, but it obviously misses all files in the subdirectories.
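The wildcard-vs-direct contrast in the report above is consistent with Hadoop Path's scheme heuristic: the globber rebuilds a Path from each bare child name, and a bare name whose first colon precedes any slash gets a bogus "scheme" carved off, whereas in a full s3a:// URI the first colon is the real scheme and the timestamp colons all sit after a slash. A small sketch of that heuristic in plain Scala (detectedScheme is a hypothetical helper; the filenames are shortened stand-ins):

```scala
// Path-style scheme detection (sketch): everything before the first ':'
// is taken as a URI scheme when that colon precedes the first '/'.
def detectedScheme(s: String): Option[String] = {
  val colon = s.indexOf(':')
  val slash = s.indexOf('/')
  if (colon != -1 && (slash == -1 || colon < slash)) Some(s.substring(0, colon))
  else None
}

// Bare child name, as the globber rebuilds it: the timestamp prefix is
// mistaken for a scheme, which later fails URI validation.
println(detectedScheme("2017-01-06T20:33:45.255-sample.json"))
// Some(2017-01-06T20)

// Full URI, as in the non-wildcard case: the real scheme is found and
// the timestamp colons, all after a slash, are left alone.
println(detectedScheme("s3a://mybucket/path/2017-01-06T20:33:45.255-sample.json"))
// Some(s3a)
```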



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
