Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2021/10/26 05:09:00 UTC

[jira] [Commented] (SPARK-37111) RDD file loading APIs throw URISyntaxException when there is a colon in the file path

    [ https://issues.apache.org/jira/browse/SPARK-37111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17434112#comment-17434112 ] 

Hyukjin Kwon commented on SPARK-37111:
--------------------------------------

I believe this comes from a limitation in Hadoop.
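For context, the limitation can be reproduced without Spark or Hadoop at all (a minimal sketch, not from the original thread): Hadoop's Path.initialize builds a java.net.URI from the parsed scheme and path, and the multi-argument URI constructors reject a non-null scheme combined with a relative path. A component like "test:me" is parsed as scheme "test" plus relative path "me", which triggers exactly the exception in the trace below:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriColonDemo {
    public static void main(String[] args) {
        try {
            // Mimics what Hadoop's Path.initialize does internally:
            // "test:me" is split into scheme "test" and path "me",
            // and java.net.URI rejects a scheme paired with a relative path.
            new URI("test", null, "me", null);
        } catch (URISyntaxException e) {
            System.out.println(e.getMessage());
            // Relative path in absolute URI: test:me
        }
    }
}
```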

> RDD file loading APIs throw URISyntaxException when there is a colon in the file path
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-37111
>                 URL: https://issues.apache.org/jira/browse/SPARK-37111
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.2.0
>            Reporter: Brady Tello
>            Priority: Major
>
> When a colon is present in a file path, many of Spark's RDD file loading APIs (textFile, wholeTextFiles, possibly others) throw a URISyntaxException. The following Scala code and stack trace were generated on my laptop running Spark 3.2.0. I've verified that this issue also affects Python and SQL, and I assume it also affects Java.
> {code:java}
> scala> val df = sc.wholeTextFiles("/Users/brady.tello/test:me/test.txt").take(1)
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: test:me
>   at org.apache.hadoop.fs.Path.initialize(Path.java:259)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:217)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:125)
>   at org.apache.hadoop.fs.Globber.doGlob(Globber.java:229)
>   at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
>   at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2034)
>   at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:303)
>   at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:274)
>   at org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:52)
>   at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:54)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>   at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>   at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1428)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
>   at org.apache.spark.rdd.RDD.take(RDD.scala:1422)
>   ... 47 elided
> Caused by: java.net.URISyntaxException: Relative path in absolute URI: test:me
>   at java.base/java.net.URI.checkPath(URI.java:1990)
>   at java.base/java.net.URI.<init>(URI.java:780)
>   at org.apache.hadoop.fs.Path.initialize(Path.java:256)
>   ... 68 more
> {code}
> Why can't I just avoid colons in my paths, you ask? I'm running Spark on top of an S3 environment in which users may only read and write data in their personal S3 workspace, and the path to that workspace contains a colon. Removing the colon would mean a major change to the authentication architecture shared by several apps outside our Spark app, so we don't really have the flexibility to remove it. Without a fix for this bug, those users simply cannot use the RDD APIs.
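As a side note for anyone hitting this with local files (a sketch only; it does not help the S3 case above, and the temp directory below is just for illustration): colons are ordinary characters in POSIX file names, and java.nio.file does no URI parsing, so such files can be read outside Hadoop and then handed to Spark with sc.parallelize, bypassing the Path/Globber code entirely:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ColonPathLocalRead {
    public static void main(String[] args) throws Exception {
        // A colon in a file name is fine for java.nio on POSIX systems,
        // where Hadoop's Path would throw URISyntaxException.
        Path dir = Files.createTempDirectory("spark37111");
        Path file = dir.resolve("test:me.txt");  // colon in the name
        Files.write(file, List.of("hello"));
        List<String> lines = Files.readAllLines(file);
        System.out.println(lines.get(0));  // prints "hello"
        // In a Spark job, these lines could be fed to sc.parallelize(lines)
        // instead of sc.textFile, sidestepping the RDD file-loading APIs.
    }
}
```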



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org