Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 03:59:23 UTC
[jira] [Updated] (SPARK-22691) Custom HttpFileSystem, issue with question-marks in path

     [ https://issues.apache.org/jira/browse/SPARK-22691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-22691:
---------------------------------
    Labels: bulk-closed  (was: )

> Custom HttpFileSystem, issue with question-marks in path
> --------------------------------------------------------
>
>                 Key: SPARK-22691
>                 URL: https://issues.apache.org/jira/browse/SPARK-22691
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.2.0
>            Reporter: Jussi-Pekka Partanen
>            Priority: Minor
>              Labels: bulk-closed
>
> I'm working on a use case that requires loading several files from HTTP locations, in different file formats (CSV, JSON, etc.) and with different compression methods. I'm using a custom HTTP FileSystem implementation. I'm running into an issue where a question mark character (?) in the HTTP URL causes Spark to fail with the following error:
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: http://someserverhere.com/getresults?results=300&format=CSV;
> 	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:355)
> 	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:348)
> 	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
> 	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
> 	at scala.collection.immutable.List.foreach(List.scala:381)
> 	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
> 	at scala.collection.immutable.List.flatMap(List.scala:344)
> 	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:348)
> 	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
> 	at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
> 	at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
> 	at com.whereos.engine.Sessions.main(Sessions.java:320)
> If the HTTP URL doesn't contain a '?' character, for example http://someserverhere.com/getresults/results=3500/format=CSV, it works without any problems.
> The data is read with a very simple statement like this one below:
> 		Dataset<Row> df = context.read()
> 				.option("inferSchema", "true")
> 				.option("header", "true")
> 				.option("quote", "\"")
> 				.csv(url);
> The custom file system is registered by setting "fs.http.impl" to com.test.MyHttpFileSystem.class.getName()
> In MyHttpFileSystem, the calls to fs.exists() and fs.getFileStatus() differ between the two cases above (working and failing). The working case first checks whether URL/_spark_metadata exists (obviously not), and then correctly calls fs.exists('http://someserverhere.com/getresults/results=3500/format=CSV') and fs.getFileStatus('http://someserverhere.com/getresults/results=3500/format=CSV') with the full URL.
> The failing case also checks for _spark_metadata first, but the following calls to fs.exists() and fs.getFileStatus() no longer include the full path: the URL path element containing the '?' character is omitted, i.e. the system calls fs.exists('http://someserverhere.com/') instead of fs.exists('http://someserverhere.com/getresults?results=300&format=CSV').
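
One possible explanation, not confirmed anywhere in this report: '?' is a single-character wildcard in Hadoop glob patterns, and Spark's file-based readers expand glob patterns in input paths before loading. If that is what happens here, the failing URL would be treated as a pattern rather than a literal path, which would be consistent with the lookup starting from the parent location http://someserverhere.com/. The sketch below uses only the Java standard library to show that the failing URL contains a glob metacharacter while the working one does not. The class name, the method name, and the exact character set are assumptions made for illustration, not code taken from Spark or Hadoop.

```java
public class GlobCharCheck {
    // Assumed set of characters with special meaning in Hadoop glob patterns:
    // braces, brackets, '*', '?', and the backslash escape.
    private static final String GLOB_CHARS = "{}[]*?\\";

    /** Returns true if the given path string contains any glob metacharacter. */
    public static boolean containsGlobChar(String path) {
        for (int i = 0; i < path.length(); i++) {
            if (GLOB_CHARS.indexOf(path.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The two URLs from the report: the first fails, the second works.
        String[] urls = {
            "http://someserverhere.com/getresults?results=300&format=CSV",
            "http://someserverhere.com/getresults/results=3500/format=CSV"
        };
        for (String url : urls) {
            System.out.println(containsGlobChar(url) + "  " + url);
        }
    }
}
```

If this is indeed the cause, a check like this could be used on the caller's side to flag URLs that need special handling before they are passed to the reader. Whether escaping or encoding the '?' is honored end to end by the custom file system would need to be verified separately.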



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
