Posted to issues@spark.apache.org by "koert kuipers (JIRA)" <ji...@apache.org> on 2015/08/24 19:48:46 UTC

[jira] [Created] (SPARK-10185) Spark SQL does not handle comma separated paths on Hadoop FileSystem

koert kuipers created SPARK-10185:
-------------------------------------

             Summary: Spark SQL does not handle comma separated paths on Hadoop FileSystem
                 Key: SPARK-10185
                 URL: https://issues.apache.org/jira/browse/SPARK-10185
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.4.1
            Reporter: koert kuipers


Spark SQL uses a Map[String, String] for data source settings. As a consequence, the only way to pass in multiple paths (something the Hadoop FileInputFormat supports) is to pass in a comma separated list. For example:
sqlContext.read.format("json").load("dir1,dir2")
or
sqlContext.read.format("json").option("path", "dir1,dir2").load()

However, in this case ResolvedDataSource does not handle the comma delimited paths correctly for a HadoopFsRelationProvider. It treats the comma delimited paths as a single path.

For example, if I pass in "dir1,dir2" for the path, it will make dir1 qualified but ignore dir2 (presumably because it simply treats it as part of dir1). If globs are involved, it always returns an empty array of paths (because a glob containing a comma doesn't match anything).

I think it's important to handle commas to pass in multiple paths, since the framework does not provide an alternative. In some cases, like Parquet, the code simply bypasses ResolvedDataSource to support multiple paths, but to me this is a workaround that should be discouraged.
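A minimal sketch of the behavior being asked for: split the path setting on commas before each segment is made qualified, instead of treating the whole string as one path. The object and method names below are illustrative, not Spark's actual internals, and the sketch deliberately ignores the harder question of paths that legitimately contain commas.

```scala
// Hypothetical sketch: splitting a comma separated path setting into
// individual paths, as ResolvedDataSource would need to do before
// qualifying each one. Not Spark code; names are made up for illustration.
object PathSplitting {
  // "dir1,dir2" -> Seq("dir1", "dir2"); trims whitespace and drops
  // empty segments so "dir1, dir2," behaves sensibly too.
  def splitPaths(pathString: String): Seq[String] =
    pathString.split(",").map(_.trim).filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    // Multiple paths are separated instead of being treated as one path.
    assert(splitPaths("dir1,dir2") == Seq("dir1", "dir2"))
    // A single path is unaffected.
    assert(splitPaths("dir1") == Seq("dir1"))
    println(splitPaths("dir1, dir2,").mkString("|"))
  }
}
```

Each element of the resulting Seq could then be qualified (and glob-expanded) individually, which would also fix the empty-result case for globs described above.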




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org