You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2015/08/25 15:28:46 UTC

[jira] [Commented] (SPARK-10185) Spark SQL does not handle comma separates paths on Hadoop FileSystem

    [ https://issues.apache.org/jira/browse/SPARK-10185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711256#comment-14711256 ] 

Apache Spark commented on SPARK-10185:
--------------------------------------

User 'koertkuipers' has created a pull request for this issue:
https://github.com/apache/spark/pull/8416

> Spark SQL does not handle comma separates paths on Hadoop FileSystem
> --------------------------------------------------------------------
>
>                 Key: SPARK-10185
>                 URL: https://issues.apache.org/jira/browse/SPARK-10185
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.1
>            Reporter: koert kuipers
>
> Spark SQL uses a Map[String, String] for data source settings. As a consequence the only way to pass in multiple paths (something that hadoop file input format supports) is to do pass in a comma separated list. For example:
> sqlContext.format("json").load("dir1,dir22")
> or
> sqlContext.format("json").option("path", "dir1,dir2").load
> However in this case ResolvedDataSource does not handle the comma delimited paths correctly for a HadoopFsRelationProvider. It treats the multiple comma delimited paths as single path.
> For example if i pass in for path "dir1,dir2" it will make dir1 qualified but ignore dir2 (presumably because it simply treats it as part of dir1). If globs are involved then it simply always returns an empty array of paths (because the glob with comma in it doesn’t match anything).
> I think its important to handle commas to pass in multiple paths, since the framework does not provide an alternative. In some cases like parquet the code simply bypasses ResolvedDataSource to support multiple paths but to me this is a workaround that should be discouraged.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org