Posted to reviews@spark.apache.org by sameeragarwal <gi...@git.apache.org> on 2017/05/03 22:20:05 UTC

[GitHub] spark pull request #17847: [SPARK-20590] Map default input data source forma...

GitHub user sameeragarwal opened a pull request:

    https://github.com/apache/spark/pull/17847

    [SPARK-20590] Map default input data source formats to inlined classes

    ## What changes were proposed in this pull request?
    
    One of the common usability problems around reading data in Spark (particularly CSV) is that different readers on the classpath can often conflict with one another.
    
    As an example, if someone launches a 2.x Spark shell with the spark-csv package on the classpath, Spark currently fails in an extremely unfriendly way (see https://github.com/databricks/spark-csv/issues/367):
    
    ```scala
    ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
    scala> val df = spark.read.csv("/foo/bar.csv")
    java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name.
      at scala.sys.package$.error(package.scala:27)
      at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:574)
      at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:85)
      at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:85)
      at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:295)
      at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
      at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
      at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
      ... 48 elided
    ```
    
    This patch proposes a simple way of fixing this error by always mapping default input data source formats to inlined classes (that exist in Spark):
    
    ```scala
    ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
    scala> val df = spark.read.csv("/foo/bar.csv")
    df: org.apache.spark.sql.DataFrame = [_c0: string]
    ```
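
    The core of the fix can be sketched as a short-name lookup consulted before any classpath scan. The object and entries below are illustrative, not the exact Spark internals:

    ```scala
    // Illustrative sketch (hypothetical names): built-in short names resolve
    // directly to Spark's inlined classes, so a third-party package that
    // registers the same short name can no longer cause a lookup conflict.
    object ShortNameResolution {
      private val builtinShortNames: Map[String, String] = Map(
        "csv" -> "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat",
        "json" -> "org.apache.spark.sql.execution.datasources.json.JsonFileFormat",
        "parquet" -> "org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat")

      // Built-in short names win; any other provider string falls through
      // unchanged and is resolved on the classpath as before.
      def resolve(provider: String): String =
        builtinShortNames.getOrElse(provider, provider)
    }
    ```

    Under this sketch, `resolve("csv")` always yields the inlined `CSVFileFormat`, regardless of what else is on the classpath.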
    
    ## How was this patch tested?
    
    Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sameeragarwal/spark csv-fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17847.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17847
    
----
commit 1af4675707a6fe1a1acaff0a30e8ef6c2ed5ff46
Author: Sameer Agarwal <sa...@cs.berkeley.edu>
Date:   2017-05-03T20:57:16Z

    map short names to correct class

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17847: [SPARK-20590] Map default input data source forma...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17847#discussion_r114683317
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
    @@ -483,35 +483,42 @@ case class DataSource(
     
     object DataSource {
     
    +  private val jdbc = classOf[JdbcRelationProvider].getCanonicalName
    +  private val json = classOf[JsonFileFormat].getCanonicalName
    +  private val parquet = classOf[ParquetFileFormat].getCanonicalName
    +  private val csv = classOf[CSVFileFormat].getCanonicalName
    +  private val libsvm = "org.apache.spark.ml.source.libsvm.LibSVMFileFormat"
    +  private val orc = "org.apache.spark.sql.hive.orc.OrcFileFormat"
    +
       /** A map to maintain backward compatibility in case we move data sources around. */
    -  private val backwardCompatibilityMap: Map[String, String] = {
    -    val jdbc = classOf[JdbcRelationProvider].getCanonicalName
    -    val json = classOf[JsonFileFormat].getCanonicalName
    -    val parquet = classOf[ParquetFileFormat].getCanonicalName
    -    val csv = classOf[CSVFileFormat].getCanonicalName
    -    val libsvm = "org.apache.spark.ml.source.libsvm.LibSVMFileFormat"
    -    val orc = "org.apache.spark.sql.hive.orc.OrcFileFormat"
    -
    -    Map(
    -      "org.apache.spark.sql.jdbc" -> jdbc,
    -      "org.apache.spark.sql.jdbc.DefaultSource" -> jdbc,
    -      "org.apache.spark.sql.execution.datasources.jdbc.DefaultSource" -> jdbc,
    -      "org.apache.spark.sql.execution.datasources.jdbc" -> jdbc,
    -      "org.apache.spark.sql.json" -> json,
    -      "org.apache.spark.sql.json.DefaultSource" -> json,
    -      "org.apache.spark.sql.execution.datasources.json" -> json,
    -      "org.apache.spark.sql.execution.datasources.json.DefaultSource" -> json,
    -      "org.apache.spark.sql.parquet" -> parquet,
    -      "org.apache.spark.sql.parquet.DefaultSource" -> parquet,
    -      "org.apache.spark.sql.execution.datasources.parquet" -> parquet,
    -      "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
    -      "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
    -      "org.apache.spark.sql.hive.orc" -> orc,
    -      "org.apache.spark.ml.source.libsvm.DefaultSource" -> libsvm,
    -      "org.apache.spark.ml.source.libsvm" -> libsvm,
    -      "com.databricks.spark.csv" -> csv
    -    )
    -  }
    +  private val backwardCompatibilityMap: Map[String, String] = Map(
    +    "org.apache.spark.sql.jdbc" -> jdbc,
    +    "org.apache.spark.sql.jdbc.DefaultSource" -> jdbc,
    +    "org.apache.spark.sql.execution.datasources.jdbc.DefaultSource" -> jdbc,
    +    "org.apache.spark.sql.execution.datasources.jdbc" -> jdbc,
    +    "org.apache.spark.sql.json" -> json,
    +    "org.apache.spark.sql.json.DefaultSource" -> json,
    +    "org.apache.spark.sql.execution.datasources.json" -> json,
    +    "org.apache.spark.sql.execution.datasources.json.DefaultSource" -> json,
    +    "org.apache.spark.sql.parquet" -> parquet,
    +    "org.apache.spark.sql.parquet.DefaultSource" -> parquet,
    +    "org.apache.spark.sql.execution.datasources.parquet" -> parquet,
    +    "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
    +    "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
    +    "org.apache.spark.sql.hive.orc" -> orc,
    +    "org.apache.spark.ml.source.libsvm.DefaultSource" -> libsvm,
    +    "org.apache.spark.ml.source.libsvm" -> libsvm,
    +    "com.databricks.spark.csv" -> csv
    +  )
    +
    +  private val builtinShortNamesMap: Map[String, String] = Map(
    --- End diff --
    
    It would probably be nicer to explain why this is needed, with a small comment on why the short names of internal data sources should be mapped to fully qualified names.




[GitHub] spark issue #17847: [SPARK-20590] Map default input data source formats to i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17847
  
    **[Test build #76427 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76427/testReport)** for PR 17847 at commit [`1af4675`](https://github.com/apache/spark/commit/1af4675707a6fe1a1acaff0a30e8ef6c2ed5ff46).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17847: [SPARK-20590] Map default input data source formats to i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17847
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76427/
    Test PASSed.




[GitHub] spark issue #17847: [SPARK-20590] Map default input data source formats to i...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the issue:

    https://github.com/apache/spark/pull/17847
  
    I'm closing this in favor of https://github.com/apache/spark/pull/17916




[GitHub] spark issue #17847: [SPARK-20590] Map default input data source formats to i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17847
  
    @sameeragarwal BTW, should we add `hive`, `kafka`, `text` and `console` too?




[GitHub] spark issue #17847: [SPARK-20590] Map default input data source formats to i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17847
  
    **[Test build #76427 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76427/testReport)** for PR 17847 at commit [`1af4675`](https://github.com/apache/spark/commit/1af4675707a6fe1a1acaff0a30e8ef6c2ed5ff46).




[GitHub] spark pull request #17847: [SPARK-20590] Map default input data source forma...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal closed the pull request at:

    https://github.com/apache/spark/pull/17847




[GitHub] spark pull request #17847: [SPARK-20590] Map default input data source forma...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17847#discussion_r114739023
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
    @@ -523,7 +530,8 @@ object DataSource {
     
       /** Given a provider name, look up the data source class definition. */
       def lookupDataSource(provider: String): Class[_] = {
    -    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
    +    val provider1 = builtinShortNamesMap.getOrElse(provider,
    +      backwardCompatibilityMap.getOrElse(provider, provider))
    --- End diff --
    
    And I guess these should be case-insensitive for short names.




[GitHub] spark issue #17847: [SPARK-20590] Map default input data source formats to i...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the issue:

    https://github.com/apache/spark/pull/17847
  
    cc @cloud-fan @HyukjinKwon thoughts?




[GitHub] spark pull request #17847: [SPARK-20590] Map default input data source forma...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17847#discussion_r114683203
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
    @@ -523,7 +530,8 @@ object DataSource {
     
       /** Given a provider name, look up the data source class definition. */
       def lookupDataSource(provider: String): Class[_] = {
    -    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
    +    val provider1 = builtinShortNamesMap.getOrElse(provider,
    +      backwardCompatibilityMap.getOrElse(provider, provider))
    --- End diff --
    
    Should we maybe combine both `builtinShortNamesMap` and `backwardCompatibilityMap` and use a single `getOrElse`? The nested calls seem a bit confusing to read.
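
    One possible shape for that combination (a sketch with abbreviated entries, not the actual Spark code):

    ```scala
    object CombinedLookup {
      private val backwardCompatibilityMap: Map[String, String] = Map(
        "com.databricks.spark.csv" ->
          "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")

      private val builtinShortNamesMap: Map[String, String] = Map(
        "csv" -> "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")

      // `++` keeps the right-hand operand's value on duplicate keys, so
      // listing builtinShortNamesMap second gives built-in short names
      // precedence over backward-compatibility aliases.
      private val providerMap: Map[String, String] =
        backwardCompatibilityMap ++ builtinShortNamesMap

      // A single getOrElse replaces the nested pair in the diff above.
      def lookup(provider: String): String =
        providerMap.getOrElse(provider, provider)
    }
    ```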




[GitHub] spark issue #17847: [SPARK-20590] Map default input data source formats to i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17847
  
    Ah, so does this always give Spark's internal data sources higher precedence instead of failing fast? I support the idea, but we might need to print a warning if multiple sources are detected for the same identifier (I don't think that is recommended...). Let me check today whether anything is missing.
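
    The warn-instead-of-fail behavior suggested here could be sketched as follows; the object, logger name, and signature are all illustrative, not Spark's actual code:

    ```scala
    import java.util.logging.Logger

    object LenientLookup {
      private val log = Logger.getLogger("DataSourceLookup")

      private val builtinShortNames: Map[String, String] = Map(
        "csv" -> "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")

      // Given a short name and the provider classes found on the classpath,
      // prefer the built-in implementation, but warn when an external source
      // shadows the same identifier rather than failing fast.
      def resolve(shortName: String, candidates: Seq[String]): String =
        builtinShortNames.get(shortName) match {
          case Some(builtin) =>
            if (candidates.exists(_ != builtin)) {
              log.warning(
                s"Multiple sources found for '$shortName'; using built-in $builtin")
            }
            builtin
          case None =>
            // Not a built-in short name: fall back to whatever was found.
            candidates.headOption.getOrElse(shortName)
        }
    }
    ```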




[GitHub] spark issue #17847: [SPARK-20590] Map default input data source formats to i...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17847
  
    @sameeragarwal BTW, should we add `hive`, `kafka`, `socket`, `text` and `console` too?




[GitHub] spark pull request #17847: [SPARK-20590] Map default input data source forma...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17847#discussion_r114765388
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
    @@ -523,7 +530,8 @@ object DataSource {
     
       /** Given a provider name, look up the data source class definition. */
       def lookupDataSource(provider: String): Class[_] = {
    -    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
    +    val provider1 = builtinShortNamesMap.getOrElse(provider,
    +      backwardCompatibilityMap.getOrElse(provider, provider))
    --- End diff --
    
    yea, short names should be case insensitive
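
    Case-insensitive short-name matching could look like this sketch (the normalization strategy is illustrative):

    ```scala
    import java.util.Locale

    object CaseInsensitiveShortNames {
      private val builtinShortNames: Map[String, String] = Map(
        "csv" -> "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")

      // Normalize the provider with Locale.ROOT before the short-name lookup,
      // so "CSV" and "Csv" resolve like "csv". Fully qualified class names
      // miss the map and fall through untouched, preserving their case.
      def resolve(provider: String): String =
        builtinShortNames.getOrElse(provider.toLowerCase(Locale.ROOT), provider)
    }
    ```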




[GitHub] spark issue #17847: [SPARK-20590] Map default input data source formats to i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17847
  
    Merged build finished. Test PASSed.

