You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by dongjoon-hyun <gi...@git.apache.org> on 2018/08/17 12:45:07 UTC

[GitHub] spark pull request #22134: [SPARK-25143][SQL] Support data source name mappi...

GitHub user dongjoon-hyun opened a pull request:

    https://github.com/apache/spark/pull/22134

    [SPARK-25143][SQL] Support data source name mapping configuration

    ## What changes were proposed in this pull request?
    
    Currently, for better UX, Apache Spark provides data source backward compatibility by using the predefined built-in `backwardCompatibilityMap`.
    
    - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L564
    
    Although this mapping table is maintained carefully, but it's not flexible when we have multiple implementations like Avro/CSV/ORC. This introduced many additional options to decide which implementation will be use for the registered `shortName` (like `csv`, `orc`, `avro`).
    
    This PR aims to extend Spark backward-compatibility-mapping capability by allowing users provide a custom mapping in a general manner as configuration. If a user faces some issue with the provided mapping, he/she can override it by configuration. Also, this will reduce the complexity in the code, too. 
    
    ## How was this patch tested?
    
    Pass the Jenkins with a newly added test case.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-25143

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22134.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22134
    
----
commit e477a8ed6d2ea331be357a6fbbb3d55c504971b1
Author: Dongjoon Hyun <do...@...>
Date:   2018-08-17T12:27:32Z

    [SPARK-25143][SQL] Support data source name mapping configuration

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22134: [SPARK-25143][SQL] Support data source name mappi...

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22134#discussion_r210968896
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
    @@ -609,7 +609,13 @@ object DataSource extends Logging {
     
       /** Given a provider name, look up the data source class definition. */
       def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
    -    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
    +    val customBackwardCompatibilityMap =
    +      conf.getAllConfs
    +        .filter(_._1.startsWith("spark.sql.datasource.map"))
    +        .map{ case (k, v) => (k.replaceFirst("^spark.sql.datasource.map.", ""), v) }
    +    val compatibilityMap = backwardCompatibilityMap ++ customBackwardCompatibilityMap
    --- End diff --
    
    I have the same concern as @tgravescs . It seems tricky to unset the default mapping.
    
    For example, if by default we map `com.databricks.spark.avro` to  internal avro, then to unset it we have to set
    `spark.sql.datasource.map.com.databricks.spark.avro -> com.databricks.spark.avro` .
    
    Currently we only have to deal with Avro and CSV, so I think it is ok to have one single straightforward configuration like https://github.com/apache/spark/pull/22133 proposed.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22134: [SPARK-25143][SQL] Support data source name mapping conf...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22134
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22134: [SPARK-25143][SQL] Support data source name mapping conf...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22134
  
    **[Test build #94902 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94902/testReport)** for PR 22134 at commit [`e477a8e`](https://github.com/apache/spark/commit/e477a8ed6d2ea331be357a6fbbb3d55c504971b1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22134: [SPARK-25143][SQL] Support data source name mapping conf...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22134
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22134: [SPARK-25143][SQL] Support data source name mappi...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun closed the pull request at:

    https://github.com/apache/spark/pull/22134


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22134: [SPARK-25143][SQL] Support data source name mapping conf...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22134
  
    Ping, @gengliangwang . I made a different approach from #22133 . This will be more general.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22134: [SPARK-25143][SQL] Support data source name mapping conf...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22134
  
    I got it. I'll close this approach.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22134: [SPARK-25143][SQL] Support data source name mapping conf...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22134
  
    Retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22134: [SPARK-25143][SQL] Support data source name mapping conf...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22134
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2273/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22134: [SPARK-25143][SQL] Support data source name mapping conf...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22134
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94902/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22134: [SPARK-25143][SQL] Support data source name mapping conf...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22134
  
    **[Test build #94889 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94889/testReport)** for PR 22134 at commit [`e477a8e`](https://github.com/apache/spark/commit/e477a8ed6d2ea331be357a6fbbb3d55c504971b1).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22134: [SPARK-25143][SQL] Support data source name mapping conf...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22134
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94889/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22134: [SPARK-25143][SQL] Support data source name mapping conf...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22134
  
    **[Test build #94902 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94902/testReport)** for PR 22134 at commit [`e477a8e`](https://github.com/apache/spark/commit/e477a8ed6d2ea331be357a6fbbb3d55c504971b1).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22134: [SPARK-25143][SQL] Support data source name mapping conf...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22134
  
    @gengliangwang and @gatorsmile .
    
    First of all, the internal 3rd party mapping should be removed in Apache Spark 3.0. Please consider that. Also, with this PR, we can remove `com.databricks.spark.avro` in Spark 2.4, too. If you want, I can remove that in this PR, but I just want to keep this PR simplest.
    
    And, we don't need to `unset` here. The purpose of this PR is *RESET* to override the built-in `backwardCompatibilityMap`. Before this PR, we don't have any control over the built-in mapping. We need to allow the users can take a full controlability.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22134: [SPARK-25143][SQL] Support data source name mapping conf...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22134
  
    **[Test build #94889 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94889/testReport)** for PR 22134 at commit [`e477a8e`](https://github.com/apache/spark/commit/e477a8ed6d2ea331be357a6fbbb3d55c504971b1).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22134: [SPARK-25143][SQL] Support data source name mapping conf...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/22134
  
    I think it's premature to introduce this. The extra layer of abstraction actually makes it more difficult to reason about what's going on. We don't have that many data sources that require flexibility here, and we can always add the flexibility if needed later.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22134: [SPARK-25143][SQL] Support data source name mappi...

Posted by tgravescs <gi...@git.apache.org>.

Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22134#discussion_r210910385
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
    @@ -609,7 +609,13 @@ object DataSource extends Logging {
     
       /** Given a provider name, look up the data source class definition. */
       def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
    -    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
    +    val customBackwardCompatibilityMap =
    +      conf.getAllConfs
    +        .filter(_._1.startsWith("spark.sql.datasource.map"))
    +        .map{ case (k, v) => (k.replaceFirst("^spark.sql.datasource.map.", ""), v) }
    +    val compatibilityMap = backwardCompatibilityMap ++ customBackwardCompatibilityMap
    --- End diff --
    
    so at this point we are leaving com.databricks.spark.avro -> internal avro as the default and users have to set it back to com.databricks.spark.avro or do they set it empty?   Although if set to empty I think it will return empty below which will cause an issue.
    
    We should have a test case for empty and perhaps have a check for it below in that case.
    
    What about documentation? Is there a jira for documenting all avro stuff?  If we do leave it as default we definitely want a release note with change in behavior.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22134: [SPARK-25143][SQL] Support data source name mapping conf...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22134
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22134: [SPARK-25143][SQL] Support data source name mapping conf...

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/22134
  
    > The purpose of this PR is RESET to override the built-in backwardCompatibilityMap. Before this PR, we don't have any control over the built-in mapping. We need to allow the users can take a full controlability.
    
    I don't think it is user friendly for such RESET. 
    Also, besides the Databricks avro/csv repo, users can just provide their full class name in specifying the file format, or short name if it doesn't not conflict with built-in ones. I don't think they need such control.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22134: [SPARK-25143][SQL] Support data source name mapping conf...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22134
  
    @gengliangwang . As of now, it's not possible to use that full class name for the Hive table saved as `com.databricks.spark.avro`. So, we are trying to find this way (this PR) or that way (maybe your PR).
    
    > Also, besides the Databricks avro/csv repo, users can just provide their full class name in specifying the file format, or short name if it doesn't not conflict with built-in ones. I don't think they need such control.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22134: [SPARK-25143][SQL] Support data source name mapping conf...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22134
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22134: [SPARK-25143][SQL] Support data source name mapping conf...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22134
  
    Do we need it in the current stage?  Regarding UX, it looks complex to end users. I am unable to remember the names. It is very easy to provide a wrong class name. 
    ```
    "spark.sql.datasource.map.org.apache.spark.sql.avro" -> testDataSource.getCanonicalName,
    "spark.sql.datasource.map.com.databricks.spark.csv" -> testDataSource.getCanonicalName,
    "spark.sql.datasource.map.com.databricks.spark.avro" -> testDataSource.getCanonicalName) 
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22134: [SPARK-25143][SQL] Support data source name mappi...

Posted by tgravescs <gi...@git.apache.org>.

Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22134#discussion_r210916941
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
    @@ -609,7 +609,13 @@ object DataSource extends Logging {
     
       /** Given a provider name, look up the data source class definition. */
       def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
    -    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
    +    val customBackwardCompatibilityMap =
    +      conf.getAllConfs
    +        .filter(_._1.startsWith("spark.sql.datasource.map"))
    +        .map{ case (k, v) => (k.replaceFirst("^spark.sql.datasource.map.", ""), v) }
    +    val compatibilityMap = backwardCompatibilityMap ++ customBackwardCompatibilityMap
    --- End diff --
    
    answering one of my own questions found avro docs here: https://github.com/apache/spark/pull/22121


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22134: [SPARK-25143][SQL] Support data source name mappi...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22134#discussion_r210967474
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
    @@ -609,7 +609,13 @@ object DataSource extends Logging {
     
       /** Given a provider name, look up the data source class definition. */
       def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
    -    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
    +    val customBackwardCompatibilityMap =
    +      conf.getAllConfs
    +        .filter(_._1.startsWith("spark.sql.datasource.map"))
    +        .map{ case (k, v) => (k.replaceFirst("^spark.sql.datasource.map.", ""), v) }
    +    val compatibilityMap = backwardCompatibilityMap ++ customBackwardCompatibilityMap
    --- End diff --
    
    If this is merged, we can remove the internal Avro mapping code and put that into documents before branch cut.
    
    > so at this point we are leaving com.databricks.spark.avro -> internal avro as the default and users have to set it back to com.databricks.spark.avro or do they set it empty?
    
    This will be better than another option like the following `avro` specific option.
    ```scala
    val ENABLE_AVRO_BACKWARD_COMPATIBILITY = 
      buildConf("spark.sql.avro.backwardCompatibility")
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22134: [SPARK-25143][SQL] Support data source name mapping conf...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22134
  
    Could you review this, @tgravescs , @cloud-fan , @gatorsmile , @HyukjinKwon ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22134: [SPARK-25143][SQL] Support data source name mapping conf...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22134
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2285/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org