You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2016/07/06 17:20:51 UTC

[GitHub] spark pull request #14071: [SPARK-16397][SQL][WIP] make CatalogTable more ge...

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/14071

    [SPARK-16397][SQL][WIP] make CatalogTable more general and less hive specific

    ## What changes were proposed in this pull request?
    
    `CatalogTable` is an internal interface to represent table metadata in Spark SQL, and should be able to express both hive serde table and data source table. Due to some hive hacks, data source table only use table properties to store its information, and the `CatalogStorageFormat` is very hive specific as it's only used by hive serde table currently.
    
    However this is not a good design, the `CatalogTable` is not general enough to express data source table if we remove hive hacks in the future.
    
    This PR is trying to make `CatalogStorageFormat` general enough to handle all types of tables in Spark SQL.
    
    ## How was this patch tested?
    
    existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark ds-table

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14071.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14071
    
----
commit 4ef7dd097f43947f7822ccb66ad0af1c5513ed43
Author: Wenchen Fan <we...@databricks.com>
Date:   2016-07-06T17:08:43Z

    make CatalogTable more general and less hive specific

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14071: [SPARK-16397][SQL] make CatalogTable more general...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14071#discussion_r70004411
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
    @@ -939,42 +940,33 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {
         // to include the partition columns here explicitly
         val schema = cols ++ partitionCols
     
    -    // Storage format
    -    val defaultStorage: CatalogStorageFormat = {
    -      val defaultStorageType = conf.getConfString("hive.default.fileformat", "textfile")
    -      val defaultHiveSerde = HiveSerDe.sourceToSerDe(defaultStorageType, conf)
    -      CatalogStorageFormat(
    -        locationUri = None,
    -        inputFormat = defaultHiveSerde.flatMap(_.inputFormat)
    -          .orElse(Some("org.apache.hadoop.mapred.TextInputFormat")),
    -        outputFormat = defaultHiveSerde.flatMap(_.outputFormat)
    -          .orElse(Some("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat")),
    -        // Note: Keep this unspecified because we use the presence of the serde to decide
    -        // whether to convert a table created by CTAS to a datasource table.
    -        serde = None,
    -        compressed = false,
    -        serdeProperties = Map())
    -    }
         validateRowFormatFileFormat(ctx.rowFormat, ctx.createFileFormat, ctx)
    -    val fileStorage = Option(ctx.createFileFormat).map(visitCreateFileFormat)
    -      .getOrElse(CatalogStorageFormat.empty)
    -    val rowStorage = Option(ctx.rowFormat).map(visitRowFormat)
    -      .getOrElse(CatalogStorageFormat.empty)
    -    val location = Option(ctx.locationSpec).map(visitLocationSpec)
    +    var storage = CatalogStorageFormat(
    +      locationUri = Option(ctx.locationSpec).map(visitLocationSpec),
    +      provider = Some("hive"),
    +      properties = Map.empty)
    +    Option(ctx.createFileFormat).foreach(ctx => storage = getFileFormat(ctx, storage))
    +    Option(ctx.rowFormat).foreach(ctx => storage = getRowFormat(ctx, storage))
    --- End diff --
    
    uh,,, `defaultStorage.serde` is still `None`. Then, I think it should be fine. Maybe we just need to add a comment to explain how we determine the value of `serde`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14071: [SPARK-16397][SQL] make CatalogTable more general...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14071#discussion_r70173359
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala ---
    @@ -45,35 +45,28 @@ case class CatalogFunction(
      */
     case class CatalogStorageFormat(
         locationUri: Option[String],
    -    inputFormat: Option[String],
    -    outputFormat: Option[String],
    -    serde: Option[String],
    -    compressed: Boolean,
    -    serdeProperties: Map[String, String]) {
    +    provider: Option[String],
    --- End diff --
    
    To remove `CreateHiveTableAsSelectLogicalPlan`, we can add a `CatalogTable` member into the signature of `CreateTableUsingAsSelect`. After code changes, the signature of `CreateTableUsingAsSelect` becomes
    ```
        tableDesc: CatalogTable,
        provider: String,
        child: LogicalPlan,
        mode: SaveMode
    ```
    
    After this PR is merged, we also can remove `provider` from the signature!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14071: [SPARK-16397][SQL] make CatalogTable more general and le...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14071
  
    **[Test build #62247 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62247/consoleFull)** for PR 14071 at commit [`632d9a7`](https://github.com/apache/spark/commit/632d9a716fe75e8e85206ffcf9893b6ab477d048).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14071: [SPARK-16397][SQL][WIP] make CatalogTable more general a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14071
  
    **[Test build #61858 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61858/consoleFull)** for PR 14071 at commit [`4ef7dd0`](https://github.com/apache/spark/commit/4ef7dd097f43947f7822ccb66ad0af1c5513ed43).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14071: [SPARK-16397][SQL] make CatalogTable more general and le...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14071
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62247/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14071: [SPARK-16397][SQL] make CatalogTable more general and le...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/14071
  
    cc @yhuai @gatorsmile @liancheng @clockfly 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14071: [SPARK-16397][SQL] make CatalogTable more general...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14071#discussion_r70096412
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala ---
    @@ -45,35 +45,28 @@ case class CatalogFunction(
      */
     case class CatalogStorageFormat(
         locationUri: Option[String],
    -    inputFormat: Option[String],
    -    outputFormat: Option[String],
    -    serde: Option[String],
    -    compressed: Boolean,
    -    serdeProperties: Map[String, String]) {
    +    provider: Option[String],
    --- End diff --
    
    I see. Thank you! 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14071: [SPARK-16397][SQL] make CatalogTable more general and le...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14071
  
    **[Test build #62247 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62247/consoleFull)** for PR 14071 at commit [`632d9a7`](https://github.com/apache/spark/commit/632d9a716fe75e8e85206ffcf9893b6ab477d048).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14071: [SPARK-16397][SQL] make CatalogTable more general...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14071#discussion_r70634823
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala ---
    @@ -42,38 +42,54 @@ case class CatalogFunction(
     
     /**
      * Storage format, used to describe how a partition or a table is stored.
    + *
    + * @param locationUri the storage location of the table/partition, can be None if it has no storage,
    + *                    e.g. view, or it's not file-based, e.g. JDBC data source table.
    + * @param properties used to store some extra information for the storage.
      */
     case class CatalogStorageFormat(
         locationUri: Option[String],
    -    inputFormat: Option[String],
    -    outputFormat: Option[String],
    -    serde: Option[String],
    -    compressed: Boolean,
    --- End diff --
    
    We never persist this flag into external catalog.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14071: [SPARK-16397][SQL] make CatalogTable more general and le...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14071
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14071: [SPARK-16397][SQL] make CatalogTable more general...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14071#discussion_r70002456
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala ---
    @@ -45,35 +45,28 @@ case class CatalogFunction(
      */
     case class CatalogStorageFormat(
         locationUri: Option[String],
    -    inputFormat: Option[String],
    -    outputFormat: Option[String],
    -    serde: Option[String],
    -    compressed: Boolean,
    -    serdeProperties: Map[String, String]) {
    +    provider: Option[String],
    --- End diff --
    
    It sounds like that `provider` is not being used? Do we really need this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14071: [SPARK-16397][SQL] make CatalogTable more general...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14071#discussion_r70020600
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala ---
    @@ -45,35 +45,28 @@ case class CatalogFunction(
      */
     case class CatalogStorageFormat(
         locationUri: Option[String],
    -    inputFormat: Option[String],
    -    outputFormat: Option[String],
    -    serde: Option[String],
    -    compressed: Boolean,
    -    serdeProperties: Map[String, String]) {
    +    provider: Option[String],
    --- End diff --
    
    For now we only use it to assert a table is hive table in some places, but this will be very useful when we put data source table information into catalog table instead of table properties.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14071: [SPARK-16397][SQL] make CatalogTable more general and le...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14071
  
    **[Test build #61914 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61914/consoleFull)** for PR 14071 at commit [`4d65609`](https://github.com/apache/spark/commit/4d65609ae71b2e30cea7b39e1b5a1a9ecfdd2de4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14071: [SPARK-16397][SQL] make CatalogTable more general...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan closed the pull request at:

    https://github.com/apache/spark/pull/14071


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14071: [SPARK-16397][SQL][WIP] make CatalogTable more ge...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14071#discussion_r69796816
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
    @@ -1050,19 +1032,38 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {
       }
     
       /**
    -   * Create a [[CatalogStorageFormat]] for creating tables.
    +   * Parse the `createFileFormat` and put the storage information into the passed in map.
        *
    -   * Format: STORED AS ...
    +   * Example format:
    +   * {{{
    +   *   STORED AS INPUTFORMAT formatName1 OUTPUTFORMAT formatName2
    +   * }}}
    +   *
    +   * OR
    +   *
    +   * {{{
    +   *   STORED AS fileFormatName
    +   * }}}
        */
    -  override def visitCreateFileFormat(
    -      ctx: CreateFileFormatContext): CatalogStorageFormat = withOrigin(ctx) {
    +  private def getFileFormat(
    +      ctx: CreateFileFormatContext,
    +      storageProps: mutable.HashMap[String, String]): Unit = {
         (ctx.fileFormat, ctx.storageHandler) match {
           // Expected format: INPUTFORMAT input_format OUTPUTFORMAT output_format
           case (c: TableFileFormatContext, null) =>
    -        visitTableFileFormat(c)
    +        Option(string(c.inFmt)).foreach(inFmt => storageProps += "inputFormat" -> inFmt)
    --- End diff --
    
    `storageProps += "inputFormat" -> string(c.inFmt)`? `string(c.inFmt)` throws an NPE if it is null; so option is redundant...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14071: [SPARK-16397][SQL] make CatalogTable more general...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14071#discussion_r70012893
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
    @@ -403,17 +400,18 @@ object CreateDataSourceTableUtils extends Logging {
           assert(partitionColumns.isEmpty)
           assert(relation.partitionSchema.isEmpty)
     
    +      var storage = CatalogStorageFormat(
    +        locationUri = None,
    --- End diff --
    
    oh it's a mistake...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14071: [SPARK-16397][SQL] make CatalogTable more general and le...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14071
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62249/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14071: [SPARK-16397][SQL] make CatalogTable more general and le...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14071
  
    **[Test build #61914 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61914/consoleFull)** for PR 14071 at commit [`4d65609`](https://github.com/apache/spark/commit/4d65609ae71b2e30cea7b39e1b5a1a9ecfdd2de4).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14071: [SPARK-16397][SQL] make CatalogTable more general and le...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14071
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14071: [SPARK-16397][SQL][WIP] make CatalogTable more general a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14071
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61858/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14071: [SPARK-16397][SQL] make CatalogTable more general and le...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14071
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14071: [SPARK-16397][SQL][WIP] make CatalogTable more ge...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14071#discussion_r69837833
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala ---
    @@ -45,35 +45,28 @@ case class CatalogFunction(
      */
     case class CatalogStorageFormat(
         locationUri: Option[String],
    -    inputFormat: Option[String],
    -    outputFormat: Option[String],
    -    serde: Option[String],
    -    compressed: Boolean,
    -    serdeProperties: Map[String, String]) {
    +    provider: Option[String],
    --- End diff --
    
    it'd be good to document in classdoc what these mean, and in what cases can it be None, and the behavior for hive.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14071: [SPARK-16397][SQL] make CatalogTable more general and le...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14071
  
    **[Test build #61959 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61959/consoleFull)** for PR 14071 at commit [`b218bb7`](https://github.com/apache/spark/commit/b218bb72c8c81192e11db2baed9321578c61cb5e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14071: [SPARK-16397][SQL] make CatalogTable more general and le...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14071
  
    **[Test build #62249 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62249/consoleFull)** for PR 14071 at commit [`d18ac59`](https://github.com/apache/spark/commit/d18ac598b27083962b4a783996a5bf39760b0369).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14071: [SPARK-16397][SQL] make CatalogTable more general and le...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14071
  
    **[Test build #61951 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61951/consoleFull)** for PR 14071 at commit [`9efe46e`](https://github.com/apache/spark/commit/9efe46e5156f95fb034a50066f14abbaf8e508ab).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14071: [SPARK-16397][SQL] make CatalogTable more general and le...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14071
  
    **[Test build #62249 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62249/consoleFull)** for PR 14071 at commit [`d18ac59`](https://github.com/apache/spark/commit/d18ac598b27083962b4a783996a5bf39760b0369).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14071: [SPARK-16397][SQL] make CatalogTable more general...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14071#discussion_r69997365
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
    @@ -403,17 +400,18 @@ object CreateDataSourceTableUtils extends Logging {
           assert(partitionColumns.isEmpty)
           assert(relation.partitionSchema.isEmpty)
     
    +      var storage = CatalogStorageFormat(
    +        locationUri = None,
    --- End diff --
    
    Any reason why this `locationUri` is set to `None`? It sounds like the original value is `Some(relation.location.paths.map(_.toUri.toString).head`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14071: [SPARK-16397][SQL] make CatalogTable more general and le...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/14071
  
    I'm closing it as it turns out that we still need these hive specific stuff in CatalogStorageFormat. Hiding them in properties seems don't have much benefit, what we need is just adding a 'provider' field in CatalogTable, which is done in #14155


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14071: [SPARK-16397][SQL] make CatalogTable more general and le...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14071
  
    **[Test build #61959 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61959/consoleFull)** for PR 14071 at commit [`b218bb7`](https://github.com/apache/spark/commit/b218bb72c8c81192e11db2baed9321578c61cb5e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14071: [SPARK-16397][SQL] make CatalogTable more general and le...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14071
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14071: [SPARK-16397][SQL] make CatalogTable more general and le...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14071
  
    **[Test build #61951 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61951/consoleFull)** for PR 14071 at commit [`9efe46e`](https://github.com/apache/spark/commit/9efe46e5156f95fb034a50066f14abbaf8e508ab).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14071: [SPARK-16397][SQL] make CatalogTable more general...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14071#discussion_r70017144
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala ---
    @@ -162,25 +147,28 @@ private[hive] case class MetastoreRelation(
           tPartition.setTableName(tableName)
           tPartition.setValues(partitionKeys.map(a => p.spec(a.name)).asJava)
     
    -      val sd = new org.apache.hadoop.hive.metastore.api.StorageDescriptor()
    +      val sd = toHiveStorage(p.storage, catalogTable.schema.map(toHiveColumn))
           tPartition.setSd(sd)
    -      sd.setCols(catalogTable.schema.map(toHiveColumn).asJava)
    -      p.storage.locationUri.foreach(sd.setLocation)
    -      p.storage.inputFormat.foreach(sd.setInputFormat)
    -      p.storage.outputFormat.foreach(sd.setOutputFormat)
    +      new Partition(hiveQlTable, tPartition)
    +    }
    +  }
     
    -      val serdeInfo = new org.apache.hadoop.hive.metastore.api.SerDeInfo
    -      sd.setSerdeInfo(serdeInfo)
    -      // maps and lists should be set only after all elements are ready (see HIVE-7975)
    -      p.storage.serde.foreach(serdeInfo.setSerializationLib)
    +  private def toHiveStorage(storage: CatalogStorageFormat, schema: Seq[FieldSchema]) = {
    --- End diff --
    
    This function has a bug, right? `storage` is not being used.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14071: [SPARK-16397][SQL][WIP] make CatalogTable more general a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14071
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14071: [SPARK-16397][SQL] make CatalogTable more general and le...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14071
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61951/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14071: [SPARK-16397][SQL][WIP] make CatalogTable more general a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14071
  
    **[Test build #61858 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61858/consoleFull)** for PR 14071 at commit [`4ef7dd0`](https://github.com/apache/spark/commit/4ef7dd097f43947f7822ccb66ad0af1c5513ed43).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14071: [SPARK-16397][SQL] make CatalogTable more general and le...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14071
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61959/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14071: [SPARK-16397][SQL] make CatalogTable more general...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14071#discussion_r70003918
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
    @@ -939,42 +940,33 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {
         // to include the partition columns here explicitly
         val schema = cols ++ partitionCols
     
    -    // Storage format
    -    val defaultStorage: CatalogStorageFormat = {
    -      val defaultStorageType = conf.getConfString("hive.default.fileformat", "textfile")
    -      val defaultHiveSerde = HiveSerDe.sourceToSerDe(defaultStorageType, conf)
    -      CatalogStorageFormat(
    -        locationUri = None,
    -        inputFormat = defaultHiveSerde.flatMap(_.inputFormat)
    -          .orElse(Some("org.apache.hadoop.mapred.TextInputFormat")),
    -        outputFormat = defaultHiveSerde.flatMap(_.outputFormat)
    -          .orElse(Some("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat")),
    -        // Note: Keep this unspecified because we use the presence of the serde to decide
    -        // whether to convert a table created by CTAS to a datasource table.
    -        serde = None,
    -        compressed = false,
    -        serdeProperties = Map())
    -    }
         validateRowFormatFileFormat(ctx.rowFormat, ctx.createFileFormat, ctx)
    -    val fileStorage = Option(ctx.createFileFormat).map(visitCreateFileFormat)
    -      .getOrElse(CatalogStorageFormat.empty)
    -    val rowStorage = Option(ctx.rowFormat).map(visitRowFormat)
    -      .getOrElse(CatalogStorageFormat.empty)
    -    val location = Option(ctx.locationSpec).map(visitLocationSpec)
    +    var storage = CatalogStorageFormat(
    +      locationUri = Option(ctx.locationSpec).map(visitLocationSpec),
    +      provider = Some("hive"),
    +      properties = Map.empty)
    +    Option(ctx.createFileFormat).foreach(ctx => storage = getFileFormat(ctx, storage))
    +    Option(ctx.rowFormat).foreach(ctx => storage = getRowFormat(ctx, storage))
    --- End diff --
    
    Originally, the value of `serde` is set using the following logics:
    ```
    serde = rowStorage.serde.orElse(fileStorage.serde).orElse(defaultStorage.serde)
    ```
    Now, we are missing the last case, right?
    
    In addition, maybe we can write a description to document it. `getFileFormat` might first assign an initial value; then, `getRowFormat` could overwrite it. If both does not set it, we use the default value.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14071: [SPARK-16397][SQL] make CatalogTable more general and le...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14071
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61914/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14071: [SPARK-16397][SQL] make CatalogTable more general and le...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14071
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org