Posted to reviews@spark.apache.org by gatorsmile <gi...@git.apache.org> on 2016/07/14 20:11:42 UTC

[GitHub] spark pull request #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Sche...

GitHub user gatorsmile opened a pull request:

    https://github.com/apache/spark/pull/14207

    [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas into External Catalog Tables when Creating Tables

    #### What changes were proposed in this pull request?
    
    Currently, in Spark SQL, the initial creation of a table's schema can be classified into two groups. This applies to both Hive tables and data source tables:
    
    **Group A. Users specify the schema.** 
    
    _Case 1 CREATE TABLE AS SELECT_: the schema is determined by the result schema of the SELECT clause. For example,
    ```SQL
    CREATE TABLE tab STORED AS TEXTFILE
    AS SELECT * FROM input
    ```
    
    _Case 2 CREATE TABLE_: users explicitly specify the schema. For example,
    ```SQL
    CREATE TABLE jsonTable (_1 string, _2 string)
    USING org.apache.spark.sql.json
    ```
    
    **Group B. Spark SQL infers the schema at runtime.**
    
    _Case 3 CREATE TABLE_: users do not specify the schema, only the path to the file location. For example,
    ```SQL
    CREATE TABLE jsonTable 
    USING org.apache.spark.sql.json
    OPTIONS (path '${tempDir.getCanonicalPath}')
    ```
    
    Before this PR, Spark SQL does not store the inferred schema in the external catalog for the cases in Group B. When users refresh the metadata cache or access the table for the first time after (re-)starting Spark, Spark SQL infers the schema and stores it in the metadata cache to speed up subsequent metadata requests. However, this runtime schema inference could cause undesirable schema changes after each reboot of Spark.
    
    This PR is to store the inferred schema in the external catalog when creating the table. When users want to refresh the schema, they issue `REFRESH TABLE`. Spark SQL then infers the schema again based on the previously specified table location and updates the schema in both the external catalog and the metadata cache.
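    
    For illustration, a minimal sketch of the intended workflow, assuming a `SparkSession` named `spark` and a hypothetical JSON directory `/data/events`:
    ```scala
    // With this PR, the schema inferred at creation time is persisted in the
    // external catalog instead of being re-inferred after every (re)start.
    spark.sql(
      """CREATE TABLE eventsTable
        |USING org.apache.spark.sql.json
        |OPTIONS (path '/data/events')
      """.stripMargin)

    // ... an external system later appends files with new columns ...

    // Re-infer the schema from the table location and update it in both the
    // external catalog and the metadata cache.
    spark.sql("REFRESH TABLE eventsTable")
    ```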
    
    In this PR, we do not use the inferred schema to replace the user-specified schema, to avoid external behavior changes. By design, user-specified schemas (as described in Group A) could be changed by ALTER TABLE commands, although we do not support those commands yet.
    
    
    #### How was this patch tested?
    TODO: add more test cases to cover the changes.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark userSpecifiedSchema

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14207.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14207
    
----
commit 3c992a9eb39e3258776e52d0524b8bc46bc3ee08
Author: gatorsmile <ga...@gmail.com>
Date:   2016-07-14T07:08:43Z

    fix.

commit 5ed4e68283dd0ee0ad5deddc787eae8fe47f7574
Author: gatorsmile <ga...@gmail.com>
Date:   2016-07-14T07:10:37Z

    Merge remote-tracking branch 'upstream/master' into userSpecifiedSchema

commit 3be0dc0b7cfd942459c598c0d35f3d67a2c020ba
Author: gatorsmile <ga...@gmail.com>
Date:   2016-07-14T19:19:40Z

    fix.

----




[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    **[Test build #62513 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62513/consoleFull)** for PR 14207 at commit [`55c2c5e`](https://github.com/apache/spark/commit/55c2c5e2623478a79971af3b0513727b03c1ee87).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    **[Test build #62580 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62580/consoleFull)** for PR 14207 at commit [`727ecf8`](https://github.com/apache/spark/commit/727ecf87463d6fe02cd29e0bbf3f488c043b1962).




[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r73940895
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
    @@ -521,31 +521,29 @@ object DDLUtils {
         table.partitionColumns.nonEmpty || table.properties.contains(DATASOURCE_SCHEMA_NUMPARTCOLS)
       }
     
    -  // A persisted data source table may not store its schema in the catalog. In this case, its schema
    -  // will be inferred at runtime when the table is referenced.
    -  def getSchemaFromTableProperties(metadata: CatalogTable): Option[StructType] = {
    +  // A persisted data source table always store its schema in the catalog.
    +  def getSchemaFromTableProperties(metadata: CatalogTable): StructType = {
         require(isDatasourceTable(metadata))
    +    val msgSchemaCorrupted = "Could not read schema from the metastore because it is corrupted."
         val props = metadata.properties
    -    if (props.isDefinedAt(DATASOURCE_SCHEMA)) {
    +    props.get(DATASOURCE_SCHEMA).map { schema =>
           // Originally, we used spark.sql.sources.schema to store the schema of a data source table.
           // After SPARK-6024, we removed this flag.
           // Although we are not using spark.sql.sources.schema any more, we need to still support.
    -      props.get(DATASOURCE_SCHEMA).map(DataType.fromJson(_).asInstanceOf[StructType])
    -    } else {
    -      metadata.properties.get(DATASOURCE_SCHEMA_NUMPARTS).map { numParts =>
    +      DataType.fromJson(schema).asInstanceOf[StructType]
    +    } getOrElse {
    --- End diff --
    
    I am not sure if `getOrElse` makes the code easier to follow.
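    
    For comparison, an illustrative if/else version of the branch in question; `props` and `DATASOURCE_SCHEMA` come from the diff above, while `readSplitSchema` is a hypothetical helper standing in for the multi-part branch:
    ```scala
    val schema: StructType =
      if (props.contains(DATASOURCE_SCHEMA)) {
        // The whole schema is stored in a single table property.
        DataType.fromJson(props(DATASOURCE_SCHEMA)).asInstanceOf[StructType]
      } else {
        // Hypothetical helper: reassemble the schema from its numbered parts.
        readSplitSchema(props)
      }
    ```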




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71472473
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
    @@ -95,17 +95,41 @@ case class CreateDataSourceTableCommand(
           }
     
         // Create the relation to validate the arguments before writing the metadata to the metastore.
    -    DataSource(
    -      sparkSession = sparkSession,
    -      userSpecifiedSchema = userSpecifiedSchema,
    -      className = provider,
    -      bucketSpec = None,
    -      options = optionsWithPath).resolveRelation(checkPathExist = false)
    +    val dataSource: BaseRelation =
    +      DataSource(
    +        sparkSession = sparkSession,
    +        userSpecifiedSchema = userSpecifiedSchema,
    +        className = provider,
    +        bucketSpec = None,
    +        options = optionsWithPath).resolveRelation(checkPathExist = false)
    +
    +    val partitionColumns =
    --- End diff --
    
    Sure, will do it. Thanks!




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71453356
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
    @@ -95,17 +95,38 @@ case class CreateDataSourceTableCommand(
           }
     
         // Create the relation to validate the arguments before writing the metadata to the metastore.
    -    DataSource(
    -      sparkSession = sparkSession,
    -      userSpecifiedSchema = userSpecifiedSchema,
    -      className = provider,
    -      bucketSpec = None,
    -      options = optionsWithPath).resolveRelation(checkPathExist = false)
    +    val dataSource: HadoopFsRelation =
    +      DataSource(
    +        sparkSession = sparkSession,
    +        userSpecifiedSchema = userSpecifiedSchema,
    +        className = provider,
    +        bucketSpec = None,
    +        options = optionsWithPath)
    +        .resolveRelation(checkPathExist = false).asInstanceOf[HadoopFsRelation]
    --- End diff --
    
    I think a safer way is to do a pattern match here: if it's `HadoopFsRelation`, get its partition columns; otherwise, there are no partition columns.
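    
    A minimal sketch of the suggested pattern match, where `relation` stands for the result of `resolveRelation` above (illustrative only):
    ```scala
    // Only HadoopFsRelation carries partition columns; any other
    // BaseRelation contributes none.
    val partitionColumns: Array[String] = relation match {
      case r: HadoopFsRelation => r.partitionSchema.fieldNames
      case _ => Array.empty[String]
    }
    ```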




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71452979
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
    @@ -52,7 +52,7 @@ case class CreateDataSourceTableCommand(
         userSpecifiedSchema: Option[StructType],
         provider: String,
         options: Map[String, String],
    -    partitionColumns: Array[String],
    +    userSpecifiedPartitionColumns: Array[String],
    --- End diff --
    
    Use `Option[Array[String]]`? Or we could make `userSpecifiedSchema` a `StructType` and use `length == 0` to indicate that there is no user-specified schema.
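    
    A sketch of the two alternatives under discussion; both case classes are illustrative stand-ins, not the actual PR code:
    ```scala
    import org.apache.spark.sql.types.StructType

    // Alternative 1: the absence of user input is explicit via Option.
    case class CommandV1(
        userSpecifiedSchema: Option[StructType],
        userSpecifiedPartitionColumns: Option[Array[String]])

    // Alternative 2: emptiness encodes "not specified".
    case class CommandV2(
        userSpecifiedSchema: StructType,
        userSpecifiedPartitionColumns: Array[String]) {
      def schemaIsUserSpecified: Boolean = userSpecifiedSchema.nonEmpty
    }
    ```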




[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    @gatorsmile When the data/files are produced by an external system and Spark is just used to process them in batch, does that mean the schema can become inconsistent? Or should it call refresh every time it is going to query the table?




[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    thanks, merging to master!
    
    cc @yhuai @liancheng we will address your comments in follow-up PRs if you have any.




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Sche...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71363705
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala ---
    @@ -223,6 +223,9 @@ abstract class Catalog {
        * If this table is cached as an InMemoryRelation, drop the original cached version and make the
        * new version cached lazily.
        *
    +   * If the table's schema is inferred at runtime, infer the schema again and update the schema
    --- End diff --
    
    cc @rxin, I'm thinking about the main reason for allowing the table schema to be inferred at run time. IIRC, it's mainly because we wanna save some typing when creating a data source table by SQL string, which usually has a very long schema, e.g. for JSON files.
    
    If this is true, then the table schema is not supposed to change. If users do wanna change it, I'd argue that it's a different table; users should drop this table and create a new one. Then we don't need to make `refresh table` support schema changes, and thus don't need to store the `DATASOURCE_SCHEMA_ISINFERRED` flag.




[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    **[Test build #62581 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62581/consoleFull)** for PR 14207 at commit [`1ee1743`](https://github.com/apache/spark/commit/1ee1743906b41ffcc182cb8c74b4134bce8a3006).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71456320
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
    @@ -252,6 +252,115 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
         }
       }
     
    +  test("Create data source table with partitioning columns but no schema") {
    +    import testImplicits._
    +
    +    val tabName = "tab1"
    +    withTempPath { dir =>
    +      val pathToPartitionedTable = new File(dir, "partitioned")
    +      val pathToNonPartitionedTable = new File(dir, "nonPartitioned")
    +      val df = sparkContext.parallelize(1 to 10).map(i => (i, i.toString)).toDF("num", "str")
    +      df.write.format("parquet").save(pathToNonPartitionedTable.getCanonicalPath)
    +      df.write.format("parquet").partitionBy("num").save(pathToPartitionedTable.getCanonicalPath)
    +
    +      Seq(pathToPartitionedTable, pathToNonPartitionedTable).foreach { path =>
    +        withTable(tabName) {
    +          spark.sql(
    +            s"""
    +               |CREATE TABLE $tabName
    +               |USING parquet
    +               |OPTIONS (
    +               |  path '$path'
    +               |)
    +               |PARTITIONED BY (inexistentColumns)
    +             """.stripMargin)
    +          val catalog = spark.sessionState.catalog
    +          val tableMetadata = catalog.getTableMetadata(TableIdentifier(tabName))
    +
    +          val tableSchema = DDLUtils.getSchemaFromTableProperties(tableMetadata)
    +          assert(tableSchema.nonEmpty, "the schema of data source tables are always recorded")
    +          val partCols = DDLUtils.getPartitionColumnsFromTableProperties(tableMetadata)
    +
    +          if (tableMetadata.storage.serdeProperties.get("path") ==
    --- End diff --
    
    This test case verifies two scenarios: one uses the path to a partitioned table (i.e., `pathToPartitionedTable`), the other the path to a non-partitioned table (i.e., `pathToNonPartitionedTable`). This condition just checks which path is being used; if the path points to `pathToNonPartitionedTable`, it returns false.




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r73941472
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
    @@ -95,17 +95,39 @@ case class CreateDataSourceTableCommand(
           }
     
         // Create the relation to validate the arguments before writing the metadata to the metastore.
    -    DataSource(
    -      sparkSession = sparkSession,
    -      userSpecifiedSchema = userSpecifiedSchema,
    -      className = provider,
    -      bucketSpec = None,
    -      options = optionsWithPath).resolveRelation(checkPathExist = false)
    +    val dataSource: BaseRelation =
    +      DataSource(
    +        sparkSession = sparkSession,
    +        userSpecifiedSchema = userSpecifiedSchema,
    +        className = provider,
    +        bucketSpec = None,
    +        options = optionsWithPath).resolveRelation(checkPathExist = false)
    +
    +    val partitionColumns = if (userSpecifiedSchema.nonEmpty) {
    +      userSpecifiedPartitionColumns
    +    } else {
    +      val res = dataSource match {
    +        case r: HadoopFsRelation => r.partitionSchema.fieldNames
    +        case _ => Array.empty[String]
    +      }
    +      if (userSpecifiedPartitionColumns.length > 0) {
    +        // The table does not have a specified schema, which means that the schema will be inferred
    +        // when we load the table. So, we are not expecting partition columns and we will discover
    +        // partitions when we load the table. However, if there are specified partition columns,
    +        // we simply ignore them and provide a warning message.
    +        logWarning(
    +          s"Specified partition columns (${userSpecifiedPartitionColumns.mkString(",")}) will be " +
    +            s"ignored. The schema and partition columns of table $tableIdent are inferred. " +
    +            s"Schema: ${dataSource.schema.simpleString}; " +
    +            s"Partition columns: ${res.mkString("(", ", ", ")")}")
    +      }
    +      res
    +    }
     
         CreateDataSourceTableUtils.createDataSourceTable(
           sparkSession = sparkSession,
           tableIdent = tableIdent,
    -      userSpecifiedSchema = userSpecifiedSchema,
    +      schema = dataSource.schema,
    --- End diff --
    
    From the code alone, I think it is not very clear that `dataSource.schema` will be `userSpecifiedSchema`.
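    
    One illustrative way to make that relationship explicit at the call site, reusing the names from the diff (not the actual PR code):
    ```scala
    // Prefer the user-specified schema when present; otherwise fall back to
    // the schema that the resolved relation inferred.
    val schema: StructType = userSpecifiedSchema.getOrElse(dataSource.schema)
    ```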




[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    @gatorsmile
    
    Where is the change for the following description?
    ```
    This PR is to store the inferred schema in the external catalog when creating the table. When users intend to refresh the schema after possible changes on external files (table location), they issue REFRESH TABLE. Spark SQL will infer the schema again based on the previously specified table location and update/refresh the schema in the external catalog and metadata cache.
    ```




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71453539
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
    @@ -252,6 +252,115 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
         }
       }
     
    +  test("Create data source table with partitioning columns but no schema") {
    +    import testImplicits._
    +
    +    val tabName = "tab1"
    +    withTempPath { dir =>
    +      val pathToPartitionedTable = new File(dir, "partitioned")
    +      val pathToNonPartitionedTable = new File(dir, "nonPartitioned")
    +      val df = sparkContext.parallelize(1 to 10).map(i => (i, i.toString)).toDF("num", "str")
    +      df.write.format("parquet").save(pathToNonPartitionedTable.getCanonicalPath)
    +      df.write.format("parquet").partitionBy("num").save(pathToPartitionedTable.getCanonicalPath)
    +
    +      Seq(pathToPartitionedTable, pathToNonPartitionedTable).foreach { path =>
    +        withTable(tabName) {
    +          spark.sql(
    +            s"""
    +               |CREATE TABLE $tabName
    +               |USING parquet
    +               |OPTIONS (
    +               |  path '$path'
    +               |)
    +               |PARTITIONED BY (inexistentColumns)
    --- End diff --
    
    this doesn't fail?




[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71474127
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
    @@ -316,27 +340,25 @@ object CreateDataSourceTableUtils extends Logging {
         tableProperties.put(DATASOURCE_PROVIDER, provider)
     
         // Saves optional user specified schema.  Serialized JSON schema string may be too long to be
    --- End diff --
    
    I think this comment is not correct anymore?




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r73969863
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
    @@ -521,31 +521,29 @@ object DDLUtils {
         table.partitionColumns.nonEmpty || table.properties.contains(DATASOURCE_SCHEMA_NUMPARTCOLS)
       }
     
    -  // A persisted data source table may not store its schema in the catalog. In this case, its schema
    -  // will be inferred at runtime when the table is referenced.
    -  def getSchemaFromTableProperties(metadata: CatalogTable): Option[StructType] = {
    +  // A persisted data source table always store its schema in the catalog.
    +  def getSchemaFromTableProperties(metadata: CatalogTable): StructType = {
         require(isDatasourceTable(metadata))
    +    val msgSchemaCorrupted = "Could not read schema from the metastore because it is corrupted."
         val props = metadata.properties
    -    if (props.isDefinedAt(DATASOURCE_SCHEMA)) {
    +    props.get(DATASOURCE_SCHEMA).map { schema =>
           // Originally, we used spark.sql.sources.schema to store the schema of a data source table.
           // After SPARK-6024, we removed this flag.
           // Although we are not using spark.sql.sources.schema any more, we need to still support.
    -      props.get(DATASOURCE_SCHEMA).map(DataType.fromJson(_).asInstanceOf[StructType])
    -    } else {
    -      metadata.properties.get(DATASOURCE_SCHEMA_NUMPARTS).map { numParts =>
    +      DataType.fromJson(schema).asInstanceOf[StructType]
    +    } getOrElse {
    +      props.get(DATASOURCE_SCHEMA_NUMPARTS).map { numParts =>
             val parts = (0 until numParts.toInt).map { index =>
               val part = metadata.properties.get(s"$DATASOURCE_SCHEMA_PART_PREFIX$index").orNull
               if (part == null) {
    -            throw new AnalysisException(
    -              "Could not read schema from the metastore because it is corrupted " +
    -                s"(missing part $index of the schema, $numParts parts are expected).")
    +            throw new AnalysisException(msgSchemaCorrupted +
    +              s" (missing part $index of the schema, $numParts parts are expected).")
               }
    -
               part
             }
             // Stick all parts back to a single schema string.
             DataType.fromJson(parts.mkString).asInstanceOf[StructType]
    -      }
    +      } getOrElse(throw new AnalysisException(msgSchemaCorrupted))
    --- End diff --
    
    : )




[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62581/
    Test PASSed.




[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    retest this please




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71471942
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
    @@ -252,6 +252,165 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
         }
       }
     
    +  test("Create partitioned data source table with partitioning columns but no schema") {
    +    import testImplicits._
    +
    +    withTempPath { dir =>
    +      val pathToPartitionedTable = new File(dir, "partitioned")
    +      val df = sparkContext.parallelize(1 to 10).map(i => (i, i.toString)).toDF("num", "str")
    +      df.write.format("parquet").partitionBy("num").save(pathToPartitionedTable.getCanonicalPath)
    +      val tabName = "tab1"
    +      withTable(tabName) {
    +        spark.sql(
    +          s"""
    +             |CREATE TABLE $tabName
    +             |USING parquet
    +             |OPTIONS (
    +             |  path '$pathToPartitionedTable'
    +             |)
    +             |PARTITIONED BY (inexistentColumns)
    +           """.stripMargin)
    +        val tableMetadata = spark.sessionState.catalog.getTableMetadata(TableIdentifier(tabName))
    --- End diff --
    
    Sure, will do




[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    **[Test build #62632 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62632/consoleFull)** for PR 14207 at commit [`224b048`](https://github.com/apache/spark/commit/224b0489917e53116a7122ed8e97b8b7f9af4966).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    **[Test build #62647 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62647/consoleFull)** for PR 14207 at commit [`264ad35`](https://github.com/apache/spark/commit/264ad35a1a749e14f8d8a33e4977cddda0916204).




[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    **[Test build #62580 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62580/consoleFull)** for PR 14207 at commit [`727ecf8`](https://github.com/apache/spark/commit/727ecf87463d6fe02cd29e0bbf3f488c043b1962).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62926/
    Test PASSed.




[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    **[Test build #62628 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62628/consoleFull)** for PR 14207 at commit [`b404eec`](https://github.com/apache/spark/commit/b404eecfd69dd73124157b339d6d68939ad040aa).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    **[Test build #62344 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62344/consoleFull)** for PR 14207 at commit [`3be0dc0`](https://github.com/apache/spark/commit/3be0dc0b7cfd942459c598c0d35f3d67a2c020ba).




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71458430
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
    @@ -518,6 +510,19 @@ case class DescribeTableCommand(table: TableIdentifier, isExtended: Boolean, isF
         }
       }
     
    +  private def describeSchema(
    +      tableDesc: CatalogTable,
    +      buffer: ArrayBuffer[Row]): Unit = {
    +    if (DDLUtils.isDatasourceTable(tableDesc)) {
    +      DDLUtils.getSchemaFromTableProperties(tableDesc) match {
    --- End diff --
    
    Can we make `DDLUtils.getSchemaFromTableProperties` always return a schema and throw an exception if it's corrupted? I think it's more consistent with the previous behaviour, i.e. throwing an exception if the expected schema properties don't exist.
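    
    A minimal sketch of the suggested contract, using the names from the diff; the multi-part schema branch is elided for brevity:
    ```scala
    // Always return a StructType; fail loudly when the stored schema is
    // missing or corrupted.
    def getSchemaFromTableProperties(metadata: CatalogTable): StructType =
      metadata.properties.get(DATASOURCE_SCHEMA)
        .map(DataType.fromJson(_).asInstanceOf[StructType])
        .getOrElse(throw new AnalysisException(
          "Could not read schema from the metastore because it is corrupted."))
    ```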




[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    **[Test build #62344 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62344/consoleFull)** for PR 14207 at commit [`3be0dc0`](https://github.com/apache/spark/commit/3be0dc0b7cfd942459c598c0d35f3d67a2c020ba).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71456428
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
    @@ -95,17 +95,38 @@ case class CreateDataSourceTableCommand(
           }
     
         // Create the relation to validate the arguments before writing the metadata to the metastore.
    -    DataSource(
    -      sparkSession = sparkSession,
    -      userSpecifiedSchema = userSpecifiedSchema,
    -      className = provider,
    -      bucketSpec = None,
    -      options = optionsWithPath).resolveRelation(checkPathExist = false)
    +    val dataSource: HadoopFsRelation =
    +      DataSource(
    +        sparkSession = sparkSession,
    +        userSpecifiedSchema = userSpecifiedSchema,
    +        className = provider,
    +        bucketSpec = None,
    +        options = optionsWithPath)
    +        .resolveRelation(checkPathExist = false).asInstanceOf[HadoopFsRelation]
    --- End diff --
    
    Sure, will do it.




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71457323
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
    @@ -252,6 +252,115 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
         }
       }
     
    +  test("Create data source table with partitioning columns but no schema") {
    +    import testImplicits._
    +
    +    val tabName = "tab1"
    +    withTempPath { dir =>
    +      val pathToPartitionedTable = new File(dir, "partitioned")
    +      val pathToNonPartitionedTable = new File(dir, "nonPartitioned")
    +      val df = sparkContext.parallelize(1 to 10).map(i => (i, i.toString)).toDF("num", "str")
    +      df.write.format("parquet").save(pathToNonPartitionedTable.getCanonicalPath)
    +      df.write.format("parquet").partitionBy("num").save(pathToPartitionedTable.getCanonicalPath)
    +
    +      Seq(pathToPartitionedTable, pathToNonPartitionedTable).foreach { path =>
    +        withTable(tabName) {
    +          spark.sql(
    +            s"""
    +               |CREATE TABLE $tabName
    +               |USING parquet
    +               |OPTIONS (
    +               |  path '$path'
    +               |)
    +               |PARTITIONED BY (inexistentColumns)
    +             """.stripMargin)
    +          val catalog = spark.sessionState.catalog
    +          val tableMetadata = catalog.getTableMetadata(TableIdentifier(tabName))
    +
    +          val tableSchema = DDLUtils.getSchemaFromTableProperties(tableMetadata)
    +          assert(tableSchema.nonEmpty, "the schema of data source tables are always recorded")
    +          val partCols = DDLUtils.getPartitionColumnsFromTableProperties(tableMetadata)
    +
    +          if (tableMetadata.storage.serdeProperties.get("path") ==
    --- End diff --
    
    hmmm, can we separate it into 2 cases instead of doing `Seq(...).foreach`?
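    
    A sketch of the suggested split; test names and bodies are illustrative:
    ```scala
    test("create data source table from a partitioned path without schema") {
      // set up pathToPartitionedTable, run CREATE TABLE over it, then assert
      // that the discovered partition columns are recorded in the catalog
    }

    test("create data source table from a non-partitioned path without schema") {
      // set up pathToNonPartitionedTable, run CREATE TABLE over it, then
      // assert that no partition columns are recorded
    }
    ```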




[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    > when the data/files are changed by an external system (e.g., appended by a streaming system), the stored schema can be inconsistent with the actual schema of the data.
    
    I think this problem already exists, as we use the cached schema instead of inferring it every time. The only difference is that after a reboot, this PR will still use the stored schema and require users to refresh the table manually.




[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62552/
    Test PASSed.




[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62512/
    Test PASSed.




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71632333
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
    @@ -252,6 +252,222 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
         }
       }
     
    +  private def createDataSourceTable(
    +      path: File,
    +      userSpecifiedSchema: Option[String],
    +      userSpecifiedPartitionCols: Option[String]): (StructType, Seq[String]) = {
    --- End diff --
    
    nit: rename it to `userSpecifiedPartitionCol`, or declare it as `Seq[String]`




[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    **[Test build #62926 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62926/consoleFull)** for PR 14207 at commit [`b694d8b`](https://github.com/apache/spark/commit/b694d8bd3666e54e4d3ab972edcb04f2be64669b).




[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    @rxin @cloud-fan @yhuai 
    
    This PR introduces a new concept, `SchemaType`, to record the original source of a schema. When the `SchemaType` is `USER`, the table belongs to `Group A`; when it is `INFERRED`, the table requires schema inference, i.e., `Group B`.
    
    Not sure whether this solution sounds OK to you. Let me know whether this is the right direction for resolving the issue. Thanks!
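    For readers following the thread, a minimal sketch of the idea, assuming an enum-like `SchemaType` with a `name` field; the property key string below is hypothetical:
    
    ```scala
    // Hypothetical sketch, not the PR's exact code.
    sealed abstract class SchemaType(val name: String)
    object SchemaType {
      case object USER extends SchemaType("USER")         // Group A: user-specified schema
      case object INFERRED extends SchemaType("INFERRED") // Group B: schema inferred at creation
    }
    
    // The schema's origin is persisted as a table property; this key string is assumed.
    val DATASOURCE_SCHEMA_TYPE = "spark.sql.sources.schema.type"
    
    def isSchemaInferred(props: Map[String, String]): Boolean =
      props.get(DATASOURCE_SCHEMA_TYPE) == Option(SchemaType.INFERRED.name)
    ```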




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71632699
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
    @@ -252,6 +252,222 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
         }
       }
     
    +  private def createDataSourceTable(
    +      path: File,
    +      userSpecifiedSchema: Option[String],
    +      userSpecifiedPartitionCols: Option[String]): (StructType, Seq[String]) = {
    +    var tableSchema = StructType(Nil)
    +    var partCols = Seq.empty[String]
    +
    +    val tabName = "tab1"
    +    withTable(tabName) {
    +      val partitionClause =
    +        userSpecifiedPartitionCols.map(p => s"PARTITIONED BY ($p)").getOrElse("")
    +      val schemaClause = userSpecifiedSchema.map(s => s"($s)").getOrElse("")
    +      sql(
    +        s"""
    +           |CREATE TABLE $tabName $schemaClause
    +           |USING parquet
    +           |OPTIONS (
    +           |  path '$path'
    +           |)
    +           |$partitionClause
    +         """.stripMargin)
    +      val tableMetadata = spark.sessionState.catalog.getTableMetadata(TableIdentifier(tabName))
    +
    +      tableSchema = DDLUtils.getSchemaFromTableProperties(tableMetadata)
    +      partCols = DDLUtils.getPartitionColumnsFromTableProperties(tableMetadata)
    +    }
    +    (tableSchema, partCols)
    +  }
    +
    +  test("Create partitioned data source table without user specified schema") {
    +    import testImplicits._
    +    val df = sparkContext.parallelize(1 to 10).map(i => (i, i.toString)).toDF("num", "str")
    +
    +    // Case 1: with partitioning columns but no schema: Option("inexistentColumns")
    +    // Case 2: without schema and partitioning columns: None
    +    Seq(Option("inexistentColumns"), None).foreach { partitionCols =>
    +      withTempPath { pathToPartitionedTable =>
    +        df.write.format("parquet").partitionBy("num")
    +          .save(pathToPartitionedTable.getCanonicalPath)
    +        val (tableSchema, partCols) =
    +          createDataSourceTable(
    +            pathToPartitionedTable,
    +            userSpecifiedSchema = None,
    +            userSpecifiedPartitionCols = partitionCols)
    +        assert(tableSchema ==
    +          StructType(StructField("str", StringType, nullable = true) ::
    --- End diff --
    
    nit: `new StructType().add...`
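    For illustration, both styles below build the same schema; the builder form is simply more compact (sketch using the public `org.apache.spark.sql.types` API):
    
    ```scala
    import org.apache.spark.sql.types._
    
    val verbose = StructType(
      StructField("str", StringType, nullable = true) ::
      StructField("num", IntegerType, nullable = true) :: Nil)
    
    val concise = new StructType()
      .add("str", StringType)  // nullable defaults to true
      .add("num", IntegerType)
    
    assert(verbose == concise)
    ```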




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71470908
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
    @@ -522,31 +522,31 @@ object DDLUtils {
         table.partitionColumns.nonEmpty || table.properties.contains(DATASOURCE_SCHEMA_NUMPARTCOLS)
       }
     
    -  // A persisted data source table may not store its schema in the catalog. In this case, its schema
    -  // will be inferred at runtime when the table is referenced.
    -  def getSchemaFromTableProperties(metadata: CatalogTable): Option[StructType] = {
    +  // A persisted data source table always store its schema in the catalog.
    +  def getSchemaFromTableProperties(metadata: CatalogTable): StructType = {
         require(isDatasourceTable(metadata))
    +    val msgSchemaCorrupted = "Could not read schema from the metastore because it is corrupted."
         val props = metadata.properties
         if (props.isDefinedAt(DATASOURCE_SCHEMA)) {
    --- End diff --
    
    Sure, let me change it. Thanks!




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/14207




[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    **[Test build #62647 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62647/consoleFull)** for PR 14207 at commit [`264ad35`](https://github.com/apache/spark/commit/264ad35a1a749e14f8d8a33e4977cddda0916204).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    **[Test build #62512 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62512/consoleFull)** for PR 14207 at commit [`c6afbbb`](https://github.com/apache/spark/commit/c6afbbb9941113d6a78bfd3aaa627653ba0f6151).




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71633259
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
    @@ -252,6 +252,222 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
         }
       }
     
    +  private def createDataSourceTable(
    +      path: File,
    +      userSpecifiedSchema: Option[String],
    +      userSpecifiedPartitionCols: Option[String]): (StructType, Seq[String]) = {
    +    var tableSchema = StructType(Nil)
    +    var partCols = Seq.empty[String]
    +
    +    val tabName = "tab1"
    +    withTable(tabName) {
    +      val partitionClause =
    +        userSpecifiedPartitionCols.map(p => s"PARTITIONED BY ($p)").getOrElse("")
    +      val schemaClause = userSpecifiedSchema.map(s => s"($s)").getOrElse("")
    +      sql(
    +        s"""
    +           |CREATE TABLE $tabName $schemaClause
    +           |USING parquet
    +           |OPTIONS (
    +           |  path '$path'
    +           |)
    +           |$partitionClause
    +         """.stripMargin)
    +      val tableMetadata = spark.sessionState.catalog.getTableMetadata(TableIdentifier(tabName))
    +
    +      tableSchema = DDLUtils.getSchemaFromTableProperties(tableMetadata)
    +      partCols = DDLUtils.getPartitionColumnsFromTableProperties(tableMetadata)
    +    }
    +    (tableSchema, partCols)
    +  }
    +
    +  test("Create partitioned data source table without user specified schema") {
    +    import testImplicits._
    +    val df = sparkContext.parallelize(1 to 10).map(i => (i, i.toString)).toDF("num", "str")
    +
    +    // Case 1: with partitioning columns but no schema: Option("inexistentColumns")
    +    // Case 2: without schema and partitioning columns: None
    +    Seq(Option("inexistentColumns"), None).foreach { partitionCols =>
    +      withTempPath { pathToPartitionedTable =>
    +        df.write.format("parquet").partitionBy("num")
    +          .save(pathToPartitionedTable.getCanonicalPath)
    +        val (tableSchema, partCols) =
    +          createDataSourceTable(
    +            pathToPartitionedTable,
    +            userSpecifiedSchema = None,
    +            userSpecifiedPartitionCols = partitionCols)
    +        assert(tableSchema ==
    +          StructType(StructField("str", StringType, nullable = true) ::
    +            StructField("num", IntegerType, nullable = true) :: Nil))
    +        assert(partCols == Seq("num"))
    +      }
    +    }
    +  }
    +
    +  test("Create partitioned data source table with user specified schema") {
    +    import testImplicits._
    +    val df = sparkContext.parallelize(1 to 10).map(i => (i, i.toString)).toDF("num", "str")
    +
    +    // Case 1: with user-specified partitioning columns: Option("num")
    +    // Case 2: without partitioning columns: None
    +    Seq(Option("num"), None).foreach { partitionCols =>
    +      withTempPath { pathToPartitionedTable =>
    +        df.write.format("parquet").partitionBy("num")
    +          .save(pathToPartitionedTable.getCanonicalPath)
    +        val (tableSchema, partCols) =
    +          createDataSourceTable(
    +            pathToPartitionedTable,
    +            userSpecifiedSchema = Option("num int, str string"),
    +            userSpecifiedPartitionCols = partitionCols)
    +        assert(tableSchema ==
    +          StructType(StructField("num", IntegerType, nullable = true) ::
    +            StructField("str", StringType, nullable = true) :: Nil))
    +        assert(partCols.mkString(", ") == partitionCols.getOrElse(""))
    +      }
    +    }
    +  }
    +
    +  test("Create non-partitioned data source table without user specified schema") {
    +    import testImplicits._
    +    val df = sparkContext.parallelize(1 to 10).map(i => (i, i.toString)).toDF("num", "str")
    +
    +    // Case 1: with partitioning columns but no schema: Option("inexistentColumns")
    +    // Case 2: without schema and partitioning columns: None
    +    Seq(Option("inexistentColumns"), None).foreach { partitionCols =>
    +      withTempPath { pathToNonPartitionedTable =>
    +        df.write.format("parquet").save(pathToNonPartitionedTable.getCanonicalPath)
    +        val (tableSchema, partCols) =
    +          createDataSourceTable(
    +            pathToNonPartitionedTable,
    +            userSpecifiedSchema = None,
    +            userSpecifiedPartitionCols = partitionCols)
    +        assert(tableSchema ==
    +          StructType(StructField("num", IntegerType, nullable = true) ::
    +            StructField("str", StringType, nullable = true) :: Nil))
    +        assert(partCols.isEmpty)
    +      }
    +    }
    +  }
    +
    +  test("Create non-partitioned data source table with user specified schema") {
    +    import testImplicits._
    +    val df = sparkContext.parallelize(1 to 10).map(i => (i, i.toString)).toDF("num", "str")
    +
    +    // Case 1: with user-specified partitioning columns: Option("num")
    +    // Case 2: without partitioning columns: None
    +    Seq(Option("num"), None).foreach { partitionCols =>
    +      withTempPath { pathToNonPartitionedTable =>
    +        df.write.format("parquet").save(pathToNonPartitionedTable.getCanonicalPath)
    +        val (tableSchema, partCols) =
    +          createDataSourceTable(
    +            pathToNonPartitionedTable,
    +            userSpecifiedSchema = Option("num int, str string"),
    +            userSpecifiedPartitionCols = partitionCols)
    +        assert(tableSchema ==
    +          StructType(StructField("num", IntegerType, nullable = true) ::
    +            StructField("str", StringType, nullable = true) :: Nil))
    +        assert(partCols.mkString(", ") == partitionCols.getOrElse(""))
    +      }
    +    }
    +  }
    +
    +  test("Describe Table with Corrupted Schema") {
    +    import testImplicits._
    +
    +    val tabName = "tab1"
    +    withTempPath { dir =>
    +      val path = dir.getCanonicalPath
    +      val df = sparkContext.parallelize(1 to 10).map(i => (i, i.toString)).toDF("col1", "col2")
    +      df.write.format("json").save(path)
    +
    +      withTable(tabName) {
    +        sql(
    +          s"""
    +             |CREATE TABLE $tabName
    +             |USING json
    +             |OPTIONS (
    +             |  path '$path'
    +             |)
    +           """.stripMargin)
    +
    +        val catalog = spark.sessionState.catalog
    +        val table = catalog.getTableMetadata(TableIdentifier(tabName))
    +        val newProperties = table.properties.filterKeys(key =>
    +          key != CreateDataSourceTableUtils.DATASOURCE_SCHEMA_NUMPARTS)
    +        val newTable = table.copy(properties = newProperties)
    +        catalog.alterTable(newTable)
    +
    +        val e = intercept[AnalysisException] {
    +          sql(s"DESC $tabName")
    +        }.getMessage
    +        assert(e.contains(s"Could not read schema from the metastore because it is corrupted"))
    +      }
    +    }
    +  }
    +
    +  test("Refresh table after changing the data source table partitioning") {
    +    import testImplicits._
    +
    +    val tabName = "tab1"
    +    val catalog = spark.sessionState.catalog
    +    withTempPath { dir =>
    +      val path = dir.getCanonicalPath
    +      val df = sparkContext.parallelize(1 to 10).map(i => (i, i.toString, i, i))
    +        .toDF("col1", "col2", "col3", "col4")
    +      df.write.format("json").partitionBy("col1", "col3").save(path)
    +      val schema = StructType(
    +        StructField("col2", StringType, nullable = true) ::
    +        StructField("col4", LongType, nullable = true) ::
    +        StructField("col1", IntegerType, nullable = true) ::
    +        StructField("col3", IntegerType, nullable = true) :: Nil)
    +      val partitionCols = Seq("col1", "col3")
    +
    +      // Ensure the schema is split to multiple properties.
    +      withSQLConf(SQLConf.SCHEMA_STRING_LENGTH_THRESHOLD.key -> "1") {
    --- End diff --
    
    why this?
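    (For context: `SCHEMA_STRING_LENGTH_THRESHOLD` caps how long a single schema property value may be, so setting it to 1 forces the schema JSON to be split across numbered part properties, exercising the multi-part read path. A simplified sketch of the split/merge scheme; the constant names come from the diffs above, while the literal key values are written out here as assumptions:)
    
    ```scala
    val DATASOURCE_SCHEMA_NUMPARTS = "spark.sql.sources.schema.numParts"
    val DATASOURCE_SCHEMA_PART_PREFIX = "spark.sql.sources.schema.part."
    
    // Split a long schema JSON string into chunks of at most `threshold` characters.
    def splitSchemaJson(json: String, threshold: Int): Map[String, String] = {
      val parts = json.grouped(threshold).toSeq
      val indexed = parts.zipWithIndex.map { case (part, i) =>
        s"$DATASOURCE_SCHEMA_PART_PREFIX$i" -> part
      }
      (indexed :+ (DATASOURCE_SCHEMA_NUMPARTS -> parts.length.toString)).toMap
    }
    
    // Reassemble: read numParts, then stitch the parts back together in order.
    def mergeSchemaJson(props: Map[String, String]): String = {
      val numParts = props(DATASOURCE_SCHEMA_NUMPARTS).toInt
      (0 until numParts).map(i => props(s"$DATASOURCE_SCHEMA_PART_PREFIX$i")).mkString
    }
    ```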




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Sche...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71382426
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala ---
    @@ -223,6 +223,9 @@ abstract class Catalog {
        * If this table is cached as an InMemoryRelation, drop the original cached version and make the
        * new version cached lazily.
        *
    +   * If the table's schema is inferred at runtime, infer the schema again and update the schema
    --- End diff --
    
    @rxin @cloud-fan  I see. Will make a change
    
    FYI, this will change the existing external behavior. 




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71470907
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
    @@ -522,31 +522,31 @@ object DDLUtils {
         table.partitionColumns.nonEmpty || table.properties.contains(DATASOURCE_SCHEMA_NUMPARTCOLS)
       }
     
    -  // A persisted data source table may not store its schema in the catalog. In this case, its schema
    -  // will be inferred at runtime when the table is referenced.
    -  def getSchemaFromTableProperties(metadata: CatalogTable): Option[StructType] = {
    +  // A persisted data source table always store its schema in the catalog.
    +  def getSchemaFromTableProperties(metadata: CatalogTable): StructType = {
         require(isDatasourceTable(metadata))
    +    val msgSchemaCorrupted = "Could not read schema from the metastore because it is corrupted."
         val props = metadata.properties
         if (props.isDefinedAt(DATASOURCE_SCHEMA)) {
    --- End diff --
    
    Sure, let me change it. Thanks!




[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    **[Test build #62552 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62552/consoleFull)** for PR 14207 at commit [`a043ca2`](https://github.com/apache/spark/commit/a043ca28fc06082bc8b4104d9b38f2fbf1aa337a).




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Sche...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71085723
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
    @@ -487,6 +487,10 @@ object DDLUtils {
         isDatasourceTable(table.properties)
       }
     
    +  def isSchemaInferred(table: CatalogTable): Boolean = {
    +    table.properties.get(DATASOURCE_SCHEMA_TYPE) == Option(SchemaType.INFERRED.name)
    --- End diff --
    
    Please don't use `contains`. It makes it much harder to read and to
    understand that the return type is an Option.
    
    On Sunday, July 17, 2016, Jacek Laskowski <no...@github.com> wrote:
    
    > In
    > sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala
    > <https://github.com/apache/spark/pull/14207#discussion_r71083304>:
    >
    > > @@ -487,6 +487,10 @@ object DDLUtils {
    > >      isDatasourceTable(table.properties)
    > >    }
    > >
    > > +  def isSchemaInferred(table: CatalogTable): Boolean = {
    > > +    table.properties.get(DATASOURCE_SCHEMA_TYPE) == Option(SchemaType.INFERRED.name)
    >
    > Consider contains.





[GitHub] spark pull request #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Sche...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71266596
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
    @@ -487,6 +487,10 @@ object DDLUtils {
         isDatasourceTable(table.properties)
       }
     
    +  def isSchemaInferred(table: CatalogTable): Boolean = {
    +    table.properties.get(DATASOURCE_SCHEMA_TYPE) == Option(SchemaType.INFERRED.name)
    --- End diff --
    
    Thanks! @rxin @jaceklaskowski 
    
    I will not change it, because using `contains` does not compile on Scala 2.10.
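    (Background: `Option.contains` was only added in Scala 2.11. A tiny sketch of the equivalent spellings:)
    
    ```scala
    val schemaType: Option[String] = Some("INFERRED")
    
    // Scala 2.11+ only; does not compile on Scala 2.10:
    // schemaType.contains("INFERRED")
    
    // 2.10-compatible equivalents:
    val viaEquals = schemaType == Option("INFERRED")   // the form kept in this PR
    val viaExists = schemaType.exists(_ == "INFERRED")
    
    assert(viaEquals && viaExists)
    ```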




[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62914/
    Test FAILed.




[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r73940848
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
    @@ -521,31 +521,29 @@ object DDLUtils {
         table.partitionColumns.nonEmpty || table.properties.contains(DATASOURCE_SCHEMA_NUMPARTCOLS)
       }
     
    -  // A persisted data source table may not store its schema in the catalog. In this case, its schema
    -  // will be inferred at runtime when the table is referenced.
    -  def getSchemaFromTableProperties(metadata: CatalogTable): Option[StructType] = {
    +  // A persisted data source table always store its schema in the catalog.
    +  def getSchemaFromTableProperties(metadata: CatalogTable): StructType = {
         require(isDatasourceTable(metadata))
    +    val msgSchemaCorrupted = "Could not read schema from the metastore because it is corrupted."
         val props = metadata.properties
    -    if (props.isDefinedAt(DATASOURCE_SCHEMA)) {
    +    props.get(DATASOURCE_SCHEMA).map { schema =>
           // Originally, we used spark.sql.sources.schema to store the schema of a data source table.
           // After SPARK-6024, we removed this flag.
           // Although we are not using spark.sql.sources.schema any more, we need to still support.
    -      props.get(DATASOURCE_SCHEMA).map(DataType.fromJson(_).asInstanceOf[StructType])
    -    } else {
    -      metadata.properties.get(DATASOURCE_SCHEMA_NUMPARTS).map { numParts =>
    +      DataType.fromJson(schema).asInstanceOf[StructType]
    +    } getOrElse {
    +      props.get(DATASOURCE_SCHEMA_NUMPARTS).map { numParts =>
             val parts = (0 until numParts.toInt).map { index =>
               val part = metadata.properties.get(s"$DATASOURCE_SCHEMA_PART_PREFIX$index").orNull
               if (part == null) {
    -            throw new AnalysisException(
    -              "Could not read schema from the metastore because it is corrupted " +
    -                s"(missing part $index of the schema, $numParts parts are expected).")
    +            throw new AnalysisException(msgSchemaCorrupted +
    +              s" (missing part $index of the schema, $numParts parts are expected).")
               }
    -
               part
             }
             // Stick all parts back to a single schema string.
             DataType.fromJson(parts.mkString).asInstanceOf[StructType]
    -      }
    +      } getOrElse(throw new AnalysisException(msgSchemaCorrupted))
    --- End diff --
    
    ah, this `getOrElse` is too far from the `get(DATASOURCE_SCHEMA)`... Actually, I prefer the `if/else`.
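    (A sketch of the if/else shape being suggested, assuming the `DATASOURCE_*` constants are in scope and the method lives inside the `sql` package so `AnalysisException` can be constructed; error messages shortened:)
    
    ```scala
    import org.apache.spark.sql.AnalysisException
    import org.apache.spark.sql.types.{DataType, StructType}
    
    def getSchemaFromTableProperties(props: Map[String, String]): StructType = {
      val corrupted = "Could not read schema from the metastore because it is corrupted."
      if (props.contains(DATASOURCE_SCHEMA)) {
        // Legacy layout: the whole schema JSON lives in one property (pre SPARK-6024).
        DataType.fromJson(props(DATASOURCE_SCHEMA)).asInstanceOf[StructType]
      } else if (props.contains(DATASOURCE_SCHEMA_NUMPARTS)) {
        val numParts = props(DATASOURCE_SCHEMA_NUMPARTS).toInt
        val parts = (0 until numParts).map { i =>
          props.getOrElse(s"$DATASOURCE_SCHEMA_PART_PREFIX$i",
            throw new AnalysisException(corrupted + s" (missing part $i of the schema)"))
        }
        DataType.fromJson(parts.mkString).asInstanceOf[StructType]
      } else {
        throw new AnalysisException(corrupted)
      }
    }
    ```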




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71471136
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
    @@ -95,17 +95,41 @@ case class CreateDataSourceTableCommand(
           }
     
         // Create the relation to validate the arguments before writing the metadata to the metastore.
    -    DataSource(
    -      sparkSession = sparkSession,
    -      userSpecifiedSchema = userSpecifiedSchema,
    -      className = provider,
    -      bucketSpec = None,
    -      options = optionsWithPath).resolveRelation(checkPathExist = false)
    +    val dataSource: BaseRelation =
    +      DataSource(
    +        sparkSession = sparkSession,
    +        userSpecifiedSchema = userSpecifiedSchema,
    +        className = provider,
    +        bucketSpec = None,
    +        options = optionsWithPath).resolveRelation(checkPathExist = false)
    +
    +    val partitionColumns =
    --- End diff --
    
    IIUC, the logic should be: if the schema is specified, use the given partition columns; otherwise, infer them. It may be clearer to write:
    ```
    val partitionColumns = if (userSpecifiedSchema.isEmpty) {
      if (userSpecifiedPartitionColumns.length > 0) {
        ...
      }
      dataSource match {
        ...
      }
    } else {
      userSpecifiedPartitionColumns
    }
    ```




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71456601
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
    @@ -518,6 +510,19 @@ case class DescribeTableCommand(table: TableIdentifier, isExtended: Boolean, isF
         }
       }
     
    +  private def describeSchema(
    +      tableDesc: CatalogTable,
    +      buffer: ArrayBuffer[Row]): Unit = {
    +    if (DDLUtils.isDatasourceTable(tableDesc)) {
    +      DDLUtils.getSchemaFromTableProperties(tableDesc) match {
    --- End diff --
    
    For all types of data source tables, we store the schema in the table properties. Thus, we should not return None unless the table properties have been modified by users via the `ALTER TABLE` command.
    
    Sorry, I forgot to update the message.




[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62574/
    Test PASSed.




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r73940468
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
    @@ -95,17 +95,39 @@ case class CreateDataSourceTableCommand(
           }
     
         // Create the relation to validate the arguments before writing the metadata to the metastore.
    -    DataSource(
    -      sparkSession = sparkSession,
    -      userSpecifiedSchema = userSpecifiedSchema,
    -      className = provider,
    -      bucketSpec = None,
    -      options = optionsWithPath).resolveRelation(checkPathExist = false)
    +    val dataSource: BaseRelation =
    +      DataSource(
    +        sparkSession = sparkSession,
    +        userSpecifiedSchema = userSpecifiedSchema,
    +        className = provider,
    +        bucketSpec = None,
    +        options = optionsWithPath).resolveRelation(checkPathExist = false)
    +
    +    val partitionColumns = if (userSpecifiedSchema.nonEmpty) {
    +      userSpecifiedPartitionColumns
    +    } else {
    +      val res = dataSource match {
    +        case r: HadoopFsRelation => r.partitionSchema.fieldNames
    +        case _ => Array.empty[String]
    +      }
    +      if (userSpecifiedPartitionColumns.length > 0) {
    +        // The table does not have a specified schema, which means that the schema will be inferred
    +        // when we load the table. So, we are not expecting partition columns and we will discover
    +        // partitions when we load the table. However, if there are specified partition columns,
    +        // we simply ignore them and provide a warning message.
    +        logWarning(
    +          s"Specified partition columns (${userSpecifiedPartitionColumns.mkString(",")}) will be " +
    +            s"ignored. The schema and partition columns of table $tableIdent are inferred. " +
    +            s"Schema: ${dataSource.schema.simpleString}; " +
    +            s"Partition columns: ${res.mkString("(", ", ", ")")}")
    +      }
    +      res
    +    }
     
         CreateDataSourceTableUtils.createDataSourceTable(
           sparkSession = sparkSession,
           tableIdent = tableIdent,
    -      userSpecifiedSchema = userSpecifiedSchema,
    +      schema = dataSource.schema,
    --- End diff --
    
    seems we should still use the user-specified schema, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Sche...

Posted by jaceklaskowski <gi...@git.apache.org>.
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71083334
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala ---
    @@ -351,6 +353,44 @@ class CatalogImpl(sparkSession: SparkSession) extends Catalog {
       }
     
       /**
    +   * Refresh the inferred schema stored in the external catalog for data source tables.
    +   */
    +  private def refreshInferredSchema(tableIdent: TableIdentifier): Unit = {
    +    val table = sessionCatalog.getTableMetadataOption(tableIdent)
    +    table.foreach { tableDesc =>
    +      if (DDLUtils.isDatasourceTable(tableDesc) && DDLUtils.isSchemaInferred(tableDesc)) {
    +        val partitionColumns = DDLUtils.getPartitionColumnsFromTableProperties(tableDesc)
    +        val bucketSpec = DDLUtils.getBucketSpecFromTableProperties(tableDesc)
    +        val dataSource =
    +          DataSource(
    +            sparkSession,
    +            userSpecifiedSchema = None,
    +            partitionColumns = partitionColumns,
    +            bucketSpec = bucketSpec,
    +            className = tableDesc.properties(CreateDataSourceTableUtils.DATASOURCE_PROVIDER),
    +            options = tableDesc.storage.serdeProperties)
    +            .resolveRelation().asInstanceOf[HadoopFsRelation]
    +
    +        val schemaProperties = new mutable.HashMap[String, String]
    +        CreateDataSourceTableUtils.saveSchema(
    +          sparkSession, dataSource.schema, dataSource.partitionSchema.fieldNames, schemaProperties)
    +
    +        val tablePropertiesWithoutSchema = tableDesc.properties.filterKeys { k =>
    +          // Keep the properties that are not for schema or partition columns
    +          k != CreateDataSourceTableUtils.DATASOURCE_SCHEMA_NUMPARTS &&
    --- End diff --
    
    It's hard to know what the code's doing inside `filterKeys` -- consider creating a predicate function with a proper name.
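    For instance, a sketch of that refactoring (the predicate name is illustrative, the `CreateDataSourceTableUtils` constants are assumed in scope, and the key set should match the original filter exactly):
    
    ```scala
    private def isSchemaOrPartitionProperty(key: String): Boolean = {
      key == DATASOURCE_SCHEMA_NUMPARTS ||
        key == DATASOURCE_SCHEMA_NUMPARTCOLS ||
        key.startsWith(DATASOURCE_SCHEMA_PART_PREFIX)
    }
    
    val tablePropertiesWithoutSchema =
      tableDesc.properties.filterKeys(key => !isSchemaOrPartitionProperty(key))
    ```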




[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62632/
    Test PASSed.




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71453615
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
    @@ -252,6 +252,115 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
         }
       }
     
    +  test("Create data source table with partitioning columns but no schema") {
    +    import testImplicits._
    +
    +    val tabName = "tab1"
    +    withTempPath { dir =>
    +      val pathToPartitionedTable = new File(dir, "partitioned")
    +      val pathToNonPartitionedTable = new File(dir, "nonPartitioned")
    +      val df = sparkContext.parallelize(1 to 10).map(i => (i, i.toString)).toDF("num", "str")
    +      df.write.format("parquet").save(pathToNonPartitionedTable.getCanonicalPath)
    +      df.write.format("parquet").partitionBy("num").save(pathToPartitionedTable.getCanonicalPath)
    +
    +      Seq(pathToPartitionedTable, pathToNonPartitionedTable).foreach { path =>
    +        withTable(tabName) {
    +          spark.sql(
    +            s"""
    +               |CREATE TABLE $tabName
    +               |USING parquet
    +               |OPTIONS (
    +               |  path '$path'
    +               |)
    +               |PARTITIONED BY (inexistentColumns)
    +             """.stripMargin)
    +          val catalog = spark.sessionState.catalog
    +          val tableMetadata = catalog.getTableMetadata(TableIdentifier(tabName))
    +
    +          val tableSchema = DDLUtils.getSchemaFromTableProperties(tableMetadata)
    +          assert(tableSchema.nonEmpty, "the schema of data source tables are always recorded")
    +          val partCols = DDLUtils.getPartitionColumnsFromTableProperties(tableMetadata)
    +
    +          if (tableMetadata.storage.serdeProperties.get("path") ==
    --- End diff --
    
    how could this condition be false?




[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    **[Test build #62513 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62513/consoleFull)** for PR 14207 at commit [`55c2c5e`](https://github.com/apache/spark/commit/55c2c5e2623478a79971af3b0513727b03c1ee87).




[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62344/
    Test PASSed.




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r72378623
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
    @@ -252,6 +252,209 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
         }
       }
     
    +  private def createDataSourceTable(
    +      path: File,
    +      userSpecifiedSchema: Option[String],
    +      userSpecifiedPartitionCols: Option[String]): (StructType, Seq[String]) = {
    --- End diff --
    
    how about we pass in the expected schema and partCols, and do the check in this method?
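    A sketch of what that could look like, reusing the helpers already in scope in `DDLSuite` (illustrative only):
    
    ```scala
    private def assertCreatedTable(
        path: File,
        userSpecifiedSchema: Option[String],
        userSpecifiedPartitionCols: Option[String],
        expectedSchema: StructType,
        expectedPartitionCols: Seq[String]): Unit = {
      val tabName = "tab1"
      withTable(tabName) {
        val partitionClause =
          userSpecifiedPartitionCols.map(p => s"PARTITIONED BY ($p)").getOrElse("")
        val schemaClause = userSpecifiedSchema.map(s => s"($s)").getOrElse("")
        sql(
          s"""
             |CREATE TABLE $tabName $schemaClause
             |USING parquet
             |OPTIONS (path '$path')
             |$partitionClause
           """.stripMargin)
        val metadata = spark.sessionState.catalog.getTableMetadata(TableIdentifier(tabName))
        assert(DDLUtils.getSchemaFromTableProperties(metadata) == expectedSchema)
        assert(DDLUtils.getPartitionColumnsFromTableProperties(metadata) == expectedPartitionCols)
      }
    }
    ```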




[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62647/
    Test PASSed.




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r73968228
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
    @@ -95,17 +95,39 @@ case class CreateDataSourceTableCommand(
           }
     
         // Create the relation to validate the arguments before writing the metadata to the metastore.
    -    DataSource(
    -      sparkSession = sparkSession,
    -      userSpecifiedSchema = userSpecifiedSchema,
    -      className = provider,
    -      bucketSpec = None,
    -      options = optionsWithPath).resolveRelation(checkPathExist = false)
    +    val dataSource: BaseRelation =
    +      DataSource(
    +        sparkSession = sparkSession,
    +        userSpecifiedSchema = userSpecifiedSchema,
    +        className = provider,
    +        bucketSpec = None,
    +        options = optionsWithPath).resolveRelation(checkPathExist = false)
    +
    +    val partitionColumns = if (userSpecifiedSchema.nonEmpty) {
    +      userSpecifiedPartitionColumns
    +    } else {
    +      val res = dataSource match {
    +        case r: HadoopFsRelation => r.partitionSchema.fieldNames
    +        case _ => Array.empty[String]
    +      }
    +      if (userSpecifiedPartitionColumns.length > 0) {
    +        // The table does not have a specified schema, which means that the schema will be inferred
    +        // when we load the table. So, we are not expecting partition columns and we will discover
    +        // partitions when we load the table. However, if there are specified partition columns,
    +        // we simply ignore them and provide a warning message.
    +        logWarning(
    +          s"Specified partition columns (${userSpecifiedPartitionColumns.mkString(",")}) will be " +
    +            s"ignored. The schema and partition columns of table $tableIdent are inferred. " +
    +            s"Schema: ${dataSource.schema.simpleString}; " +
    +            s"Partition columns: ${res.mkString("(", ", ", ")")}")
    +      }
    +      res
    +    }
     
         CreateDataSourceTableUtils.createDataSourceTable(
           sparkSession = sparkSession,
           tableIdent = tableIdent,
    -      userSpecifiedSchema = userSpecifiedSchema,
    +      schema = dataSource.schema,
    --- End diff --
    
    Here, `dataSource.schema` could be inferred. Previously, we did not store the inferred schema; after this PR, we do, and thus we use `dataSource.schema`.
    
    Actually, after re-checking the code, I found the schema might be adjusted slightly even when users specify it. For example, the nullability could be changed: https://github.com/apache/spark/blob/64529b186a1c33740067cc7639d630bc5b9ae6e8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L407
    
    I think we should make such a change, but maybe we should test and log it?
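    A minimal illustration of that nullability relaxation (the exact mechanism in `DataSource.scala` may differ):
    
    ```scala
    import org.apache.spark.sql.types._
    
    val userSpecified = new StructType()
      .add("num", IntegerType, nullable = false)
      .add("str", StringType)
    
    // Forcing every field to be nullable, roughly what the linked code does.
    val adjusted = StructType(userSpecified.map(_.copy(nullable = true)))
    
    assert(adjusted("num").nullable)  // differs from what the user declared
    ```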





[GitHub] spark pull request #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Sche...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71384819
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala ---
    @@ -223,6 +223,9 @@ abstract class Catalog {
        * If this table is cached as an InMemoryRelation, drop the original cached version and make the
        * new version cached lazily.
        *
    +   * If the table's schema is inferred at runtime, infer the schema again and update the schema
    --- End diff --
    
    Thanks! 




[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    @yhuai Forgot to change the PR description. For data source tables, the schema will no longer be inferred and refreshed by `REFRESH TABLE`. This is based on the comment: https://github.com/apache/spark/pull/14207#discussion_r71380691




[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    **[Test build #62512 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62512/consoleFull)** for PR 14207 at commit [`c6afbbb`](https://github.com/apache/spark/commit/c6afbbb9941113d6a78bfd3aaa627653ba0f6151).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    I think it is not clear what problem this PR tries to solve. It just says it proposes to save the inferred schema in the external catalog.


[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Merged build finished. Test PASSed.


[GitHub] spark pull request #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Sche...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71380691
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala ---
    @@ -223,6 +223,9 @@ abstract class Catalog {
        * If this table is cached as an InMemoryRelation, drop the original cached version and make the
        * new version cached lazily.
        *
    +   * If the table's schema is inferred at runtime, infer the schema again and update the schema
    --- End diff --
    
    refreshTable shouldn't run schema inference. Only run schema inference when creating the table.
    
    And don't make this a config flag. Just run schema inference when creating the table. For managed tables, store the schema explicitly. Users must explicitly change it.
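    For concreteness, a minimal sketch of the flow this comment asks for: infer once at CREATE TABLE time, persist the result, and leave refreshTable as pure cache invalidation. `TableDesc`, `createTable`, and `inferSchema` are illustrative placeholders rather than Spark's internal API; only the property key echoes the `spark.sql.sources.schema` convention.
    
    ```
    case class TableDesc(name: String, properties: Map[String, String])

    def createTable(
        name: String,
        userSpecifiedSchema: Option[String],
        inferSchema: () => String): TableDesc = {
      // Inference happens exactly once, at creation time.
      val schemaJson = userSpecifiedSchema.getOrElse(inferSchema())
      TableDesc(name, Map("spark.sql.sources.schema" -> schemaJson))
    }

    def refreshTable(desc: TableDesc): TableDesc = {
      // No inference here: refresh only invalidates cached data/metadata;
      // the schema stored at creation time is kept as-is.
      desc
    }
    ```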



[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    **[Test build #62914 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62914/consoleFull)** for PR 14207 at commit [`6492e98`](https://github.com/apache/spark/commit/6492e98f80aae6d95b5683c4b2a1aab8e3edb94d).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62580/
    Test PASSed.


[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    @gatorsmile Yea. I meant that since you use the stored schema instead of re-inferring it for the table, when the data/files are changed by an external system (e.g., appended by a streaming system), the stored schema can become inconsistent with the actual schema of the data.


[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    @viirya The problem it tries to resolve is from the comment of @rxin in another PR: https://github.com/apache/spark/pull/14148#issuecomment-232273833


[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71633776
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
    @@ -252,6 +252,222 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
         }
       }
     
    +  private def createDataSourceTable(
    +      path: File,
    +      userSpecifiedSchema: Option[String],
    +      userSpecifiedPartitionCols: Option[String]): (StructType, Seq[String]) = {
    +    var tableSchema = StructType(Nil)
    +    var partCols = Seq.empty[String]
    +
    +    val tabName = "tab1"
    +    withTable(tabName) {
    +      val partitionClause =
    +        userSpecifiedPartitionCols.map(p => s"PARTITIONED BY ($p)").getOrElse("")
    +      val schemaClause = userSpecifiedSchema.map(s => s"($s)").getOrElse("")
    +      sql(
    +        s"""
    +           |CREATE TABLE $tabName $schemaClause
    +           |USING parquet
    +           |OPTIONS (
    +           |  path '$path'
    +           |)
    +           |$partitionClause
    +         """.stripMargin)
    +      val tableMetadata = spark.sessionState.catalog.getTableMetadata(TableIdentifier(tabName))
    +
    +      tableSchema = DDLUtils.getSchemaFromTableProperties(tableMetadata)
    +      partCols = DDLUtils.getPartitionColumnsFromTableProperties(tableMetadata)
    +    }
    +    (tableSchema, partCols)
    +  }
    +
    +  test("Create partitioned data source table without user specified schema") {
    +    import testImplicits._
    +    val df = sparkContext.parallelize(1 to 10).map(i => (i, i.toString)).toDF("num", "str")
    +
    +    // Case 1: with partitioning columns but no schema: Option("inexistentColumns")
    +    // Case 2: without schema and partitioning columns: None
    +    Seq(Option("inexistentColumns"), None).foreach { partitionCols =>
    +      withTempPath { pathToPartitionedTable =>
    +        df.write.format("parquet").partitionBy("num")
    +          .save(pathToPartitionedTable.getCanonicalPath)
    +        val (tableSchema, partCols) =
    +          createDataSourceTable(
    +            pathToPartitionedTable,
    +            userSpecifiedSchema = None,
    +            userSpecifiedPartitionCols = partitionCols)
    +        assert(tableSchema ==
    +          StructType(StructField("str", StringType, nullable = true) ::
    --- End diff --
    
    Sure, will do. 


[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71470616
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
    @@ -252,6 +252,165 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
         }
       }
     
    +  test("Create partitioned data source table with partitioning columns but no schema") {
    +    import testImplicits._
    +
    +    withTempPath { dir =>
    +      val pathToPartitionedTable = new File(dir, "partitioned")
    +      val df = sparkContext.parallelize(1 to 10).map(i => (i, i.toString)).toDF("num", "str")
    +      df.write.format("parquet").partitionBy("num").save(pathToPartitionedTable.getCanonicalPath)
    +      val tabName = "tab1"
    +      withTable(tabName) {
    +        spark.sql(
    +          s"""
    +             |CREATE TABLE $tabName
    +             |USING parquet
    +             |OPTIONS (
    +             |  path '$pathToPartitionedTable'
    +             |)
    +             |PARTITIONED BY (inexistentColumns)
    +           """.stripMargin)
    +        val tableMetadata = spark.sessionState.catalog.getTableMetadata(TableIdentifier(tabName))
    --- End diff --
    
    we can abstract common logic into some methods, to remove duplicated code a bit.


[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r72385549
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
    @@ -252,6 +252,209 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
         }
       }
     
    +  private def createDataSourceTable(
    +      path: File,
    +      userSpecifiedSchema: Option[String],
    +      userSpecifiedPartitionCols: Option[String]): (StructType, Seq[String]) = {
    --- End diff --
    
    Sure, will do. Thanks!


[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    **[Test build #62574 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62574/consoleFull)** for PR 14207 at commit [`e930819`](https://github.com/apache/spark/commit/e93081918b170d3fbd08d992ef251c83af9e433d).


[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Does it mean that if users do not issue a refresh when the table location is changed, the schema will be wrong when Spark is restarted?


[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    @viirya Schema inference is time-consuming, especially when the number of files is huge. Thus, we should avoid re-inferring it every time. That is one of the major reasons why we have a metadata cache for data source tables.
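    To illustrate the trade-off (a sketch only, not Spark's actual cache): infer once, then reuse the cached result for subsequent metadata requests. The names here are hypothetical.
    
    ```
    import scala.collection.mutable
    import org.apache.spark.sql.types.StructType

    // Memoized lookup: `infer` runs at most once per table name.
    val schemaCache = mutable.Map.empty[String, StructType]

    def schemaOf(table: String, infer: () => StructType): StructType =
      schemaCache.getOrElseUpdate(table, infer())
    ```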


[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62810/
    Test PASSed.


[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71457330
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
    @@ -518,6 +510,19 @@ case class DescribeTableCommand(table: TableIdentifier, isExtended: Boolean, isF
         }
       }
     
    +  private def describeSchema(
    +      tableDesc: CatalogTable,
    +      buffer: ArrayBuffer[Row]): Unit = {
    +    if (DDLUtils.isDatasourceTable(tableDesc)) {
    +      DDLUtils.getSchemaFromTableProperties(tableDesc) match {
    --- End diff --
    
    Now, the message is changed to `"# Schema of this table is corrupted"`


[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71453222
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
    @@ -95,17 +95,38 @@ case class CreateDataSourceTableCommand(
           }
     
         // Create the relation to validate the arguments before writing the metadata to the metastore.
    -    DataSource(
    -      sparkSession = sparkSession,
    -      userSpecifiedSchema = userSpecifiedSchema,
    -      className = provider,
    -      bucketSpec = None,
    -      options = optionsWithPath).resolveRelation(checkPathExist = false)
    +    val dataSource: HadoopFsRelation =
    +      DataSource(
    +        sparkSession = sparkSession,
    +        userSpecifiedSchema = userSpecifiedSchema,
    +        className = provider,
    +        bucketSpec = None,
    +        options = optionsWithPath)
    +        .resolveRelation(checkPathExist = false).asInstanceOf[HadoopFsRelation]
    --- End diff --
    
    is it safe to cast it to `HadoopFsRelation`?


[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    The table location is not allowed to change. Right?
    
    With the changes in this PR, if changes to the data/files (pointed to by the table location) affect the table schema, users need to manually run the `REFRESH` command. Restarting Spark will not cause schema changes.
    
    Before this PR, if users restarted Spark or the corresponding cache item was evicted, the table schema could change without notice. This could be a potential issue when reads and writes are conducted in parallel, and this undocumented behavior could complicate Spark applications.
    
    Such unexpected changes should be avoided. If the schema has changed and table reads should pick up the new schema, users should manually issue the `REFRESH` command.
    



[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    **[Test build #62810 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62810/consoleFull)** for PR 14207 at commit [`264ad35`](https://github.com/apache/spark/commit/264ad35a1a749e14f8d8a33e4977cddda0916204).


[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    **[Test build #62628 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62628/consoleFull)** for PR 14207 at commit [`b404eec`](https://github.com/apache/spark/commit/b404eecfd69dd73124157b339d6d68939ad040aa).


[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71460333
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
    @@ -252,6 +252,115 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
         }
       }
     
    +  test("Create data source table with partitioning columns but no schema") {
    +    import testImplicits._
    +
    +    val tabName = "tab1"
    +    withTempPath { dir =>
    +      val pathToPartitionedTable = new File(dir, "partitioned")
    +      val pathToNonPartitionedTable = new File(dir, "nonPartitioned")
    +      val df = sparkContext.parallelize(1 to 10).map(i => (i, i.toString)).toDF("num", "str")
    +      df.write.format("parquet").save(pathToNonPartitionedTable.getCanonicalPath)
    +      df.write.format("parquet").partitionBy("num").save(pathToPartitionedTable.getCanonicalPath)
    +
    +      Seq(pathToPartitionedTable, pathToNonPartitionedTable).foreach { path =>
    +        withTable(tabName) {
    +          spark.sql(
    +            s"""
    +               |CREATE TABLE $tabName
    +               |USING parquet
    +               |OPTIONS (
    +               |  path '$path'
    +               |)
    +               |PARTITIONED BY (inexistentColumns)
    +             """.stripMargin)
    +          val catalog = spark.sessionState.catalog
    +          val tableMetadata = catalog.getTableMetadata(TableIdentifier(tabName))
    +
    +          val tableSchema = DDLUtils.getSchemaFromTableProperties(tableMetadata)
    +          assert(tableSchema.nonEmpty, "the schema of data source tables are always recorded")
    +          val partCols = DDLUtils.getPartitionColumnsFromTableProperties(tableMetadata)
    +
    +          if (tableMetadata.storage.serdeProperties.get("path") ==
    --- End diff --
    
    Ok, no problem


[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62513/
    Test PASSed.


[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Merged build finished. Test PASSed.


[GitHub] spark pull request #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Sche...

Posted by jaceklaskowski <gi...@git.apache.org>.
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71083304
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
    @@ -487,6 +487,10 @@ object DDLUtils {
         isDatasourceTable(table.properties)
       }
     
    +  def isSchemaInferred(table: CatalogTable): Boolean = {
    +    table.properties.get(DATASOURCE_SCHEMA_TYPE) == Option(SchemaType.INFERRED.name)
    --- End diff --
    
    Consider `contains`.
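    If the property lookup returns an `Option[String]`, the suggestion presumably amounts to the one-liner below (`Option.contains` exists since Scala 2.11; the key and value strings are assumptions based on the diff, not the patch's constants):
    
    ```
    def isSchemaInferred(properties: Map[String, String]): Boolean =
      // Compare inside the Option instead of building a second Option.
      properties.get("spark.sql.sources.schema.type").contains("INFERRED")
    ```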


[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r73965912
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
    @@ -95,17 +95,39 @@ case class CreateDataSourceTableCommand(
           }
     
         // Create the relation to validate the arguments before writing the metadata to the metastore.
    -    DataSource(
    -      sparkSession = sparkSession,
    -      userSpecifiedSchema = userSpecifiedSchema,
    -      className = provider,
    -      bucketSpec = None,
    -      options = optionsWithPath).resolveRelation(checkPathExist = false)
    +    val dataSource: BaseRelation =
    +      DataSource(
    +        sparkSession = sparkSession,
    +        userSpecifiedSchema = userSpecifiedSchema,
    +        className = provider,
    +        bucketSpec = None,
    +        options = optionsWithPath).resolveRelation(checkPathExist = false)
    +
    +    val partitionColumns = if (userSpecifiedSchema.nonEmpty) {
    +      userSpecifiedPartitionColumns
    +    } else {
    +      val res = dataSource match {
    +        case r: HadoopFsRelation => r.partitionSchema.fieldNames
    +        case _ => Array.empty[String]
    +      }
    +      if (userSpecifiedPartitionColumns.length > 0) {
    --- End diff --
    
    Here, I just keep the existing behavior. 
    
    To be honest, I think we should throw an exception whenever it makes sense; it sounds like the job log is not read by most users. Will submit a follow-up PR to make that change. Thanks!
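    The follow-up behavior floated here might look like the sketch below: fail fast instead of silently dropping user-specified partition columns when the schema is inferred. The condition and message are assumptions about the follow-up, not the merged change.
    
    ```
    def checkPartitionColumns(
        userSpecifiedSchema: Option[String],
        userSpecifiedPartitionColumns: Seq[String]): Unit = {
      if (userSpecifiedSchema.isEmpty && userSpecifiedPartitionColumns.nonEmpty) {
        // Inside Spark this would presumably be an AnalysisException; a plain
        // exception keeps the sketch self-contained.
        throw new IllegalArgumentException(
          "It is not allowed to specify partition columns when the table schema " +
            "is not defined; they will be inferred from the data.")
      }
    }
    ```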


[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Merged build finished. Test FAILed.


[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71460325
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
    @@ -518,6 +510,19 @@ case class DescribeTableCommand(table: TableIdentifier, isExtended: Boolean, isF
         }
       }
     
    +  private def describeSchema(
    +      tableDesc: CatalogTable,
    +      buffer: ArrayBuffer[Row]): Unit = {
    +    if (DDLUtils.isDatasourceTable(tableDesc)) {
    +      DDLUtils.getSchemaFromTableProperties(tableDesc) match {
    --- End diff --
    
    Sure, will do. 


[GitHub] spark pull request #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Sche...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71262073
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
    @@ -270,6 +291,11 @@ case class CreateDataSourceTableAsSelectCommand(
       }
     }
     
    +case class SchemaType private(name: String)
    +object SchemaType {
    --- End diff --
    
    Sure, will do it.


[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    **[Test build #62914 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62914/consoleFull)** for PR 14207 at commit [`6492e98`](https://github.com/apache/spark/commit/6492e98f80aae6d95b5683c4b2a1aab8e3edb94d).


[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71470370
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
    @@ -522,31 +522,31 @@ object DDLUtils {
         table.partitionColumns.nonEmpty || table.properties.contains(DATASOURCE_SCHEMA_NUMPARTCOLS)
       }
     
    -  // A persisted data source table may not store its schema in the catalog. In this case, its schema
    -  // will be inferred at runtime when the table is referenced.
    -  def getSchemaFromTableProperties(metadata: CatalogTable): Option[StructType] = {
    +  // A persisted data source table always store its schema in the catalog.
    +  def getSchemaFromTableProperties(metadata: CatalogTable): StructType = {
         require(isDatasourceTable(metadata))
    +    val msgSchemaCorrupted = "Could not read schema from the metastore because it is corrupted."
         val props = metadata.properties
         if (props.isDefinedAt(DATASOURCE_SCHEMA)) {
    --- End diff --
    
    how about
    ```
    props.get(DATASOURCE_SCHEMA).map { schema =>
      // ....
      DataType.fromJson(schema).asInstanceOf[StructType]
    }.getOrElse {
      props.get(DATASOURCE_SCHEMA_NUMPARTS).map {
        ....
      }.getOrElse(throw ...)
    }
    ```
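    Filled in, the suggested shape could read roughly as below. The property keys follow the `spark.sql.sources.schema*` naming used for data source tables, but take them (and the plain `RuntimeException`) as assumptions made to keep the sketch self-contained:
    
    ```
    import org.apache.spark.sql.types.{DataType, StructType}

    def getSchemaFromProps(props: Map[String, String]): StructType = {
      val corrupted =
        "Could not read schema from the metastore because it is corrupted."
      props.get("spark.sql.sources.schema").map { schema =>
        // Schema stored whole, as a single JSON string.
        DataType.fromJson(schema).asInstanceOf[StructType]
      }.getOrElse {
        props.get("spark.sql.sources.schema.numParts").map { numParts =>
          // Schema split across numbered parts; reassemble before parsing.
          val parts = (0 until numParts.toInt).map { i =>
            props.getOrElse(s"spark.sql.sources.schema.part.$i",
              throw new RuntimeException(corrupted))
          }
          DataType.fromJson(parts.mkString).asInstanceOf[StructType]
        }.getOrElse(throw new RuntimeException(corrupted))
      }
    }
    ```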


[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    **[Test build #62632 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62632/consoleFull)** for PR 14207 at commit [`224b048`](https://github.com/apache/spark/commit/224b0489917e53116a7122ed8e97b8b7f9af4966).


[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Could you please review it again? @yhuai @liancheng @rxin Thanks!


[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    cc @yhuai @liancheng to take another look


[GitHub] spark issue #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Schemas int...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    **[Test build #62552 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62552/consoleFull)** for PR 14207 at commit [`a043ca2`](https://github.com/apache/spark/commit/a043ca28fc06082bc8b4104d9b38f2fbf1aa337a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71453136
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
    @@ -518,6 +510,19 @@ case class DescribeTableCommand(table: TableIdentifier, isExtended: Boolean, isF
         }
       }
     
    +  private def describeSchema(
    +      tableDesc: CatalogTable,
    +      buffer: ArrayBuffer[Row]): Unit = {
    +    if (DDLUtils.isDatasourceTable(tableDesc)) {
    +      DDLUtils.getSchemaFromTableProperties(tableDesc) match {
    --- End diff --
    
    Now `getSchemaFromTableProperties` should never return None?


[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    **[Test build #62810 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62810/consoleFull)** for PR 14207 at commit [`264ad35`](https://github.com/apache/spark/commit/264ad35a1a749e14f8d8a33e4977cddda0916204).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    **[Test build #62926 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62926/consoleFull)** for PR 14207 at commit [`b694d8b`](https://github.com/apache/spark/commit/b694d8bd3666e54e4d3ab972edcb04f2be64669b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71633897
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
    @@ -252,6 +252,222 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
         }
       }
     
    +  private def createDataSourceTable(
    +      path: File,
    +      userSpecifiedSchema: Option[String],
    +      userSpecifiedPartitionCols: Option[String]): (StructType, Seq[String]) = {
    +    var tableSchema = StructType(Nil)
    +    var partCols = Seq.empty[String]
    +
    +    val tabName = "tab1"
    +    withTable(tabName) {
    +      val partitionClause =
    +        userSpecifiedPartitionCols.map(p => s"PARTITIONED BY ($p)").getOrElse("")
    +      val schemaClause = userSpecifiedSchema.map(s => s"($s)").getOrElse("")
    +      sql(
    +        s"""
    +           |CREATE TABLE $tabName $schemaClause
    +           |USING parquet
    +           |OPTIONS (
    +           |  path '$path'
    +           |)
    +           |$partitionClause
    +         """.stripMargin)
    +      val tableMetadata = spark.sessionState.catalog.getTableMetadata(TableIdentifier(tabName))
    +
    +      tableSchema = DDLUtils.getSchemaFromTableProperties(tableMetadata)
    +      partCols = DDLUtils.getPartitionColumnsFromTableProperties(tableMetadata)
    +    }
    +    (tableSchema, partCols)
    +  }
    +
    +  test("Create partitioned data source table without user specified schema") {
    +    import testImplicits._
    +    val df = sparkContext.parallelize(1 to 10).map(i => (i, i.toString)).toDF("num", "str")
    +
    +    // Case 1: with partitioning columns but no schema: Option("inexistentColumns")
    +    // Case 2: without schema and partitioning columns: None
    +    Seq(Option("inexistentColumns"), None).foreach { partitionCols =>
    +      withTempPath { pathToPartitionedTable =>
    +        df.write.format("parquet").partitionBy("num")
    +          .save(pathToPartitionedTable.getCanonicalPath)
    +        val (tableSchema, partCols) =
    +          createDataSourceTable(
    +            pathToPartitionedTable,
    +            userSpecifiedSchema = None,
    +            userSpecifiedPartitionCols = partitionCols)
    +        assert(tableSchema ==
    +          StructType(StructField("str", StringType, nullable = true) ::
    +            StructField("num", IntegerType, nullable = true) :: Nil))
    +        assert(partCols == Seq("num"))
    +      }
    +    }
    +  }
    +
    +  test("Create partitioned data source table with user specified schema") {
    +    import testImplicits._
    +    val df = sparkContext.parallelize(1 to 10).map(i => (i, i.toString)).toDF("num", "str")
    +
    +    // Case 1: with partitioning columns but no schema: Option("num")
    +    // Case 2: without schema and partitioning columns: None
    +    Seq(Option("num"), None).foreach { partitionCols =>
    +      withTempPath { pathToPartitionedTable =>
    +        df.write.format("parquet").partitionBy("num")
    +          .save(pathToPartitionedTable.getCanonicalPath)
    +        val (tableSchema, partCols) =
    +          createDataSourceTable(
    +            pathToPartitionedTable,
    +            userSpecifiedSchema = Option("num int, str string"),
    +            userSpecifiedPartitionCols = partitionCols)
    +        assert(tableSchema ==
    +          StructType(StructField("num", IntegerType, nullable = true) ::
    +            StructField("str", StringType, nullable = true) :: Nil))
    +        assert(partCols.mkString(", ") == partitionCols.getOrElse(""))
    +      }
    +    }
    +  }
    +
    +  test("Create non-partitioned data source table without user specified schema") {
    +    import testImplicits._
    +    val df = sparkContext.parallelize(1 to 10).map(i => (i, i.toString)).toDF("num", "str")
    +
    +    // Case 1: with partitioning columns but no schema: Option("inexistentColumns")
    +    // Case 2: without schema and partitioning columns: None
    +    Seq(Option("inexistentColumns"), None).foreach { partitionCols =>
    +      withTempPath { pathToNonPartitionedTable =>
    +        df.write.format("parquet").save(pathToNonPartitionedTable.getCanonicalPath)
    +        val (tableSchema, partCols) =
    +          createDataSourceTable(
    +            pathToNonPartitionedTable,
    +            userSpecifiedSchema = None,
    +            userSpecifiedPartitionCols = partitionCols)
    +        assert(tableSchema ==
    +          StructType(StructField("num", IntegerType, nullable = true) ::
    +            StructField("str", StringType, nullable = true) :: Nil))
    +        assert(partCols.isEmpty)
    +      }
    +    }
    +  }
    +
    +  test("Create non-partitioned data source table with user specified schema") {
    +    import testImplicits._
    +    val df = sparkContext.parallelize(1 to 10).map(i => (i, i.toString)).toDF("num", "str")
    +
    +    // Case 1: with partitioning columns but no schema: Option("inexistentColumns")
    +    // Case 2: without schema and partitioning columns: None
    +    Seq(Option("num"), None).foreach { partitionCols =>
    +      withTempPath { pathToNonPartitionedTable =>
    +        df.write.format("parquet").save(pathToNonPartitionedTable.getCanonicalPath)
    +        val (tableSchema, partCols) =
    +          createDataSourceTable(
    +            pathToNonPartitionedTable,
    +            userSpecifiedSchema = Option("num int, str string"),
    +            userSpecifiedPartitionCols = partitionCols)
    +        assert(tableSchema ==
    +          StructType(StructField("num", IntegerType, nullable = true) ::
    +            StructField("str", StringType, nullable = true) :: Nil))
    +        assert(partCols.mkString(", ") == partitionCols.getOrElse(""))
    +      }
    +    }
    +  }
    +
    +  test("Describe Table with Corrupted Schema") {
    +    import testImplicits._
    +
    +    val tabName = "tab1"
    +    withTempPath { dir =>
    +      val path = dir.getCanonicalPath
    +      val df = sparkContext.parallelize(1 to 10).map(i => (i, i.toString)).toDF("col1", "col2")
    +      df.write.format("json").save(path)
    +
    +      withTable(tabName) {
    +        sql(
    +          s"""
    +             |CREATE TABLE $tabName
    +             |USING json
    +             |OPTIONS (
    +             |  path '$path'
    +             |)
    +           """.stripMargin)
    +
    +        val catalog = spark.sessionState.catalog
    +        val table = catalog.getTableMetadata(TableIdentifier(tabName))
    +        val newProperties = table.properties.filterKeys(key =>
    +          key != CreateDataSourceTableUtils.DATASOURCE_SCHEMA_NUMPARTS)
    +        val newTable = table.copy(properties = newProperties)
    +        catalog.alterTable(newTable)
    +
    +        val e = intercept[AnalysisException] {
    +          sql(s"DESC $tabName")
    +        }.getMessage
    +        assert(e.contains(s"Could not read schema from the metastore because it is corrupted"))
    +      }
    +    }
    +  }
    +
    +  test("Refresh table after changing the data source table partitioning") {
    +    import testImplicits._
    +
    +    val tabName = "tab1"
    +    val catalog = spark.sessionState.catalog
    +    withTempPath { dir =>
    +      val path = dir.getCanonicalPath
    +      val df = sparkContext.parallelize(1 to 10).map(i => (i, i.toString, i, i))
    +        .toDF("col1", "col2", "col3", "col4")
    +      df.write.format("json").partitionBy("col1", "col3").save(path)
    +      val schema = StructType(
    +        StructField("col2", StringType, nullable = true) ::
    +        StructField("col4", LongType, nullable = true) ::
    +        StructField("col1", IntegerType, nullable = true) ::
    +        StructField("col3", IntegerType, nullable = true) :: Nil)
    +      val partitionCols = Seq("col1", "col3")
    +
    +      // Ensure the schema is split to multiple properties.
    +      withSQLConf(SQLConf.SCHEMA_STRING_LENGTH_THRESHOLD.key -> "1") {
    --- End diff --
    
    Previously, we used this to verify that the refresh worked correctly when the table schema was split into multiple properties. Now that we no longer need to refresh the schema, we can remove it. Thanks!
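    
    For reference, a minimal sketch of the splitting mechanism under discussion, assuming the
    property key names below (they are illustrative, not the verified constants in
    `CreateDataSourceTableUtils`): the schema JSON is chunked by the
    `SQLConf.SCHEMA_STRING_LENGTH_THRESHOLD` setting and stored across several table
    properties, then reassembled on read.
    
    ```scala
    // Hedged sketch: splitting a long schema JSON string across table properties
    // and reassembling it. Key names are assumptions for illustration.
    object SchemaPropertiesSketch {
      val NumPartsKey = "spark.sql.sources.schema.numParts" // assumed key
      val PartKeyPrefix = "spark.sql.sources.schema.part."  // assumed prefix
    
      // Split the serialized schema into threshold-sized chunks.
      def saveSchema(schemaJson: String, threshold: Int): Map[String, String] = {
        val parts = schemaJson.grouped(threshold).toSeq
        val partProps = parts.zipWithIndex.map { case (part, i) =>
          s"$PartKeyPrefix$i" -> part
        }
        (partProps :+ (NumPartsKey -> parts.size.toString)).toMap
      }
    
      // Reassemble the schema; a missing numParts key is exactly the
      // "corrupted schema" condition the DESC test above exercises.
      def readSchema(props: Map[String, String]): Option[String] =
        props.get(NumPartsKey).map { n =>
          (0 until n.toInt).map(i => props(s"$PartKeyPrefix$i")).mkString
        }
    }
    ```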



[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    **[Test build #62581 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62581/consoleFull)** for PR 14207 at commit [`1ee1743`](https://github.com/apache/spark/commit/1ee1743906b41ffcc182cb8c74b4134bce8a3006).



[GitHub] spark pull request #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Sche...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71261112
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
    @@ -270,6 +291,11 @@ case class CreateDataSourceTableAsSelectCommand(
       }
     }
     
    +case class SchemaType private(name: String)
    +object SchemaType {
    --- End diff --
    
    will we have more schema types? If not, I think a boolean flag `isSchemaInferred` should be enough
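    
    For illustration, a minimal sketch of the two alternatives being weighed here; apart from
    `SchemaType` and `isSchemaInferred`, the names are hypothetical, not the PR's actual code:
    
    ```scala
    // Alternative 1: a closed set of named schema types, open to future additions.
    case class SchemaType private (name: String)
    object SchemaType {
      val Inferred = SchemaType("inferred")            // inferred at runtime
      val UserSpecified = SchemaType("userSpecified")  // supplied by the user
    }
    
    // Alternative 2: a plain boolean flag, sufficient if only two states exist.
    case class TableSchemaInfo(isSchemaInferred: Boolean)
    ```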



[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r73940350
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
    @@ -95,17 +95,39 @@ case class CreateDataSourceTableCommand(
           }
     
         // Create the relation to validate the arguments before writing the metadata to the metastore.
    -    DataSource(
    -      sparkSession = sparkSession,
    -      userSpecifiedSchema = userSpecifiedSchema,
    -      className = provider,
    -      bucketSpec = None,
    -      options = optionsWithPath).resolveRelation(checkPathExist = false)
    +    val dataSource: BaseRelation =
    +      DataSource(
    +        sparkSession = sparkSession,
    +        userSpecifiedSchema = userSpecifiedSchema,
    +        className = provider,
    +        bucketSpec = None,
    +        options = optionsWithPath).resolveRelation(checkPathExist = false)
    +
    +    val partitionColumns = if (userSpecifiedSchema.nonEmpty) {
    +      userSpecifiedPartitionColumns
    +    } else {
    +      val res = dataSource match {
    +        case r: HadoopFsRelation => r.partitionSchema.fieldNames
    +        case _ => Array.empty[String]
    +      }
    +      if (userSpecifiedPartitionColumns.length > 0) {
    --- End diff --
    
    Should we throw an exception for this case?
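    
    A minimal sketch of what the suggested guard might look like, assuming it lives inside
    Spark's `org.apache.spark.sql` packages (where `AnalysisException` can be constructed
    directly); the message wording is hypothetical:
    
    ```scala
    import org.apache.spark.sql.AnalysisException
    import org.apache.spark.sql.types.StructType
    
    // Hedged sketch: fail fast when partition columns are supplied without a
    // schema, rather than silently preferring the inferred partitioning.
    def validatePartitionColumns(
        userSpecifiedSchema: Option[StructType],
        userSpecifiedPartitionColumns: Array[String]): Unit = {
      if (userSpecifiedSchema.isEmpty && userSpecifiedPartitionColumns.nonEmpty) {
        throw new AnalysisException(
          "It is not allowed to specify partition columns when the table schema " +
          "is not defined; the partition columns will be inferred from the data.")
      }
    }
    ```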



[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    **[Test build #62574 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62574/consoleFull)** for PR 14207 at commit [`e930819`](https://github.com/apache/spark/commit/e93081918b170d3fbd08d992ef251c83af9e433d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    @cloud-fan @rxin @yhuai The code is ready for review. Thanks!



[GitHub] spark pull request #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Sche...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71383305
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala ---
    @@ -223,6 +223,9 @@ abstract class Catalog {
        * If this table is cached as an InMemoryRelation, drop the original cached version and make the
        * new version cached lazily.
        *
    +   * If the table's schema is inferred at runtime, infer the schema again and update the schema
    --- End diff --
    
    Yes, unfortunately I found out about this one too late. I will add a note to the 2.0 release notes that this change is coming.




[GitHub] spark pull request #14207: [SPARK-16552] [SQL] Store the Inferred Schemas in...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71474538
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
    @@ -316,27 +340,25 @@ object CreateDataSourceTableUtils extends Logging {
         tableProperties.put(DATASOURCE_PROVIDER, provider)
     
         // Saves optional user specified schema.  Serialized JSON schema string may be too long to be
    --- End diff --
    
    Yeah, will correct it. Thanks!



[GitHub] spark pull request #14207: [SPARK-16552] [SQL] [WIP] Store the Inferred Sche...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14207#discussion_r71266605
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala ---
    @@ -351,6 +353,44 @@ class CatalogImpl(sparkSession: SparkSession) extends Catalog {
       }
     
       /**
    +   * Refresh the inferred schema stored in the external catalog for data source tables.
    +   */
    +  private def refreshInferredSchema(tableIdent: TableIdentifier): Unit = {
    +    val table = sessionCatalog.getTableMetadataOption(tableIdent)
    +    table.foreach { tableDesc =>
    +      if (DDLUtils.isDatasourceTable(tableDesc) && DDLUtils.isSchemaInferred(tableDesc)) {
    +        val partitionColumns = DDLUtils.getPartitionColumnsFromTableProperties(tableDesc)
    +        val bucketSpec = DDLUtils.getBucketSpecFromTableProperties(tableDesc)
    +        val dataSource =
    +          DataSource(
    +            sparkSession,
    +            userSpecifiedSchema = None,
    +            partitionColumns = partitionColumns,
    +            bucketSpec = bucketSpec,
    +            className = tableDesc.properties(CreateDataSourceTableUtils.DATASOURCE_PROVIDER),
    +            options = tableDesc.storage.serdeProperties)
    +            .resolveRelation().asInstanceOf[HadoopFsRelation]
    +
    +        val schemaProperties = new mutable.HashMap[String, String]
    +        CreateDataSourceTableUtils.saveSchema(
    +          sparkSession, dataSource.schema, dataSource.partitionSchema.fieldNames, schemaProperties)
    +
    +        val tablePropertiesWithoutSchema = tableDesc.properties.filterKeys { k =>
    +          // Keep the properties that are not for schema or partition columns
    +          k != CreateDataSourceTableUtils.DATASOURCE_SCHEMA_NUMPARTS &&
    --- End diff --
    
    Will change it. Thanks!
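    
    For reference, a minimal sketch of one plausible reading of the requested change: filtering
    schema-related properties by a shared key prefix instead of enumerating each key. The
    prefix value is an assumption for illustration, not the verified constant:
    
    ```scala
    // Hedged sketch: drop every schema-related table property by shared key
    // prefix rather than listing each key individually.
    def withoutSchemaProperties(props: Map[String, String]): Map[String, String] = {
      val schemaKeyPrefix = "spark.sql.sources.schema"  // assumed prefix
      props.filterKeys(k => !k.startsWith(schemaKeyPrefix)).toMap
    }
    ```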



[GitHub] spark issue #14207: [SPARK-16552] [SQL] Store the Inferred Schemas into Exte...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14207
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62628/
    Test PASSed.

