Posted to reviews@spark.apache.org by vanzin <gi...@git.apache.org> on 2017/08/04 23:25:19 UTC

[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

GitHub user vanzin opened a pull request:

    https://github.com/apache/spark/pull/18849

    [SPARK-21617][SQL] Store correct table metadata when altering schema in Hive metastore.

    HiveExternalCatalog.alterTableSchema takes a shortcut by modifying the raw
    Hive table metadata instead of the full Spark view; that means it needs to
    be aware of whether the table is Hive-compatible or not.
    
    For compatible tables, the current "replace the schema" code is the correct
    path, except that an exception in that path should result in an error, and
    not in retrying in a different way.
    
    For non-compatible tables, Spark should just update the table properties,
    and leave the schema stored in the raw table untouched.
    
    Because Spark doesn't explicitly store metadata about whether a table is
    Hive-compatible or not, a new property was added just to make that explicit.
    The code tries to detect old DS tables that don't have the property and do
    the right thing.
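    
    In rough terms, the intended flow looks like the sketch below (a simplified illustration of the approach described above as it would read inside `HiveExternalCatalog`, not the exact patch; see the diff quoted later in this thread):
    
    ```scala
    // Simplified sketch of the new alterTableSchema flow (not the exact patch).
    override def alterTableSchema(db: String, table: String, schema: StructType): Unit = withClient {
      requireTableExists(db, table)
      val rawTable = getRawTable(db, table)
      // The updated schema always goes into the Spark-specific table properties.
      val updatedProps = rawTable.properties ++ tableMetaToTableProps(rawTable, schema)
    
      if (isHiveCompatible(rawTable)) {
        // Hive-compatible: also replace the schema stored in the raw Hive table.
        // A failure here is now a real error, not a trigger to retry differently.
        client.alterTable(rawTable.copy(schema = schema, properties = updatedProps))
      } else {
        // Non-compatible: only update the properties; leave the raw schema untouched.
        client.alterTable(rawTable.copy(properties = updatedProps))
      }
    }
    ```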
    
    These changes also uncovered a problem with the way case-sensitive DS tables
    were being saved to the Hive metastore; the metastore is case-insensitive,
    and the code was treating these tables as Hive-compatible if the data source
    had a Hive counterpart (e.g. for parquet). In this scenario, the schema
    could be corrupted when updated from Spark if the table contained columns that
    conflict once case is ignored. The change fixes this by making case-sensitive
    DS tables not Hive-compatible.
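    
    As a hypothetical illustration of the problem (not code from the PR): with case sensitivity enabled, Spark accepts a schema that a case-insensitive metastore cannot represent faithfully:
    
    ```scala
    import org.apache.spark.sql.types._
    
    // Valid for Spark when spark.sql.caseSensitive=true, but "col1" and "COL1"
    // collide once case is ignored, which is how the Hive metastore compares names.
    val conflicting = StructType(Seq(
      StructField("col1", IntegerType),
      StructField("COL1", StringType)))
    // Writing such a schema into Hive-compatible table metadata is what could
    // corrupt the table on a later ALTER; storing the table as non-compatible avoids it.
    ```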
    
    Tested with existing and added unit tests (plus internal tests with a 2.1 metastore).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vanzin/spark SPARK-21617

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18849.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18849
    
----
commit aae3abd673adc7ff939d842e49d566fa722403a3
Author: Marcelo Vanzin <va...@cloudera.com>
Date:   2017-08-02T21:47:34Z

    [SPARK-21617][SQL] Store correct metadata in Hive for altered DS table.
    
    This change fixes two issues:
    - when loading table metadata from Hive, restore the "provider" field of
      CatalogTable so DS tables can be identified.
    - when altering a DS table in the Hive metastore, make sure to not alter
      the table's schema, since the DS table's schema is stored as a table
      property in those cases.
    
    Also added a new unit test for this issue which fails without this change.

commit 2350b105a599dde849e44bde50aa6d13812e4f83
Author: Marcelo Vanzin <va...@cloudera.com>
Date:   2017-08-04T22:49:31Z

    Fix 2.1 DDL suite to not use SparkSession.

commit 7ccf4743024a8a447a4b05369f6ebf237cf88c4f
Author: Marcelo Vanzin <va...@cloudera.com>
Date:   2017-08-04T22:57:44Z

    Proper fix.
    
    HiveExternalCatalog.alterTableSchema takes a shortcut by modifying the raw
    Hive table metadata instead of the full Spark view; that means it needs to
    be aware of whether the table is Hive-compatible or not.
    
    For compatible tables, the current "replace the schema" code is the correct
    path, except that an exception in that path should result in an error, and
    not in retrying in a different way.
    
    For non-compatible tables, Spark should just update the table properties,
    and leave the schema stored in the raw table untouched.
    
    Because Spark doesn't explicitly store metadata about whether a table is
    Hive-compatible or not, a new property was added just to make that explicit.
    The code tries to detect old DS tables that don't have the property and do
    the right thing.
    
    These changes also uncovered a problem with the way case-sensitive DS tables
    were being saved to the Hive metastore; the metastore is case-insensitive,
    and the code was treating these tables as Hive-compatible if the data source
    had a Hive counterpart (e.g. for parquet). In this scenario, the schema
    could be corrupted when being updated from Spark if conflicting columns existed
    ignoring case. The change fixes this by making case-sensitive DS-tables not
    Hive-compatible.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    LGTM




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r132012867
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -611,21 +640,39 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       override def alterTableSchema(db: String, table: String, schema: StructType): Unit = withClient {
         requireTableExists(db, table)
         val rawTable = getRawTable(db, table)
    -    val withNewSchema = rawTable.copy(schema = schema)
    -    verifyColumnNames(withNewSchema)
         // Add table metadata such as table schema, partition columns, etc. to table properties.
    -    val updatedTable = withNewSchema.copy(
    -      properties = withNewSchema.properties ++ tableMetaToTableProps(withNewSchema))
    -    try {
    -      client.alterTable(updatedTable)
    -    } catch {
    -      case NonFatal(e) =>
    -        val warningMessage =
    -          s"Could not alter schema of table  ${rawTable.identifier.quotedString} in a Hive " +
    -            "compatible way. Updating Hive metastore in Spark SQL specific format."
    -        logWarning(warningMessage, e)
    -        client.alterTable(updatedTable.copy(schema = updatedTable.partitionSchema))
    +    val updatedProperties = rawTable.properties ++ tableMetaToTableProps(rawTable, schema)
    +
    +    // Detect whether this is a Hive-compatible table.
    +    val provider = rawTable.properties.get(DATASOURCE_PROVIDER)
    +    val isHiveCompatible = if (provider.isDefined && provider != Some(DDLUtils.HIVE_PROVIDER)) {
    --- End diff --
    
    Could you add a test case for the compatibility checking? I am just afraid it might not work as expected




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r132012423
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala ---
    @@ -1998,3 +2001,61 @@ class HiveDDLSuite
         }
       }
     }
    +
    +/**
    + * A separate set of DDL tests that uses Hive 2.1 libraries, which behave a little differently
    + * from the built-in ones.
    + */
    +@ExtendedHiveTest
    +class Hive_2_1_DDLSuite extends SparkFunSuite with TestHiveSingleton with BeforeAndAfterEach
    --- End diff --
    
    Could we create a separate suite for this? `HiveDDLSuite.scala` is too big now.




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r134000329
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -1175,6 +1205,27 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
         client.listFunctions(db, pattern)
       }
     
    +  /** Detect whether a table is stored with Hive-compatible metadata. */
    +  private def isHiveCompatible(table: CatalogTable): Boolean = {
    --- End diff --
    
    Also, if you look at `createDataSourceTable`, it does more checks than your example to decide whether a table is Hive-compatible or not. And the ultimate check is whether the Hive metastore accepts the table (an exception means Spark tries to create the table again, as non-compatible).
    
    So, again, you cannot detect compatibility by just checking the schema.
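    
    As a concrete illustration (mirroring the tests added in this PR): both tables below end up with the same, perfectly Hive-representable schema, yet only one of them is stored as Hive-compatible, because only parquet has a Hive serde counterpart:
    
    ```scala
    // Assuming a running SparkSession named `spark` (e.g. in spark-shell).
    // Same schema in both cases; only the provider differs.
    spark.sql("CREATE TABLE t_parquet (c1 INT) USING parquet")  // stored Hive-compatible
    spark.sql("CREATE TABLE t_json (c1 INT) USING json")        // stored Spark-specific only
    // Compatibility was decided at creation time (provider, serde, whether the
    // metastore accepted the table), so the schema alone cannot recover it later.
    ```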




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r131504834
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
    @@ -908,7 +909,13 @@ private[hive] object HiveClientImpl {
         }
         // after SPARK-19279, it is not allowed to create a hive table with an empty schema,
         // so here we should not add a default col schema
    -    if (schema.isEmpty && DDLUtils.isDatasourceTable(table)) {
    +    //
    +    // Because HiveExternalCatalog sometimes writes back "raw" tables that have not been
    +    // completely translated to Spark's view, the provider information needs to be looked
    +    // up in two places.
    +    val provider = table.provider.orElse(
    --- End diff --
    
    This change would have fixed the second exception in the bug (about storing an empty schema); but the code was just ending up in that situation because of the other problems this PR is fixing. This change shouldn't be needed for the fix, but I included it for further correctness.
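    
    For reference, the full expression (it also appears in the `HiveExternalCatalog` diff quoted later in this thread) is the two-place lookup described above:
    
    ```scala
    // Use the CatalogTable's provider if set, otherwise fall back to the raw table property.
    val provider = table.provider.orElse(table.properties.get(DATASOURCE_PROVIDER))
    ```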




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r131727081
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -611,21 +640,39 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       override def alterTableSchema(db: String, table: String, schema: StructType): Unit = withClient {
         requireTableExists(db, table)
         val rawTable = getRawTable(db, table)
    -    val withNewSchema = rawTable.copy(schema = schema)
    -    verifyColumnNames(withNewSchema)
         // Add table metadata such as table schema, partition columns, etc. to table properties.
    -    val updatedTable = withNewSchema.copy(
    -      properties = withNewSchema.properties ++ tableMetaToTableProps(withNewSchema))
    -    try {
    -      client.alterTable(updatedTable)
    -    } catch {
    -      case NonFatal(e) =>
    -        val warningMessage =
    -          s"Could not alter schema of table  ${rawTable.identifier.quotedString} in a Hive " +
    -            "compatible way. Updating Hive metastore in Spark SQL specific format."
    -        logWarning(warningMessage, e)
    -        client.alterTable(updatedTable.copy(schema = updatedTable.partitionSchema))
    +    val updatedProperties = rawTable.properties ++ tableMetaToTableProps(rawTable, schema)
    +
    +    // Detect whether this is a Hive-compatible table.
    +    val provider = rawTable.properties.get(DATASOURCE_PROVIDER)
    +    val isHiveCompatible = if (provider.isDefined && provider != Some(DDLUtils.HIVE_PROVIDER)) {
    --- End diff --
    
    I think you mean the check below in the `case _ =>` case, right?
    
    I see that both compatible and non-compatible tables set that property, at least in 2.1, so let me see if there's an easy way to differentiate that without having to replicate all the original checks (which may be hard to do at this point).




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r132061177
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/Hive_2_1_DDLSuite.scala ---
    @@ -0,0 +1,131 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.hive.execution
    +
    +import scala.language.existentials
    +
    +import org.apache.hadoop.conf.Configuration
    +import org.scalatest.{BeforeAndAfterAll, BeforeAndAfterEach}
    +
    +import org.apache.spark.{SparkConf, SparkFunSuite}
    +import org.apache.spark.launcher.SparkLauncher
    +import org.apache.spark.sql.AnalysisException
    +import org.apache.spark.sql.catalyst.catalog._
    +import org.apache.spark.sql.hive.{HiveExternalCatalog, HiveUtils}
    +import org.apache.spark.sql.hive.test.TestHiveSingleton
    +import org.apache.spark.sql.internal.StaticSQLConf._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.tags.ExtendedHiveTest
    +import org.apache.spark.util.Utils
    +
    +/**
    + * A separate set of DDL tests that uses Hive 2.1 libraries, which behave a little differently
    + * from the built-in ones.
    + */
    +@ExtendedHiveTest
    +class Hive_2_1_DDLSuite extends SparkFunSuite with TestHiveSingleton with BeforeAndAfterEach
    +  with BeforeAndAfterAll {
    +
    +  // Create a custom HiveExternalCatalog instance with the desired configuration. We cannot
    +  // use SparkSession here since there's already an active one managed by the TestHive object.
    +  private var catalog = {
    +    val warehouse = Utils.createTempDir()
    +    val metastore = Utils.createTempDir()
    +    metastore.delete()
    +    val sparkConf = new SparkConf()
    +      .set(SparkLauncher.SPARK_MASTER, "local")
    +      .set(WAREHOUSE_PATH.key, warehouse.toURI().toString())
    +      .set(CATALOG_IMPLEMENTATION.key, "hive")
    +      .set(HiveUtils.HIVE_METASTORE_VERSION.key, "2.1")
    +      .set(HiveUtils.HIVE_METASTORE_JARS.key, "maven")
    +
    +    val hadoopConf = new Configuration()
    +    hadoopConf.set("hive.metastore.warehouse.dir", warehouse.toURI().toString())
    +    hadoopConf.set("javax.jdo.option.ConnectionURL",
    +      s"jdbc:derby:;databaseName=${metastore.getAbsolutePath()};create=true")
    +    // These options are needed since the defaults in Hive 2.1 cause exceptions with an
    +    // empty metastore db.
    +    hadoopConf.set("datanucleus.schema.autoCreateAll", "true")
    +    hadoopConf.set("hive.metastore.schema.verification", "false")
    +
    +    new HiveExternalCatalog(sparkConf, hadoopConf)
    +  }
    +
    +  override def afterEach: Unit = {
    +    catalog.listTables("default").foreach { t =>
    +      catalog.dropTable("default", t, true, false)
    +    }
    +    spark.sessionState.catalog.reset()
    +  }
    +
    +  override def afterAll(): Unit = {
    +    catalog = null
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE for non-compatible DataSource tables") {
    +    testAlterTable(
    +      "t1",
    +      "CREATE TABLE t1 (c1 int) USING json",
    +      StructType(Array(StructField("c1", IntegerType), StructField("c2", IntegerType))),
    +      hiveCompatible = false)
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE for Hive-compatible DataSource tables") {
    +    testAlterTable(
    +      "t1",
    +      "CREATE TABLE t1 (c1 int) USING parquet",
    +      StructType(Array(StructField("c1", IntegerType), StructField("c2", IntegerType))))
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE for Hive tables") {
    +    testAlterTable(
    +      "t1",
    +      "CREATE TABLE t1 (c1 int) STORED AS parquet",
    +      StructType(Array(StructField("c1", IntegerType), StructField("c2", IntegerType))))
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE with incompatible schema on Hive-compatible table") {
    +    val exception = intercept[AnalysisException] {
    +      testAlterTable(
    +        "t1",
    +        "CREATE TABLE t1 (c1 string) USING parquet",
    +        StructType(Array(StructField("c2", IntegerType))))
    +    }
    +    assert(exception.getMessage().contains("types incompatible with the existing columns"))
    +  }
    +
    +  private def testAlterTable(
    +      tableName: String,
    +      createTableStmt: String,
    +      updatedSchema: StructType,
    +      hiveCompatible: Boolean = true): Unit = {
    +    spark.sql(createTableStmt)
    +    val oldTable = spark.sessionState.catalog.externalCatalog.getTable("default", tableName)
    +    catalog.createTable(oldTable, true)
    +    catalog.alterTableSchema("default", tableName, updatedSchema)
    +
    +    val updatedTable = catalog.getTable("default", tableName)
    +    assert(updatedTable.schema.fieldNames === updatedSchema.fieldNames)
    +
    +    val rawTable = catalog.getRawTable("default", tableName)
    +    val compatibility = rawTable.properties.get(HiveExternalCatalog.DATASOURCE_HIVE_COMPATIBLE)
    +      .map(_.toBoolean).getOrElse(true)
    +    assert(hiveCompatible === compatibility)
    --- End diff --
    
    We also need to test whether Hive can still read the altered table schema by using 
    ```
    spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client.runSqlHive
    ```
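    
    For example, something along these lines could be appended to `testAlterTable` (a sketch only; it assumes the custom catalog's Hive client is reachable from the test, and the check is only meaningful for Hive-compatible tables):
    
    ```scala
    // Sketch: ask Hive itself to describe the altered table and check for the new column.
    if (hiveCompatible) {
      val hiveOutput = catalog.client.runSqlHive(s"DESCRIBE default.$tableName")
      assert(hiveOutput.exists(_.contains("c2")))
    }
    ```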




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    I think there is still a lot of confusion around here about what this is fixing. I see a bunch of comments related to testing the schema for compatibility.
    
    That does not work. Schema compatibility is not the issue here; the issue is whether the table was *initially* created as Hive-compatible or not. This is the Hive metastore, not Spark, complaining, so the Spark-side schema for non-compatible tables is pretty irrelevant.
    
    The schema by itself does not provide enough information to detect whether a table is compatible or not. Even if the schema is Hive-compatible, the data source may not have a Hive counterpart, or the table might have been initially created in a case-sensitive session and have column names that conflict when case is ignored, or a few other things, all of which are checked at table creation time.
    
    The same checks *cannot* be done later, and should not be done. If the table was non-compatible it should remain non-compatible, and vice-versa. The only thing that is needed is a way to detect that single property of the table. You cannot do that just from the schema as has been proposed a few times here.
    
    There are two options:
    - use an explicit option, which is the approach I took
    - use some combination of metadata written by old Spark versions that tells you whether the table is compatible or not.
    
    The only thing that exists for the second one is the serde field in the storage descriptor. Spark sets it to either `None` or some placeholder that does not match the datasource serde. I use that fact as a fallback for when the property does not exist, but I think it's safer to have an explicit property for that instead of relying on these artifacts.
    
    Hope that clarifies things.




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    I can see the value of maintaining Hive compatibility for users who use Hive and Spark SQL together. We can do it in a separate PR. We also need to change `CREATE TABLE` for such usage scenarios. WDYT?




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80414/
    Test PASSed.




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r133236830
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -1175,6 +1205,27 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
         client.listFunctions(db, pattern)
       }
     
    +  /** Detect whether a table is stored with Hive-compatible metadata. */
    +  private def isHiveCompatible(table: CatalogTable): Boolean = {
    --- End diff --
    
    There's already code to check compatibility in `createDataSourceTable`. The problem is that the logic there can only really be applied in the context of that method. Later on you cannot detect compatibility the same way, and just checking the schema is *not* enough.
    
    The previous code was prone to leave corrupted tables (in old versions of Hive) or just plain not work (in newer versions of Hive). Having a consistent way of checking for Hive compatibility is key to that fix, because `alterTableSchema` has to behave differently for each case.




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r132087575
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
    @@ -908,7 +909,13 @@ private[hive] object HiveClientImpl {
         }
         // after SPARK-19279, it is not allowed to create a hive table with an empty schema,
         // so here we should not add a default col schema
    --- End diff --
    
    Looks like this comment should be moved accordingly?



[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    **[Test build #80432 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80432/testReport)** for PR 18849 at commit [`7b777ed`](https://github.com/apache/spark/commit/7b777edccddf91f2dfb99b30855265284188e00b).



[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    **[Test build #80270 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80270/testReport)** for PR 18849 at commit [`7ccf474`](https://github.com/apache/spark/commit/7ccf4743024a8a447a4b05369f6ebf237cf88c4f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r133236101
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -342,6 +359,12 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
                 "Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. "
             (None, message)
     
    +      case _ if currentSessionConf(SQLConf.CASE_SENSITIVE) =>
    --- End diff --
    
    Ok, I'll remove this change. The write-path change you propose isn't necessary because if you have an "invalid" schema (same column name with different case), the Hive metastore will complain and the table will be stored as non-Hive-compatible.
    
    The problem this was trying to avoid is related to the changes in `alterTableSchema`: if you create a Hive-compatible table here and later try to update it with an invalid schema, you'd end up with a frankentable because the code in `alterTableSchema` was wrong.
    
    But since this change is mainly about fixing `alterTableSchema`, you'll now get a proper error in that case instead of ending up with a potentially corrupted table.




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r132525593
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/Hive_2_1_DDLSuite.scala ---
    @@ -0,0 +1,131 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.hive.execution
    +
    +import scala.language.existentials
    +
    +import org.apache.hadoop.conf.Configuration
    +import org.scalatest.{BeforeAndAfterAll, BeforeAndAfterEach}
    +
    +import org.apache.spark.{SparkConf, SparkFunSuite}
    +import org.apache.spark.launcher.SparkLauncher
    +import org.apache.spark.sql.AnalysisException
    +import org.apache.spark.sql.catalyst.catalog._
    +import org.apache.spark.sql.hive.{HiveExternalCatalog, HiveUtils}
    +import org.apache.spark.sql.hive.test.TestHiveSingleton
    +import org.apache.spark.sql.internal.StaticSQLConf._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.tags.ExtendedHiveTest
    +import org.apache.spark.util.Utils
    +
    +/**
    + * A separate set of DDL tests that uses Hive 2.1 libraries, which behave a little differently
    + * from the built-in ones.
    + */
    +@ExtendedHiveTest
    +class Hive_2_1_DDLSuite extends SparkFunSuite with TestHiveSingleton with BeforeAndAfterEach
    +  with BeforeAndAfterAll {
    +
    +  // Create a custom HiveExternalCatalog instance with the desired configuration. We cannot
    +  // use SparkSession here since there's already an active one managed by the TestHive object.
    +  private var catalog = {
    +    val warehouse = Utils.createTempDir()
    +    val metastore = Utils.createTempDir()
    +    metastore.delete()
    +    val sparkConf = new SparkConf()
    +      .set(SparkLauncher.SPARK_MASTER, "local")
    +      .set(WAREHOUSE_PATH.key, warehouse.toURI().toString())
    +      .set(CATALOG_IMPLEMENTATION.key, "hive")
    +      .set(HiveUtils.HIVE_METASTORE_VERSION.key, "2.1")
    +      .set(HiveUtils.HIVE_METASTORE_JARS.key, "maven")
    +
    +    val hadoopConf = new Configuration()
    +    hadoopConf.set("hive.metastore.warehouse.dir", warehouse.toURI().toString())
    +    hadoopConf.set("javax.jdo.option.ConnectionURL",
    +      s"jdbc:derby:;databaseName=${metastore.getAbsolutePath()};create=true")
    +    // These options are needed since the defaults in Hive 2.1 cause exceptions with an
    +    // empty metastore db.
    +    hadoopConf.set("datanucleus.schema.autoCreateAll", "true")
    +    hadoopConf.set("hive.metastore.schema.verification", "false")
    +
    +    new HiveExternalCatalog(sparkConf, hadoopConf)
    +  }
    +
    +  override def afterEach: Unit = {
    +    catalog.listTables("default").foreach { t =>
    +      catalog.dropTable("default", t, true, false)
    +    }
    +    spark.sessionState.catalog.reset()
    +  }
    +
    +  override def afterAll(): Unit = {
    +    catalog = null
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE for non-compatible DataSource tables") {
    +    testAlterTable(
    +      "t1",
    +      "CREATE TABLE t1 (c1 int) USING json",
    +      StructType(Array(StructField("c1", IntegerType), StructField("c2", IntegerType))),
    +      hiveCompatible = false)
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE for Hive-compatible DataSource tables") {
    +    testAlterTable(
    +      "t1",
    +      "CREATE TABLE t1 (c1 int) USING parquet",
    +      StructType(Array(StructField("c1", IntegerType), StructField("c2", IntegerType))))
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE for Hive tables") {
    +    testAlterTable(
    +      "t1",
    +      "CREATE TABLE t1 (c1 int) STORED AS parquet",
    +      StructType(Array(StructField("c1", IntegerType), StructField("c2", IntegerType))))
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE with incompatible schema on Hive-compatible table") {
    +    val exception = intercept[AnalysisException] {
    +      testAlterTable(
    +        "t1",
    +        "CREATE TABLE t1 (c1 string) USING parquet",
    +        StructType(Array(StructField("c2", IntegerType))))
    +    }
    +    assert(exception.getMessage().contains("types incompatible with the existing columns"))
    +  }
    +
    +  private def testAlterTable(
    +      tableName: String,
    +      createTableStmt: String,
    +      updatedSchema: StructType,
    +      hiveCompatible: Boolean = true): Unit = {
    +    spark.sql(createTableStmt)
    +    val oldTable = spark.sessionState.catalog.externalCatalog.getTable("default", tableName)
    +    catalog.createTable(oldTable, true)
    +    catalog.alterTableSchema("default", tableName, updatedSchema)
    +
    +    val updatedTable = catalog.getTable("default", tableName)
    +    assert(updatedTable.schema.fieldNames === updatedSchema.fieldNames)
    +
    +    val rawTable = catalog.getRawTable("default", tableName)
    +    val compatibility = rawTable.properties.get(HiveExternalCatalog.DATASOURCE_HIVE_COMPATIBLE)
    +      .map(_.toBoolean).getOrElse(true)
    +    assert(hiveCompatible === compatibility)
    --- End diff --
    
    I checked the test case coverage. We do not have such a test case. Could you add one to the following test cases?
    https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala#L1865-L1923
    
    I think this PR is also trying to keep the table readable by Hive after Spark adds columns.




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r132250345
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -1175,6 +1205,27 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
         client.listFunctions(db, pattern)
       }
     
    +  /** Detect whether a table is stored with Hive-compatible metadata. */
    +  private def isHiveCompatible(table: CatalogTable): Boolean = {
    +    val provider = table.provider.orElse(table.properties.get(DATASOURCE_PROVIDER))
    +    if (provider.isDefined && provider != Some(DDLUtils.HIVE_PROVIDER)) {
    +      table.properties.get(DATASOURCE_HIVE_COMPATIBLE) match {
    +        case Some(value) =>
    +          value.toBoolean
    +        case _ =>
    +          // If the property is not set, the table may have been created by an old version
    +          // of Spark. Detect Hive compatibility by comparing the table's serde with the
    +          // serde for the table's data source. If they match, the table is Hive-compatible.
    +          // If they don't, they're not, because of some other table property that made it
    +          // not initially Hive-compatible.
    +          HiveSerDe.sourceToSerDe(provider.get) == table.storage.serde
    --- End diff --
    
    Case-sensitive tables are weird. Case sensitivity is a session configuration, but IMO that config should affect compatibility, because even if you create a table that is Hive-compatible initially, you could modify it later so that it's not Hive-compatible anymore. Seems like the 1.2 Hive libraries would allow the broken metadata, while the 2.1 libraries complain about it.
    
    So yes, currently when case-sensitivity is enabled you still create tables that may be Hive-compatible, and this change forces those tables to not be Hive-compatible.
    
    As for existing tables, there's no way to know, because that data is not present anywhere in the table's metadata. (It's not after my change either, so basically you can read that table with a case-insensitive session and who knows what might happen.)
    
    I'm ok with reverting this part since it's all a little hazy, but just wanted to point out that it's a kinda weird part of the code.




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    If `ALTER TABLE` breaks Hive compatibility, the value of this flag becomes misleading. Currently, the naming of this flag is pretty general. I expect this flag could be used in other places in the future (besides `ALTER TABLE ADD COLUMN`). Introducing a flag is simple, but maintaining the flag needs more work. That is why we do not want to introduce extra new flags if they are not required.
    
    If we want to introduce such a flag, we also need to ensure the value is always correct. That means we need to follow [what we are doing in the CREATE TABLE code path](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L364-L374). When the Hive metastore complains about the table, we should also set the flag to `false`.
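    
    A rough sketch of that idea (the helper names below are illustrative, not the actual Spark code): the CREATE TABLE fallback would stamp the same property when the metastore rejects the Hive-compatible form, so the flag always reflects what was actually persisted:
    
    ```scala
    import scala.util.control.NonFatal
    import org.apache.spark.sql.catalyst.catalog.CatalogTable
    
    // Illustrative sketch only: record which form was actually persisted.
    def saveWithCompatibilityFlag(
        hiveCompatible: CatalogTable,
        sparkSpecific: CatalogTable,
        flagKey: String,
        save: CatalogTable => Unit): Unit = {
      try {
        save(hiveCompatible)  // flagKey is already "true" on this table
      } catch {
        case NonFatal(_) =>
          // The metastore rejected the Hive-compatible form: fall back and say so.
          save(sparkSpecific.copy(
            properties = sparkSpecific.properties + (flagKey -> "false")))
      }
    }
    ```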
    
    
    





[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    **[Test build #80694 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80694/testReport)** for PR 18849 at commit [`4a05b55`](https://github.com/apache/spark/commit/4a05b55b5c23755cba384cf85a0af2b802b8a9bd).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/18849




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    **[Test build #80432 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80432/testReport)** for PR 18849 at commit [`7b777ed`](https://github.com/apache/spark/commit/7b777edccddf91f2dfb99b30855265284188e00b).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    **[Test build #80694 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80694/testReport)** for PR 18849 at commit [`4a05b55`](https://github.com/apache/spark/commit/4a05b55b5c23755cba384cf85a0af2b802b8a9bd).




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    **[Test build #80463 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80463/testReport)** for PR 18849 at commit [`abd6cf1`](https://github.com/apache/spark/commit/abd6cf119c0b400fda9c96fcb6432b090c2505bf).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r132107646
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/Hive_2_1_DDLSuite.scala ---
    @@ -0,0 +1,131 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.hive.execution
    +
    +import scala.language.existentials
    +
    +import org.apache.hadoop.conf.Configuration
    +import org.scalatest.{BeforeAndAfterAll, BeforeAndAfterEach}
    +
    +import org.apache.spark.{SparkConf, SparkFunSuite}
    +import org.apache.spark.launcher.SparkLauncher
    +import org.apache.spark.sql.AnalysisException
    +import org.apache.spark.sql.catalyst.catalog._
    +import org.apache.spark.sql.hive.{HiveExternalCatalog, HiveUtils}
    +import org.apache.spark.sql.hive.test.TestHiveSingleton
    +import org.apache.spark.sql.internal.StaticSQLConf._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.tags.ExtendedHiveTest
    +import org.apache.spark.util.Utils
    +
    +/**
    + * A separate set of DDL tests that uses Hive 2.1 libraries, which behave a little differently
    + * from the built-in ones.
    + */
    +@ExtendedHiveTest
    +class Hive_2_1_DDLSuite extends SparkFunSuite with TestHiveSingleton with BeforeAndAfterEach
    +  with BeforeAndAfterAll {
    +
    +  // Create a custom HiveExternalCatalog instance with the desired configuration. We cannot
    +  // use SparkSession here since there's already an active one managed by the TestHive object.
    +  private var catalog = {
    +    val warehouse = Utils.createTempDir()
    +    val metastore = Utils.createTempDir()
    +    metastore.delete()
    +    val sparkConf = new SparkConf()
    +      .set(SparkLauncher.SPARK_MASTER, "local")
    +      .set(WAREHOUSE_PATH.key, warehouse.toURI().toString())
    +      .set(CATALOG_IMPLEMENTATION.key, "hive")
    +      .set(HiveUtils.HIVE_METASTORE_VERSION.key, "2.1")
    +      .set(HiveUtils.HIVE_METASTORE_JARS.key, "maven")
    +
    +    val hadoopConf = new Configuration()
    +    hadoopConf.set("hive.metastore.warehouse.dir", warehouse.toURI().toString())
    +    hadoopConf.set("javax.jdo.option.ConnectionURL",
    +      s"jdbc:derby:;databaseName=${metastore.getAbsolutePath()};create=true")
    +    // These options are needed since the defaults in Hive 2.1 cause exceptions with an
    +    // empty metastore db.
    +    hadoopConf.set("datanucleus.schema.autoCreateAll", "true")
    +    hadoopConf.set("hive.metastore.schema.verification", "false")
    +
    +    new HiveExternalCatalog(sparkConf, hadoopConf)
    +  }
    +
    +  override def afterEach: Unit = {
    +    catalog.listTables("default").foreach { t =>
    +      catalog.dropTable("default", t, true, false)
    +    }
    +    spark.sessionState.catalog.reset()
    +  }
    +
    +  override def afterAll(): Unit = {
    +    catalog = null
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE for non-compatible DataSource tables") {
    +    testAlterTable(
    +      "t1",
    +      "CREATE TABLE t1 (c1 int) USING json",
    +      StructType(Array(StructField("c1", IntegerType), StructField("c2", IntegerType))),
    +      hiveCompatible = false)
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE for Hive-compatible DataSource tables") {
    +    testAlterTable(
    +      "t1",
    +      "CREATE TABLE t1 (c1 int) USING parquet",
    +      StructType(Array(StructField("c1", IntegerType), StructField("c2", IntegerType))))
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE for Hive tables") {
    +    testAlterTable(
    +      "t1",
    +      "CREATE TABLE t1 (c1 int) STORED AS parquet",
    +      StructType(Array(StructField("c1", IntegerType), StructField("c2", IntegerType))))
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE with incompatible schema on Hive-compatible table") {
    +    val exception = intercept[AnalysisException] {
    +      testAlterTable(
    +        "t1",
    +        "CREATE TABLE t1 (c1 string) USING parquet",
    +        StructType(Array(StructField("c2", IntegerType))))
    +    }
    +    assert(exception.getMessage().contains("types incompatible with the existing columns"))
    +  }
    +
    +  private def testAlterTable(
    +      tableName: String,
    +      createTableStmt: String,
    +      updatedSchema: StructType,
    +      hiveCompatible: Boolean = true): Unit = {
    +    spark.sql(createTableStmt)
    +    val oldTable = spark.sessionState.catalog.externalCatalog.getTable("default", tableName)
    +    catalog.createTable(oldTable, true)
    +    catalog.alterTableSchema("default", tableName, updatedSchema)
    +
    +    val updatedTable = catalog.getTable("default", tableName)
    +    assert(updatedTable.schema.fieldNames === updatedSchema.fieldNames)
    +
    +    val rawTable = catalog.getRawTable("default", tableName)
    +    val compatibility = rawTable.properties.get(HiveExternalCatalog.DATASOURCE_HIVE_COMPATIBLE)
    +      .map(_.toBoolean).getOrElse(true)
    +    assert(hiveCompatible === compatibility)
    --- End diff --
    
    My only comment here is to ensure the altered table is still readable by Hive.



[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r132058586
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -1193,6 +1242,7 @@ object HiveExternalCatalog {
       val DATASOURCE_SCHEMA_PARTCOL_PREFIX = DATASOURCE_SCHEMA_PREFIX + "partCol."
       val DATASOURCE_SCHEMA_BUCKETCOL_PREFIX = DATASOURCE_SCHEMA_PREFIX + "bucketCol."
       val DATASOURCE_SCHEMA_SORTCOL_PREFIX = DATASOURCE_SCHEMA_PREFIX + "sortCol."
    +  val DATASOURCE_HIVE_COMPATIBLE = SPARK_SQL_PREFIX + "hive.compatibility"
    --- End diff --
    
    Use `DATASOURCE_PREFIX`?




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80935/
    Test PASSed.




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r134038745
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -1175,6 +1205,27 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
         client.listFunctions(db, pattern)
       }
     
    +  /** Detect whether a table is stored with Hive-compatible metadata. */
    +  private def isHiveCompatible(table: CatalogTable): Boolean = {
    --- End diff --
    
    > The value of this new compatibility conf/flag becomes invalid.
    
    Also, that's not true. While it's true that old Spark versions can still corrupt these tables, this property is supposed to be a reliable way to detect compatibility going forward, so that if there are more cases similar to this, they can be handled without having to make guesses about whether the table is compatible or not.
    
    So, in my view, as long as old Spark versions don't get rid of the property when altering tables, and it seems they don't, it's beneficial to have this explicit compatibility flag.




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    Will review it today.




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    If the flag is set to true, then whenever an "alter table" command is executed, it will follow the "Hive compatible" path, which lets the Hive metastore decide whether the change is valid or not. So, to the best of Spark's knowledge, compatibility is maintained because Hive did not complain about it. No other table metadata (e.g. storage info) is changed by that command.
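
    In sketch form, that branching looks roughly like this (names follow the diffs quoted in this thread; this is an illustrative sketch, not the exact patch):

        override def alterTableSchema(db: String, table: String, schema: StructType): Unit =
          withClient {
            requireTableExists(db, table)
            val rawTable = getRawTable(db, table)
            val updatedProps = rawTable.properties ++ tableMetaToTableProps(rawTable, schema)

            if (isHiveCompatible(rawTable)) {
              // Hive-compatible table: write the new schema and let the metastore reject
              // changes it considers invalid; the error propagates instead of retrying.
              client.alterTable(rawTable.copy(schema = schema, properties = updatedProps))
            } else {
              // Non-compatible table: only the Spark-specific table properties change; the
              // schema stored in the raw Hive table is left untouched.
              client.alterTable(rawTable.copy(properties = updatedProps))
            }
          }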




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    Sure, I will put it in my to-do list. Thank you very much! 




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80423/
    Test PASSed.



[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r132751607
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -1175,6 +1205,27 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
         client.listFunctions(db, pattern)
       }
     
    +  /** Detect whether a table is stored with Hive-compatible metadata. */
    +  private def isHiveCompatible(table: CatalogTable): Boolean = {
    +    val provider = table.provider.orElse(table.properties.get(DATASOURCE_PROVIDER))
    +    if (provider.isDefined && provider != Some(DDLUtils.HIVE_PROVIDER)) {
    +      table.properties.get(DATASOURCE_HIVE_COMPATIBLE) match {
    +        case Some(value) =>
    +          value.toBoolean
    +        case _ =>
    +          // If the property is not set, the table may have been created by an old version
    +          // of Spark. Detect Hive compatibility by comparing the table's serde with the
    +          // serde for the table's data source. If they match, the table is Hive-compatible.
    +          // If they don't, they're not, because of some other table property that made it
    +          // not initially Hive-compatible.
    +          HiveSerDe.sourceToSerDe(provider.get) == table.storage.serde
    --- End diff --
    
    Hey all, could I get a thumbs up / down on the case-sensitiveness-handling part of this change?




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80358/
    Test PASSed.




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    **[Test build #80270 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80270/testReport)** for PR 18849 at commit [`7ccf474`](https://github.com/apache/spark/commit/7ccf4743024a8a447a4b05369f6ebf237cf88c4f).




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    **[Test build #80405 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80405/testReport)** for PR 18849 at commit [`2f57a3c`](https://github.com/apache/spark/commit/2f57a3c1db2b4f8e58456b48bbc62ef01fa14633).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    **[Test build #80936 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80936/testReport)** for PR 18849 at commit [`c41683a`](https://github.com/apache/spark/commit/c41683a907cfd9dad9b16d462e7d8f6b5f78c200).




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    That's fine if you want to do it; I'm just not signing up to actually do it myself. I'm more worried about Spark actually working with a 2.1 metastore, which it currently doesn't in a few scenarios.




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    This is a corrected version of #18824 after I tracked the actual failure and looked at the suggested code paths in the original review.




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    **[Test build #80358 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80358/testReport)** for PR 18849 at commit [`40ebc96`](https://github.com/apache/spark/commit/40ebc966ff8e9ad742f2cdbf631a289b8388560a).




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    **[Test build #80935 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80935/testReport)** for PR 18849 at commit [`cef66ac`](https://github.com/apache/spark/commit/cef66ac783a096cc5c623fa52ac9276321e554f9).




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r131750071
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -611,21 +640,39 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       override def alterTableSchema(db: String, table: String, schema: StructType): Unit = withClient {
         requireTableExists(db, table)
         val rawTable = getRawTable(db, table)
    -    val withNewSchema = rawTable.copy(schema = schema)
    -    verifyColumnNames(withNewSchema)
         // Add table metadata such as table schema, partition columns, etc. to table properties.
    -    val updatedTable = withNewSchema.copy(
    -      properties = withNewSchema.properties ++ tableMetaToTableProps(withNewSchema))
    -    try {
    -      client.alterTable(updatedTable)
    -    } catch {
    -      case NonFatal(e) =>
    -        val warningMessage =
    -          s"Could not alter schema of table  ${rawTable.identifier.quotedString} in a Hive " +
    -            "compatible way. Updating Hive metastore in Spark SQL specific format."
    -        logWarning(warningMessage, e)
    -        client.alterTable(updatedTable.copy(schema = updatedTable.partitionSchema))
    +    val updatedProperties = rawTable.properties ++ tableMetaToTableProps(rawTable, schema)
    +
    +    // Detect whether this is a Hive-compatible table.
    +    val provider = rawTable.properties.get(DATASOURCE_PROVIDER)
    +    val isHiveCompatible = if (provider.isDefined && provider != Some(DDLUtils.HIVE_PROVIDER)) {
    --- End diff --
    
    I also checked 2.0 and 1.6 and both seem to do the same thing (both set the provider, and both use a different serde for non-compatible tables), so the check should work for those versions too.




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r132088495
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -1175,6 +1205,27 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
         client.listFunctions(db, pattern)
       }
     
    +  /** Detect whether a table is stored with Hive-compatible metadata. */
    +  private def isHiveCompatible(table: CatalogTable): Boolean = {
    +    val provider = table.provider.orElse(table.properties.get(DATASOURCE_PROVIDER))
    +    if (provider.isDefined && provider != Some(DDLUtils.HIVE_PROVIDER)) {
    +      table.properties.get(DATASOURCE_HIVE_COMPATIBLE) match {
    +        case Some(value) =>
    +          value.toBoolean
    +        case _ =>
    +          // If the property is not set, the table may have been created by an old version
    +          // of Spark. Detect Hive compatibility by comparing the table's serde with the
    +          // serde for the table's data source. If they match, the table is Hive-compatible.
    +          // If they don't, they're not, because of some other table property that made it
    +          // not initially Hive-compatible.
    +          HiveSerDe.sourceToSerDe(provider.get) == table.storage.serde
    --- End diff --
    
    There is a change regarding treating case-sensitive DS tables as Hive-incompatible. When the given table is such a table but doesn't have the new `DATASOURCE_HIVE_COMPATIBLE` property, should we treat it as Hive-compatible or incompatible? It looks like, for now, we treat it as compatible?



[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80463/
    Test PASSed.




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80432/
    Test FAILed.




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    > If ALTER TABLE makes the hive compatibility broken, the value of this flag becomes misleading.
    
    That's the whole point of the flag and what the current changes do! It takes different paths when handling alter table depending on whether the table is compatible. So if the table was compatible, it will remain compatible (or otherwise Hive should complain about the updated table, as it does in certain cases).
    
    So I really do not understand what it is you're not understanding about the patch.
    
    > When Hive metastore complained about it, we should also set it to false.
    
    Absolutely not. If you have a Hive compatible table and you try to update its schema with something that Hive complains about, YOU SHOULD GET AN ERROR. And that's what the current patch does. You should not try to mess up the table even further. The old code was just plain broken in this regard.





[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r133999734
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -1175,6 +1205,27 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
         client.listFunctions(db, pattern)
       }
     
    +  /** Detect whether a table is stored with Hive-compatible metadata. */
    +  private def isHiveCompatible(table: CatalogTable): Boolean = {
    --- End diff --
    
    Let me turn the question around for you: why is it bad to have the compatibility bit be a table property?
    
    You cannot make a non-compatible table compatible after it's been created. At the very least, there is zero code that would achieve that today. And the key to this fix is that the `alterTableSchema` method needs to know whether the table is compatible or not in a reliable way.




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r132249235
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/Hive_2_1_DDLSuite.scala ---
    @@ -0,0 +1,131 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.hive.execution
    +
    +import scala.language.existentials
    +
    +import org.apache.hadoop.conf.Configuration
    +import org.scalatest.{BeforeAndAfterAll, BeforeAndAfterEach}
    +
    +import org.apache.spark.{SparkConf, SparkFunSuite}
    +import org.apache.spark.launcher.SparkLauncher
    +import org.apache.spark.sql.AnalysisException
    +import org.apache.spark.sql.catalyst.catalog._
    +import org.apache.spark.sql.hive.{HiveExternalCatalog, HiveUtils}
    +import org.apache.spark.sql.hive.test.TestHiveSingleton
    +import org.apache.spark.sql.internal.StaticSQLConf._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.tags.ExtendedHiveTest
    +import org.apache.spark.util.Utils
    +
    +/**
    + * A separate set of DDL tests that uses Hive 2.1 libraries, which behave a little differently
    + * from the built-in ones.
    + */
    +@ExtendedHiveTest
    +class Hive_2_1_DDLSuite extends SparkFunSuite with TestHiveSingleton with BeforeAndAfterEach
    +  with BeforeAndAfterAll {
    +
    +  // Create a custom HiveExternalCatalog instance with the desired configuration. We cannot
    +  // use SparkSession here since there's already an active one managed by the TestHive object.
    +  private var catalog = {
    +    val warehouse = Utils.createTempDir()
    +    val metastore = Utils.createTempDir()
    +    metastore.delete()
    +    val sparkConf = new SparkConf()
    +      .set(SparkLauncher.SPARK_MASTER, "local")
    +      .set(WAREHOUSE_PATH.key, warehouse.toURI().toString())
    +      .set(CATALOG_IMPLEMENTATION.key, "hive")
    +      .set(HiveUtils.HIVE_METASTORE_VERSION.key, "2.1")
    +      .set(HiveUtils.HIVE_METASTORE_JARS.key, "maven")
    +
    +    val hadoopConf = new Configuration()
    +    hadoopConf.set("hive.metastore.warehouse.dir", warehouse.toURI().toString())
    +    hadoopConf.set("javax.jdo.option.ConnectionURL",
    +      s"jdbc:derby:;databaseName=${metastore.getAbsolutePath()};create=true")
    +    // These options are needed since the defaults in Hive 2.1 cause exceptions with an
    +    // empty metastore db.
    +    hadoopConf.set("datanucleus.schema.autoCreateAll", "true")
    +    hadoopConf.set("hive.metastore.schema.verification", "false")
    +
    +    new HiveExternalCatalog(sparkConf, hadoopConf)
    +  }
    +
    +  override def afterEach: Unit = {
    +    catalog.listTables("default").foreach { t =>
    +      catalog.dropTable("default", t, true, false)
    +    }
    +    spark.sessionState.catalog.reset()
    +  }
    +
    +  override def afterAll(): Unit = {
    +    catalog = null
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE for non-compatible DataSource tables") {
    +    testAlterTable(
    +      "t1",
    +      "CREATE TABLE t1 (c1 int) USING json",
    +      StructType(Array(StructField("c1", IntegerType), StructField("c2", IntegerType))),
    +      hiveCompatible = false)
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE for Hive-compatible DataSource tables") {
    +    testAlterTable(
    +      "t1",
    +      "CREATE TABLE t1 (c1 int) USING parquet",
    +      StructType(Array(StructField("c1", IntegerType), StructField("c2", IntegerType))))
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE for Hive tables") {
    +    testAlterTable(
    +      "t1",
    +      "CREATE TABLE t1 (c1 int) STORED AS parquet",
    +      StructType(Array(StructField("c1", IntegerType), StructField("c2", IntegerType))))
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE with incompatible schema on Hive-compatible table") {
    +    val exception = intercept[AnalysisException] {
    +      testAlterTable(
    +        "t1",
    +        "CREATE TABLE t1 (c1 string) USING parquet",
    +        StructType(Array(StructField("c2", IntegerType))))
    +    }
    +    assert(exception.getMessage().contains("types incompatible with the existing columns"))
    +  }
    +
    +  private def testAlterTable(
    +      tableName: String,
    +      createTableStmt: String,
    +      updatedSchema: StructType,
    +      hiveCompatible: Boolean = true): Unit = {
    +    spark.sql(createTableStmt)
    +    val oldTable = spark.sessionState.catalog.externalCatalog.getTable("default", tableName)
    +    catalog.createTable(oldTable, true)
    +    catalog.alterTableSchema("default", tableName, updatedSchema)
    +
    +    val updatedTable = catalog.getTable("default", tableName)
    +    assert(updatedTable.schema.fieldNames === updatedSchema.fieldNames)
    +
    +    val rawTable = catalog.getRawTable("default", tableName)
    +    val compatibility = rawTable.properties.get(HiveExternalCatalog.DATASOURCE_HIVE_COMPATIBLE)
    +      .map(_.toBoolean).getOrElse(true)
    +    assert(hiveCompatible === compatibility)
    --- End diff --
    
    I understand, but it's really hard to write that kind of test without a serious rewrite of the tests in the hive module, so that you can have multiple `SparkSession` instances.
    
    Right now, I think the best we can achieve is "the metastore has accepted the table so the metadata looks ok", and assume that the tests performed elsewhere (e.g. HiveDDLSuite), where a proper `SparkSession` exists, are enough to make sure Hive can read the data.




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    **[Test build #80423 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80423/testReport)** for PR 18849 at commit [`6824e35`](https://github.com/apache/spark/commit/6824e358d231b7c56031c946d983cde0b89fd574).




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    **[Test build #80936 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80936/testReport)** for PR 18849 at commit [`c41683a`](https://github.com/apache/spark/commit/c41683a907cfd9dad9b16d462e7d8f6b5f78c200).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    Ping?




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r134029768
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -1175,6 +1205,27 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
         client.listFunctions(db, pattern)
       }
     
    +  /** Detect whether a table is stored with Hive-compatible metadata. */
    +  private def isHiveCompatible(table: CatalogTable): Boolean = {
    --- End diff --
    
    A quick look shows that all places that do a metastore `alterTable` use `getRawTable` to get the metadata, meaning they would preserve the new property even in old Spark versions.
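
    In other words, the usual fetch-raw / copy / write-back pattern carries unknown properties along unchanged. A small hedged illustration (the owner change is just a hypothetical example of an unrelated alteration):

        // CatalogTable is a case class, so copy() keeps every field that is not explicitly
        // overridden, including table properties such as the new compatibility flag.
        val rawTable = getRawTable(db, table)           // properties contain the flag
        val updated = rawTable.copy(owner = "someone")  // hypothetical unrelated change
        client.alterTable(updated)                      // flag is written back untouched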




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r132071898
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/Hive_2_1_DDLSuite.scala ---
    @@ -0,0 +1,131 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.hive.execution
    +
    +import scala.language.existentials
    +
    +import org.apache.hadoop.conf.Configuration
    +import org.scalatest.{BeforeAndAfterAll, BeforeAndAfterEach}
    +
    +import org.apache.spark.{SparkConf, SparkFunSuite}
    +import org.apache.spark.launcher.SparkLauncher
    +import org.apache.spark.sql.AnalysisException
    +import org.apache.spark.sql.catalyst.catalog._
    +import org.apache.spark.sql.hive.{HiveExternalCatalog, HiveUtils}
    +import org.apache.spark.sql.hive.test.TestHiveSingleton
    +import org.apache.spark.sql.internal.StaticSQLConf._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.tags.ExtendedHiveTest
    +import org.apache.spark.util.Utils
    +
    +/**
    + * A separate set of DDL tests that uses Hive 2.1 libraries, which behave a little differently
    + * from the built-in ones.
    + */
    +@ExtendedHiveTest
    +class Hive_2_1_DDLSuite extends SparkFunSuite with TestHiveSingleton with BeforeAndAfterEach
    +  with BeforeAndAfterAll {
    +
    +  // Create a custom HiveExternalCatalog instance with the desired configuration. We cannot
    +  // use SparkSession here since there's already an active one managed by the TestHive object.
    +  private var catalog = {
    +    val warehouse = Utils.createTempDir()
    +    val metastore = Utils.createTempDir()
    +    metastore.delete()
    +    val sparkConf = new SparkConf()
    +      .set(SparkLauncher.SPARK_MASTER, "local")
    +      .set(WAREHOUSE_PATH.key, warehouse.toURI().toString())
    +      .set(CATALOG_IMPLEMENTATION.key, "hive")
    +      .set(HiveUtils.HIVE_METASTORE_VERSION.key, "2.1")
    +      .set(HiveUtils.HIVE_METASTORE_JARS.key, "maven")
    +
    +    val hadoopConf = new Configuration()
    +    hadoopConf.set("hive.metastore.warehouse.dir", warehouse.toURI().toString())
    +    hadoopConf.set("javax.jdo.option.ConnectionURL",
    +      s"jdbc:derby:;databaseName=${metastore.getAbsolutePath()};create=true")
    +    // These options are needed since the defaults in Hive 2.1 cause exceptions with an
    +    // empty metastore db.
    +    hadoopConf.set("datanucleus.schema.autoCreateAll", "true")
    +    hadoopConf.set("hive.metastore.schema.verification", "false")
    +
    +    new HiveExternalCatalog(sparkConf, hadoopConf)
    +  }
    +
    +  override def afterEach: Unit = {
    +    catalog.listTables("default").foreach { t =>
    +      catalog.dropTable("default", t, true, false)
    +    }
    +    spark.sessionState.catalog.reset()
    +  }
    +
    +  override def afterAll(): Unit = {
    +    catalog = null
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE for non-compatible DataSource tables") {
    +    testAlterTable(
    +      "t1",
    +      "CREATE TABLE t1 (c1 int) USING json",
    +      StructType(Array(StructField("c1", IntegerType), StructField("c2", IntegerType))),
    +      hiveCompatible = false)
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE for Hive-compatible DataSource tables") {
    +    testAlterTable(
    +      "t1",
    +      "CREATE TABLE t1 (c1 int) USING parquet",
    +      StructType(Array(StructField("c1", IntegerType), StructField("c2", IntegerType))))
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE for Hive tables") {
    +    testAlterTable(
    +      "t1",
    +      "CREATE TABLE t1 (c1 int) STORED AS parquet",
    +      StructType(Array(StructField("c1", IntegerType), StructField("c2", IntegerType))))
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE with incompatible schema on Hive-compatible table") {
    +    val exception = intercept[AnalysisException] {
    +      testAlterTable(
    +        "t1",
    +        "CREATE TABLE t1 (c1 string) USING parquet",
    +        StructType(Array(StructField("c2", IntegerType))))
    +    }
    +    assert(exception.getMessage().contains("types incompatible with the existing columns"))
    +  }
    +
    +  private def testAlterTable(
    +      tableName: String,
    +      createTableStmt: String,
    +      updatedSchema: StructType,
    +      hiveCompatible: Boolean = true): Unit = {
    +    spark.sql(createTableStmt)
    +    val oldTable = spark.sessionState.catalog.externalCatalog.getTable("default", tableName)
    +    catalog.createTable(oldTable, true)
    +    catalog.alterTableSchema("default", tableName, updatedSchema)
    +
    +    val updatedTable = catalog.getTable("default", tableName)
    +    assert(updatedTable.schema.fieldNames === updatedSchema.fieldNames)
    +
    +    val rawTable = catalog.getRawTable("default", tableName)
    +    val compatibility = rawTable.properties.get(HiveExternalCatalog.DATASOURCE_HIVE_COMPATIBLE)
    +      .map(_.toBoolean).getOrElse(true)
    +    assert(hiveCompatible === compatibility)
    --- End diff --
    
    That's not easy to do here. The catalog being updated is not the same as the one the Spark session is using. You can potentially run queries against the 2.1 catalog in the test, but how do you insert data into the table? (You could run a Hive query for that too, but then what's the point?)
    
    I'd argue this kind of test should be done in `HiveDDLSuite` if it doesn't do it now; and if it's desired to test against multiple Hive versions, that it needs to be re-worked so it can be run against multiple Hive versions. But `TestHiveSingleton` makes that really hard currently, and fixing that is way beyond the scope of this change.



[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r134027875
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -1175,6 +1205,27 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
         client.listFunctions(db, pattern)
       }
     
    +  /** Detect whether a table is stored with Hive-compatible metadata. */
    +  private def isHiveCompatible(table: CatalogTable): Boolean = {
    --- End diff --
    
    > why is it bad to have the compatibility bit be a table property?
    
    This compatibility bit is only processable/parsable by Spark SQL (v2.3+, if we add it). If another Spark SQL engine (e.g., v2.2) shares the same metastore, it can make a schema change by altering the table properties (i.e., the behavior before this PR). That would break the assumption we make here, and the value of this new compatibility conf/flag would become invalid.
    
    So far, the safest way to check the compatibility is to compare the schemas. If you think that is not enough, we can add the same thing we do for CREATE TABLE.




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r133134000
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -342,6 +359,12 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
                 "Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. "
             (None, message)
     
    +      case _ if currentSessionConf(SQLConf.CASE_SENSITIVE) =>
    --- End diff --
    
    I think we should look at the schema instead of looking at the config. It's possible that, even with the case-sensitive config on, the column names are all lowercase and the table is still Hive-compatible.
    
    My proposal: check `schema.asLowerCased == schema`; if it's false, then the table is not Hive-compatible. We need to add `StructType.asLowerCased`, though.
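
    For illustration, a minimal self-contained sketch of that check, assuming a top-level-only `asLowerCased` (neither helper exists in Spark yet):

        import org.apache.spark.sql.types._

        // Hedged sketch of the proposal above: column naming is Hive-friendly only if
        // lower-casing every top-level field name leaves the schema unchanged.
        // (A real implementation would likely also recurse into nested struct fields.)
        def asLowerCased(schema: StructType): StructType =
          StructType(schema.fields.map(f => f.copy(name = f.name.toLowerCase)))

        def hasOnlyLowerCaseNames(schema: StructType): Boolean =
          asLowerCased(schema) == schema

        // Example: a schema with columns (c1, c2) passes; one with (C1, c2) does not.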




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r132526607
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/Hive_2_1_DDLSuite.scala ---
    @@ -0,0 +1,131 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.hive.execution
    +
    +import scala.language.existentials
    +
    +import org.apache.hadoop.conf.Configuration
    +import org.scalatest.{BeforeAndAfterAll, BeforeAndAfterEach}
    +
    +import org.apache.spark.{SparkConf, SparkFunSuite}
    +import org.apache.spark.launcher.SparkLauncher
    +import org.apache.spark.sql.AnalysisException
    +import org.apache.spark.sql.catalyst.catalog._
    +import org.apache.spark.sql.hive.{HiveExternalCatalog, HiveUtils}
    +import org.apache.spark.sql.hive.test.TestHiveSingleton
    +import org.apache.spark.sql.internal.StaticSQLConf._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.tags.ExtendedHiveTest
    +import org.apache.spark.util.Utils
    +
    +/**
    + * A separate set of DDL tests that uses Hive 2.1 libraries, which behave a little differently
    + * from the built-in ones.
    + */
    +@ExtendedHiveTest
    +class Hive_2_1_DDLSuite extends SparkFunSuite with TestHiveSingleton with BeforeAndAfterEach
    +  with BeforeAndAfterAll {
    +
    +  // Create a custom HiveExternalCatalog instance with the desired configuration. We cannot
    +  // use SparkSession here since there's already an active one managed by the TestHive object.
    +  private var catalog = {
    +    val warehouse = Utils.createTempDir()
    +    val metastore = Utils.createTempDir()
    +    metastore.delete()
    +    val sparkConf = new SparkConf()
    +      .set(SparkLauncher.SPARK_MASTER, "local")
    +      .set(WAREHOUSE_PATH.key, warehouse.toURI().toString())
    +      .set(CATALOG_IMPLEMENTATION.key, "hive")
    +      .set(HiveUtils.HIVE_METASTORE_VERSION.key, "2.1")
    +      .set(HiveUtils.HIVE_METASTORE_JARS.key, "maven")
    +
    +    val hadoopConf = new Configuration()
    +    hadoopConf.set("hive.metastore.warehouse.dir", warehouse.toURI().toString())
    +    hadoopConf.set("javax.jdo.option.ConnectionURL",
    +      s"jdbc:derby:;databaseName=${metastore.getAbsolutePath()};create=true")
    +    // These options are needed since the defaults in Hive 2.1 cause exceptions with an
    +    // empty metastore db.
    +    hadoopConf.set("datanucleus.schema.autoCreateAll", "true")
    +    hadoopConf.set("hive.metastore.schema.verification", "false")
    +
    +    new HiveExternalCatalog(sparkConf, hadoopConf)
    +  }
    +
    +  override def afterEach: Unit = {
    +    catalog.listTables("default").foreach { t =>
    +      catalog.dropTable("default", t, true, false)
    +    }
    +    spark.sessionState.catalog.reset()
    +  }
    +
    +  override def afterAll(): Unit = {
    +    catalog = null
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE for non-compatible DataSource tables") {
    +    testAlterTable(
    +      "t1",
    +      "CREATE TABLE t1 (c1 int) USING json",
    +      StructType(Array(StructField("c1", IntegerType), StructField("c2", IntegerType))),
    +      hiveCompatible = false)
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE for Hive-compatible DataSource tables") {
    +    testAlterTable(
    +      "t1",
    +      "CREATE TABLE t1 (c1 int) USING parquet",
    +      StructType(Array(StructField("c1", IntegerType), StructField("c2", IntegerType))))
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE for Hive tables") {
    +    testAlterTable(
    +      "t1",
    +      "CREATE TABLE t1 (c1 int) STORED AS parquet",
    +      StructType(Array(StructField("c1", IntegerType), StructField("c2", IntegerType))))
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE with incompatible schema on Hive-compatible table") {
    +    val exception = intercept[AnalysisException] {
    +      testAlterTable(
    +        "t1",
    +        "CREATE TABLE t1 (c1 string) USING parquet",
    +        StructType(Array(StructField("c2", IntegerType))))
    +    }
    +    assert(exception.getMessage().contains("types incompatible with the existing columns"))
    +  }
    +
    +  private def testAlterTable(
    +      tableName: String,
    +      createTableStmt: String,
    +      updatedSchema: StructType,
    +      hiveCompatible: Boolean = true): Unit = {
    +    spark.sql(createTableStmt)
    +    val oldTable = spark.sessionState.catalog.externalCatalog.getTable("default", tableName)
    +    catalog.createTable(oldTable, true)
    +    catalog.alterTableSchema("default", tableName, updatedSchema)
    +
    +    val updatedTable = catalog.getTable("default", tableName)
    +    assert(updatedTable.schema.fieldNames === updatedSchema.fieldNames)
    +
    +    val rawTable = catalog.getRawTable("default", tableName)
    +    val compatibility = rawTable.properties.get(HiveExternalCatalog.DATASOURCE_HIVE_COMPATIBLE)
    +      .map(_.toBoolean).getOrElse(true)
    +    assert(hiveCompatible === compatibility)
    --- End diff --
    
    > I think this PR is also trying to make Hive readable after Spark adds columns.
    
    No, that should already be the case. This PR just makes the existing feature work with Hive 2.1.
    
    I really would like to avoid turning this PR into "let's fix all the Hive tests to make sure they make sense". If you'd like I can open a bug to track that, but that is *not* what this change is about and I'd like to keep it focused.




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    **[Test build #80405 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80405/testReport)** for PR 18849 at commit [`2f57a3c`](https://github.com/apache/spark/commit/2f57a3c1db2b4f8e58456b48bbc62ef01fa14633).




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    **[Test build #80414 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80414/testReport)** for PR 18849 at commit [`0b27209`](https://github.com/apache/spark/commit/0b272094c8a490e16066e85fc48ad59c4ccf0468).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r132060528
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -611,21 +640,41 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       override def alterTableSchema(db: String, table: String, schema: StructType): Unit = withClient {
         requireTableExists(db, table)
         val rawTable = getRawTable(db, table)
    -    val withNewSchema = rawTable.copy(schema = schema)
    -    verifyColumnNames(withNewSchema)
         // Add table metadata such as table schema, partition columns, etc. to table properties.
    -    val updatedTable = withNewSchema.copy(
    -      properties = withNewSchema.properties ++ tableMetaToTableProps(withNewSchema))
    -    try {
    -      client.alterTable(updatedTable)
    -    } catch {
    -      case NonFatal(e) =>
    -        val warningMessage =
    -          s"Could not alter schema of table  ${rawTable.identifier.quotedString} in a Hive " +
    -            "compatible way. Updating Hive metastore in Spark SQL specific format."
    -        logWarning(warningMessage, e)
    -        client.alterTable(updatedTable.copy(schema = updatedTable.partitionSchema))
    +    val updatedProperties = rawTable.properties ++ tableMetaToTableProps(rawTable, schema)
    +
    +    // Detect whether this is a Hive-compatible table.
    +    val provider = rawTable.properties.get(DATASOURCE_PROVIDER)
    +    val isHiveCompatible = if (provider.isDefined && provider != Some(DDLUtils.HIVE_PROVIDER)) {
    --- End diff --
    
    Could you create a separate utility function for `isHiveCompatible` in HiveExternalCatalog.scala?




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r133133656
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -288,6 +303,7 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
         // bucket specification to empty. Note that partition columns are retained, so that we can
         // call partition-related Hive API later.
         def newSparkSQLSpecificMetastoreTable(): CatalogTable = {
    +      val hiveCompatible = Map(DATASOURCE_HIVE_COMPATIBLE -> "false")
    --- End diff --
    
    This would have been a good idea if we had done it from the first version. But now, for backward compatibility, we have to handle the case without this special flag on the read path, so I can't see the point of having the flag.
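
    To spell that out (a sketch only, and the property key shown is hypothetical), the read path ends up needing a fallback regardless of whether new writes stamp the flag:
    
    ```scala
    // Sketch: resolve compatibility from an explicit flag when present, otherwise fall back
    // to whatever detection logic is still needed for tables written by older Spark versions.
    def resolveHiveCompatible(
        tableProps: Map[String, String],
        legacyDetection: => Boolean): Boolean = {
      tableProps.get("hive.compatible" /* hypothetical key */) match {
        case Some(value) => value.toBoolean
        case None => legacyDetection
      }
    }
    ```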




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r132027248
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -611,21 +640,39 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       override def alterTableSchema(db: String, table: String, schema: StructType): Unit = withClient {
         requireTableExists(db, table)
         val rawTable = getRawTable(db, table)
    -    val withNewSchema = rawTable.copy(schema = schema)
    -    verifyColumnNames(withNewSchema)
         // Add table metadata such as table schema, partition columns, etc. to table properties.
    -    val updatedTable = withNewSchema.copy(
    -      properties = withNewSchema.properties ++ tableMetaToTableProps(withNewSchema))
    -    try {
    -      client.alterTable(updatedTable)
    -    } catch {
    -      case NonFatal(e) =>
    -        val warningMessage =
    -          s"Could not alter schema of table  ${rawTable.identifier.quotedString} in a Hive " +
    -            "compatible way. Updating Hive metastore in Spark SQL specific format."
    -        logWarning(warningMessage, e)
    -        client.alterTable(updatedTable.copy(schema = updatedTable.partitionSchema))
    +    val updatedProperties = rawTable.properties ++ tableMetaToTableProps(rawTable, schema)
    +
    +    // Detect whether this is a Hive-compatible table.
    +    val provider = rawTable.properties.get(DATASOURCE_PROVIDER)
    +    val isHiveCompatible = if (provider.isDefined && provider != Some(DDLUtils.HIVE_PROVIDER)) {
    --- End diff --
    
    We plan to submit a separate PR for verifying all the related cross-version issues. That needs to verify most DDL statements. You can ignore my previous comment. Thanks!




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    > If that is not the point of "Hive compatibility", then there is no point in creating data source tables in a Hive compatible way to start with. Just always create them as "not Hive compatible" because then Spark is free to do whatever it wants with them.
    
    In most usage scenarios of Spark-native file source tables, users do not query the tables with Hive. Thus, breaking or maintaining Hive compatibility will not affect them, and their DDL commands on the data source tables should not be blocked even if the Hive metastore complains.
    
    For Hive users who want to query Spark-native file source tables, we can introduce a property like `DATASOURCE_HIVE_COMPATIBLE` to ensure that Hive compatibility is not broken over the whole life cycle of these tables. This property would have to be set manually by users, instead of being added by Spark SQL.
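
    As a rough illustration only (the property key below is hypothetical; neither the PR nor Spark defines a user-facing key for this), such a manually-set marker could look like an ordinary table property:
    
    ```scala
    // Hypothetical sketch: a user opts a data source table into "keep it Hive-compatible"
    // semantics via an explicit table property.
    spark.sql("CREATE TABLE t1 (c1 INT) USING parquet")
    spark.sql("ALTER TABLE t1 SET TBLPROPERTIES ('hive.compatible' = 'true')")

    // Spark SQL could then read the property back when deciding how to alter the table.
    val props = spark.sessionState.catalog.externalCatalog.getTable("default", "t1").properties
    val keepHiveCompatible = props.get("hive.compatible").exists(_.toBoolean)
    ```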




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    > This is how we behave in CREATE TABLE.
    
    Yes, and I'm not advocating changing that. That is fine and that is correct.
    
    The problem is what to do after the table has already been created. At that point, "Hive compatibility" is already a property of the table. If you break it, you might break a Hive application that was able to read from the table before. So it's wrong, in my view, to change compatibility at that point.
    
    If that is not the point of "Hive compatibility", then there is no point in creating data source tables in a Hive compatible way to start with. Just always create them as "not Hive compatible" because then Spark is free to do whatever it wants with them.
    
    At best, you could implement the current fallback behavior, but only if it's a data source table. It is just wrong to fall back to the exception-handling case for normal Hive tables. But even then, that sort of makes the case for storing data source tables as Hive-compatible rather flimsy.
    
    > In addition, we should avoid introducing a flag just for fixing a specific scenario. 
    
    The flag is not for fixing this specific scenario. The flag is for checking the Hive compatibility property of the table, so that the code can make the correct decision when Hive compatibility is an issue, as is the case for "alter table".
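
    For concreteness, here is a minimal sketch of the branching such a flag would drive (a paraphrase of the intent, not the PR's actual code; `writeSchema` and `writePropertiesOnly` are stand-ins for the real metastore calls):
    
    ```scala
    import org.apache.spark.sql.catalyst.catalog.CatalogTable
    import org.apache.spark.sql.types.StructType

    // Sketch only: the compatibility decision alterTableSchema has to make.
    def applyAlter(
        rawTable: CatalogTable,
        newSchema: StructType,
        newProps: Map[String, String],
        hiveCompatible: Boolean,
        writeSchema: CatalogTable => Unit,
        writePropertiesOnly: CatalogTable => Unit): Unit = {
      val withProps = rawTable.copy(properties = newProps)
      if (hiveCompatible) {
        // Hive-compatible table: write the real schema; a failure here should surface
        // as an error rather than silently downgrading the table.
        writeSchema(withProps.copy(schema = newSchema))
      } else {
        // Not Hive-compatible: leave the raw Hive schema alone; the Spark schema lives
        // only in the table properties.
        writePropertiesOnly(withProps)
      }
    }
    ```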




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r131700138
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -611,21 +640,39 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       override def alterTableSchema(db: String, table: String, schema: StructType): Unit = withClient {
         requireTableExists(db, table)
         val rawTable = getRawTable(db, table)
    -    val withNewSchema = rawTable.copy(schema = schema)
    -    verifyColumnNames(withNewSchema)
         // Add table metadata such as table schema, partition columns, etc. to table properties.
    -    val updatedTable = withNewSchema.copy(
    -      properties = withNewSchema.properties ++ tableMetaToTableProps(withNewSchema))
    -    try {
    -      client.alterTable(updatedTable)
    -    } catch {
    -      case NonFatal(e) =>
    -        val warningMessage =
    -          s"Could not alter schema of table  ${rawTable.identifier.quotedString} in a Hive " +
    -            "compatible way. Updating Hive metastore in Spark SQL specific format."
    -        logWarning(warningMessage, e)
    -        client.alterTable(updatedTable.copy(schema = updatedTable.partitionSchema))
    +    val updatedProperties = rawTable.properties ++ tableMetaToTableProps(rawTable, schema)
    +
    +    // Detect whether this is a Hive-compatible table.
    +    val provider = rawTable.properties.get(DATASOURCE_PROVIDER)
    +    val isHiveCompatible = if (provider.isDefined && provider != Some(DDLUtils.HIVE_PROVIDER)) {
    --- End diff --
    
    +1. Since we still need to handle the case without the special flag for old Spark versions, it makes more sense to just detect Hive compatibility by comparing the raw table schema and the table schema from the table properties.




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    When Hive complains, we should still let users update the Spark-native file source tables. In Spark SQL, we do our best to make the native data source tables Hive-compatible. However, we should not block users just because the Hive metastore complained. This is how we behave in [CREATE TABLE](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L364-L374).
    
    If users really require reading our Spark-native data source tables from Hive, we should introduce a SQLConf or table-specific option and update the corresponding part in `CREATE TABLE` too. 
    
    In addition, we should avoid introducing a flag just for fixing a specific scenario. Thus, I still think comparing the table schemas is preferred for such a fix. Could you show an example that could break it? cc @cloud-fan 




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    **[Test build #80935 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80935/testReport)** for PR 18849 at commit [`cef66ac`](https://github.com/apache/spark/commit/cef66ac783a096cc5c623fa52ac9276321e554f9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    My main concern is that we should not change the write path unless necessary. For backward compatibility, the read path needs to handle all possible cases, and changing the write path just adds more complexity to the read path.
    
    To be clear, if there are new features that can only be implemented with write-path changes, we should do it. If you only want to clean something up, changing the write path can only make it worse.




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r132059925
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
    @@ -908,7 +909,13 @@ private[hive] object HiveClientImpl {
         }
         // after SPARK-19279, it is not allowed to create a hive table with an empty schema,
         // so here we should not add a default col schema
    -    if (schema.isEmpty && DDLUtils.isDatasourceTable(table)) {
    +    //
    +    // Because HiveExternalCatalog sometimes writes back "raw" tables that have not been
    +    // completely translated to Spark's view, the provider information needs to be looked
    +    // up in two places.
    +    val provider = table.provider.orElse(
    --- End diff --
    
    What is the second exception? Could you explain more? If this is fixing a different bug, could you open a new JIRA and put it in the PR title?
    
     




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r133134774
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -342,6 +359,12 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
                 "Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. "
             (None, message)
     
    +      case _ if currentSessionConf(SQLConf.CASE_SENSITIVE) =>
    --- End diff --
    
    Actually, is this a useful change? On the read path we still need to handle the case where a Hive-compatible table has an inconsistent schema between the table properties and the metastore metadata.




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r131894824
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
    @@ -2356,18 +2356,9 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils {
                 }.getMessage
                 assert(e.contains("Found duplicate column(s)"))
               } else {
    -            if (isUsingHiveMetastore) {
    -              // hive catalog will still complains that c1 is duplicate column name because hive
    -              // identifiers are case insensitive.
    -              val e = intercept[AnalysisException] {
    -                sql("ALTER TABLE t1 ADD COLUMNS (C1 string)")
    -              }.getMessage
    -              assert(e.contains("HiveException"))
    -            } else {
    -              sql("ALTER TABLE t1 ADD COLUMNS (C1 string)")
    -              assert(spark.table("t1").schema
    -                .equals(new StructType().add("c1", IntegerType).add("C1", StringType)))
    -            }
    +            sql("ALTER TABLE t1 ADD COLUMNS (C1 string)")
    +            assert(spark.table("t1").schema
    +              .equals(new StructType().add("c1", IntegerType).add("C1", StringType)))
    --- End diff --
    
    `.equals` -> `==`
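
    In other words (illustration only, assuming the test's `spark` session and imports), the assertion would read:
    
    ```scala
    import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

    // Suggested style: compare StructTypes with `==` rather than `.equals`.
    val expected = new StructType().add("c1", IntegerType).add("C1", StringType)
    assert(spark.table("t1").schema == expected)
    ```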




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    **[Test build #80423 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80423/testReport)** for PR 18849 at commit [`6824e35`](https://github.com/apache/spark/commit/6824e358d231b7c56031c946d983cde0b89fd574).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r134029245
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -1175,6 +1205,27 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
         client.listFunctions(db, pattern)
       }
     
    +  /** Detect whether a table is stored with Hive-compatible metadata. */
    +  private def isHiveCompatible(table: CatalogTable): Boolean = {
    --- End diff --
    
    > So far, the safest way to check the compatibility is to compare the schema.
    
    That is not enough, as I've tried to explain several times. `alterTableSchema` is broken in older Spark versions and can end up creating corrupt tables (where the schema does not match the storage descriptor, for example). So you need a reliable way of detecting compatibility, and the schema is not it.
    
    The closest we have currently is the storage descriptor (the fallback case in my code).
    
    Let me think about what happens when an older Spark version touches tables that have the new property set. Even in older Spark versions, `alterTableSchema` will maintain that information, but maybe other code paths won't.




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r131516698
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -611,21 +640,39 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       override def alterTableSchema(db: String, table: String, schema: StructType): Unit = withClient {
         requireTableExists(db, table)
         val rawTable = getRawTable(db, table)
    -    val withNewSchema = rawTable.copy(schema = schema)
    -    verifyColumnNames(withNewSchema)
         // Add table metadata such as table schema, partition columns, etc. to table properties.
    -    val updatedTable = withNewSchema.copy(
    -      properties = withNewSchema.properties ++ tableMetaToTableProps(withNewSchema))
    -    try {
    -      client.alterTable(updatedTable)
    -    } catch {
    -      case NonFatal(e) =>
    -        val warningMessage =
    -          s"Could not alter schema of table  ${rawTable.identifier.quotedString} in a Hive " +
    -            "compatible way. Updating Hive metastore in Spark SQL specific format."
    -        logWarning(warningMessage, e)
    -        client.alterTable(updatedTable.copy(schema = updatedTable.partitionSchema))
    +    val updatedProperties = rawTable.properties ++ tableMetaToTableProps(rawTable, schema)
    +
    +    // Detect whether this is a Hive-compatible table.
    +    val provider = rawTable.properties.get(DATASOURCE_PROVIDER)
    +    val isHiveCompatible = if (provider.isDefined && provider != Some(DDLUtils.HIVE_PROVIDER)) {
    --- End diff --
    
    The whole check might not support all the previous versions. We have changed these flags multiple times, so we might break support for table metadata created by previous versions of Spark.
    
    How about directly comparing the schemas and checking whether they are Hive-compatible? cc @cloud-fan WDYT?




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r132058727
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala ---
    @@ -35,7 +35,7 @@ import org.apache.spark.sql.hive.HiveExternalCatalog
     import org.apache.spark.sql.hive.orc.OrcFileOperator
     import org.apache.spark.sql.hive.test.TestHiveSingleton
     import org.apache.spark.sql.internal.{HiveSerDe, SQLConf}
    -import org.apache.spark.sql.internal.StaticSQLConf.CATALOG_IMPLEMENTATION
    +import org.apache.spark.sql.internal.StaticSQLConf._
    --- End diff --
    
    Revert it back?




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80694/
    Test PASSed.




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r132027648
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -611,21 +640,39 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       override def alterTableSchema(db: String, table: String, schema: StructType): Unit = withClient {
         requireTableExists(db, table)
         val rawTable = getRawTable(db, table)
    -    val withNewSchema = rawTable.copy(schema = schema)
    -    verifyColumnNames(withNewSchema)
         // Add table metadata such as table schema, partition columns, etc. to table properties.
    -    val updatedTable = withNewSchema.copy(
    -      properties = withNewSchema.properties ++ tableMetaToTableProps(withNewSchema))
    -    try {
    -      client.alterTable(updatedTable)
    -    } catch {
    -      case NonFatal(e) =>
    -        val warningMessage =
    -          s"Could not alter schema of table  ${rawTable.identifier.quotedString} in a Hive " +
    -            "compatible way. Updating Hive metastore in Spark SQL specific format."
    -        logWarning(warningMessage, e)
    -        client.alterTable(updatedTable.copy(schema = updatedTable.partitionSchema))
    +    val updatedProperties = rawTable.properties ++ tableMetaToTableProps(rawTable, schema)
    +
    +    // Detect whether this is a Hive-compatible table.
    +    val provider = rawTable.properties.get(DATASOURCE_PROVIDER)
    +    val isHiveCompatible = if (provider.isDefined && provider != Some(DDLUtils.HIVE_PROVIDER)) {
    --- End diff --
    
    Too late now, already added the tests.




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r132527085
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/Hive_2_1_DDLSuite.scala ---
    @@ -0,0 +1,131 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.hive.execution
    +
    +import scala.language.existentials
    +
    +import org.apache.hadoop.conf.Configuration
    +import org.scalatest.{BeforeAndAfterAll, BeforeAndAfterEach}
    +
    +import org.apache.spark.{SparkConf, SparkFunSuite}
    +import org.apache.spark.launcher.SparkLauncher
    +import org.apache.spark.sql.AnalysisException
    +import org.apache.spark.sql.catalyst.catalog._
    +import org.apache.spark.sql.hive.{HiveExternalCatalog, HiveUtils}
    +import org.apache.spark.sql.hive.test.TestHiveSingleton
    +import org.apache.spark.sql.internal.StaticSQLConf._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.tags.ExtendedHiveTest
    +import org.apache.spark.util.Utils
    +
    +/**
    + * A separate set of DDL tests that uses Hive 2.1 libraries, which behave a little differently
    + * from the built-in ones.
    + */
    +@ExtendedHiveTest
    +class Hive_2_1_DDLSuite extends SparkFunSuite with TestHiveSingleton with BeforeAndAfterEach
    +  with BeforeAndAfterAll {
    +
    +  // Create a custom HiveExternalCatalog instance with the desired configuration. We cannot
    +  // use SparkSession here since there's already an active one managed by the TestHive object.
    +  private var catalog = {
    +    val warehouse = Utils.createTempDir()
    +    val metastore = Utils.createTempDir()
    +    metastore.delete()
    +    val sparkConf = new SparkConf()
    +      .set(SparkLauncher.SPARK_MASTER, "local")
    +      .set(WAREHOUSE_PATH.key, warehouse.toURI().toString())
    +      .set(CATALOG_IMPLEMENTATION.key, "hive")
    +      .set(HiveUtils.HIVE_METASTORE_VERSION.key, "2.1")
    +      .set(HiveUtils.HIVE_METASTORE_JARS.key, "maven")
    +
    +    val hadoopConf = new Configuration()
    +    hadoopConf.set("hive.metastore.warehouse.dir", warehouse.toURI().toString())
    +    hadoopConf.set("javax.jdo.option.ConnectionURL",
    +      s"jdbc:derby:;databaseName=${metastore.getAbsolutePath()};create=true")
    +    // These options are needed since the defaults in Hive 2.1 cause exceptions with an
    +    // empty metastore db.
    +    hadoopConf.set("datanucleus.schema.autoCreateAll", "true")
    +    hadoopConf.set("hive.metastore.schema.verification", "false")
    +
    +    new HiveExternalCatalog(sparkConf, hadoopConf)
    +  }
    +
    +  override def afterEach: Unit = {
    +    catalog.listTables("default").foreach { t =>
    +      catalog.dropTable("default", t, true, false)
    +    }
    +    spark.sessionState.catalog.reset()
    +  }
    +
    +  override def afterAll(): Unit = {
    +    catalog = null
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE for non-compatible DataSource tables") {
    +    testAlterTable(
    +      "t1",
    +      "CREATE TABLE t1 (c1 int) USING json",
    +      StructType(Array(StructField("c1", IntegerType), StructField("c2", IntegerType))),
    +      hiveCompatible = false)
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE for Hive-compatible DataSource tables") {
    +    testAlterTable(
    +      "t1",
    +      "CREATE TABLE t1 (c1 int) USING parquet",
    +      StructType(Array(StructField("c1", IntegerType), StructField("c2", IntegerType))))
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE for Hive tables") {
    +    testAlterTable(
    +      "t1",
    +      "CREATE TABLE t1 (c1 int) STORED AS parquet",
    +      StructType(Array(StructField("c1", IntegerType), StructField("c2", IntegerType))))
    +  }
    +
    +  test("SPARK-21617: ALTER TABLE with incompatible schema on Hive-compatible table") {
    +    val exception = intercept[AnalysisException] {
    +      testAlterTable(
    +        "t1",
    +        "CREATE TABLE t1 (c1 string) USING parquet",
    +        StructType(Array(StructField("c2", IntegerType))))
    +    }
    +    assert(exception.getMessage().contains("types incompatible with the existing columns"))
    +  }
    +
    +  private def testAlterTable(
    +      tableName: String,
    +      createTableStmt: String,
    +      updatedSchema: StructType,
    +      hiveCompatible: Boolean = true): Unit = {
    +    spark.sql(createTableStmt)
    +    val oldTable = spark.sessionState.catalog.externalCatalog.getTable("default", tableName)
    +    catalog.createTable(oldTable, true)
    +    catalog.alterTableSchema("default", tableName, updatedSchema)
    +
    +    val updatedTable = catalog.getTable("default", tableName)
    +    assert(updatedTable.schema.fieldNames === updatedSchema.fieldNames)
    +
    +    val rawTable = catalog.getRawTable("default", tableName)
    +    val compatibility = rawTable.properties.get(HiveExternalCatalog.DATASOURCE_HIVE_COMPATIBLE)
    +      .map(_.toBoolean).getOrElse(true)
    +    assert(hiveCompatible === compatibility)
    --- End diff --
    
    OK, we can do it in a separate PR.




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80936/
    Test PASSed.




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    If the new flag `DATASOURCE_HIVE_COMPATIBLE` is set to `true` when creating a table, are we sure it can stay `true` forever? Is it a reliable flag we can trust? Is it possible that the `ALTER TABLE` commands issued by previous/current/future versions of Spark SQL might also change the Hive compatibility?




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    **[Test build #80463 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80463/testReport)** for PR 18849 at commit [`abd6cf1`](https://github.com/apache/spark/commit/abd6cf119c0b400fda9c96fcb6432b090c2505bf).




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80405/
    Test PASSed.




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r132530396
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -1175,6 +1205,27 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
         client.listFunctions(db, pattern)
       }
     
    +  /** Detect whether a table is stored with Hive-compatible metadata. */
    +  private def isHiveCompatible(table: CatalogTable): Boolean = {
    +    val provider = table.provider.orElse(table.properties.get(DATASOURCE_PROVIDER))
    +    if (provider.isDefined && provider != Some(DDLUtils.HIVE_PROVIDER)) {
    +      table.properties.get(DATASOURCE_HIVE_COMPATIBLE) match {
    +        case Some(value) =>
    +          value.toBoolean
    +        case _ =>
    +          // If the property is not set, the table may have been created by an old version
    +          // of Spark. Detect Hive compatibility by comparing the table's serde with the
    +          // serde for the table's data source. If they match, the table is Hive-compatible.
    +          // If they don't, they're not, because of some other table property that made it
    +          // not initially Hive-compatible.
    +          HiveSerDe.sourceToSerDe(provider.get) == table.storage.serde
    --- End diff --
    
    cc @cloud-fan 




[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    **[Test build #80414 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80414/testReport)** for PR 18849 at commit [`0b27209`](https://github.com/apache/spark/commit/0b272094c8a490e16066e85fc48ad59c4ccf0468).




[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r131730515
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -611,21 +640,39 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       override def alterTableSchema(db: String, table: String, schema: StructType): Unit = withClient {
         requireTableExists(db, table)
         val rawTable = getRawTable(db, table)
    -    val withNewSchema = rawTable.copy(schema = schema)
    -    verifyColumnNames(withNewSchema)
         // Add table metadata such as table schema, partition columns, etc. to table properties.
    -    val updatedTable = withNewSchema.copy(
    -      properties = withNewSchema.properties ++ tableMetaToTableProps(withNewSchema))
    -    try {
    -      client.alterTable(updatedTable)
    -    } catch {
    -      case NonFatal(e) =>
    -        val warningMessage =
    -          s"Could not alter schema of table  ${rawTable.identifier.quotedString} in a Hive " +
    -            "compatible way. Updating Hive metastore in Spark SQL specific format."
    -        logWarning(warningMessage, e)
    -        client.alterTable(updatedTable.copy(schema = updatedTable.partitionSchema))
    +    val updatedProperties = rawTable.properties ++ tableMetaToTableProps(rawTable, schema)
    +
    +    // Detect whether this is a Hive-compatible table.
    +    val provider = rawTable.properties.get(DATASOURCE_PROVIDER)
    +    val isHiveCompatible = if (provider.isDefined && provider != Some(DDLUtils.HIVE_PROVIDER)) {
    --- End diff --
    
    I changed the check to use the serde instead. The new tests pass even without the explicit check for `DATASOURCE_HIVE_COMPATIBLE` when doing that, although I prefer leaving the explicit property for clarity.
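
    Putting the diff above together, the alter path now effectively does the following (a simplified sketch; `alteredTable`, `toTableProps` and `isHiveCompatible` are stand-ins for the catalog's private helpers such as `tableMetaToTableProps` and the new compatibility check):

    ```Scala
    import org.apache.spark.sql.catalyst.catalog.CatalogTable
    import org.apache.spark.sql.types.StructType

    def alteredTable(
        rawTable: CatalogTable,
        newSchema: StructType,
        toTableProps: (CatalogTable, StructType) => Map[String, String],
        isHiveCompatible: CatalogTable => Boolean): CatalogTable = {
      val updatedProperties = rawTable.properties ++ toTableProps(rawTable, newSchema)
      if (isHiveCompatible(rawTable)) {
        // Hive-compatible: write the real schema through; a failure here should
        // surface as an error rather than trigger a silent fallback.
        rawTable.copy(schema = newSchema, properties = updatedProperties)
      } else {
        // Not compatible: only refresh the Spark-specific table properties and
        // leave the raw schema stored in Hive untouched.
        rawTable.copy(properties = updatedProperties)
      }
    }
    ```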


[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    **[Test build #80358 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80358/testReport)** for PR 18849 at commit [`40ebc96`](https://github.com/apache/spark/commit/40ebc966ff8e9ad742f2cdbf631a289b8388560a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r133135343
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -1175,6 +1205,27 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
         client.listFunctions(db, pattern)
       }
     
    +  /** Detect whether a table is stored with Hive-compatible metadata. */
    +  private def isHiveCompatible(table: CatalogTable): Boolean = {
    --- End diff --
    
    I think an easy way to check Hive compatibility is to compare the schema from the table properties with the schema in the metastore metadata: if they are equal ignoring case and nullability, the table is Hive-compatible.
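
    Roughly like this, as a standalone sketch (Spark already has `DataType.equalsIgnoreCaseAndNullability` for this; the helper below just re-implements the idea so the example stands alone):

    ```Scala
    import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructType}

    // Two schemas are considered equal when they match ignoring field-name case
    // and nullability flags.
    def sameIgnoringCaseAndNullability(a: DataType, b: DataType): Boolean = (a, b) match {
      case (x: StructType, y: StructType) =>
        x.fields.length == y.fields.length &&
          x.fields.zip(y.fields).forall { case (f1, f2) =>
            f1.name.equalsIgnoreCase(f2.name) &&
              sameIgnoringCaseAndNullability(f1.dataType, f2.dataType)
          }
      case (ArrayType(e1, _), ArrayType(e2, _)) =>
        sameIgnoringCaseAndNullability(e1, e2)
      case (MapType(k1, v1, _), MapType(k2, v2, _)) =>
        sameIgnoringCaseAndNullability(k1, k2) && sameIgnoringCaseAndNullability(v1, v2)
      case _ => a == b
    }
    ```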


[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    Alright, I give up. If you don't think it's important to maintain Hive compatibility once it's been set, and it's OK to create tables with completely messed-up metadata (from Hive's perspective) as long as they're data source tables, I'll do that. I'd rather fix the actual problem that happens when using Hive 2.x than keep up a long discussion about what it means to be compatible...


[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80270/
    Test PASSed.


[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    Merged build finished. Test PASSed.


[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r132060765
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
    @@ -908,7 +909,13 @@ private[hive] object HiveClientImpl {
         }
         // after SPARK-19279, it is not allowed to create a hive table with an empty schema,
         // so here we should not add a default col schema
    -    if (schema.isEmpty && DDLUtils.isDatasourceTable(table)) {
    +    //
    +    // Because HiveExternalCatalog sometimes writes back "raw" tables that have not been
    +    // completely translated to Spark's view, the provider information needs to be looked
    +    // up in two places.
    +    val provider = table.provider.orElse(
    --- End diff --
    
    If you look at the bug, there are two exceptions. One gets logged, the second is thrown and caused the test to fail in my 2.1-based branch.
    
    The exception happened because `alterTableSchema` is writing back the result of `getRawTable`. That raw table does not have the provider set; instead, it's in the table's properties. This check looks at both places, so that other code that uses `getRawTable` can properly pass this check.
    
    As I explained in a previous comment, this doesn't happen anymore for `alterTableSchema` because of the other changes. But there's still code in the catalog class that writes back tables fetched with `getRawTable`, so this feels safer.
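
    In code, the "two places" lookup boils down to something like this (a sketch; `effectiveProvider` is a name made up for illustration, while `spark.sql.sources.provider` is the existing property key):

    ```Scala
    import org.apache.spark.sql.catalyst.catalog.CatalogTable

    // A restored CatalogTable carries the provider in its `provider` field, but a
    // table returned by getRawTable only has it in the table properties, so check both.
    def effectiveProvider(table: CatalogTable): Option[String] =
      table.provider.orElse(table.properties.get("spark.sql.sources.provider"))
    ```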


[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18849
  
    Thanks! Merging to master/2.2


[GitHub] spark pull request #18849: [SPARK-21617][SQL] Store correct table metadata w...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18849#discussion_r133899594
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -1175,6 +1205,27 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
         client.listFunctions(db, pattern)
       }
     
    +  /** Detect whether a table is stored with Hive-compatible metadata. */
    +  private def isHiveCompatible(table: CatalogTable): Boolean = {
    --- End diff --
    
    Why doesn't the following logic work? Could you share an example we can try?
    ```Scala
          val schemaFromTableProps = getSchemaFromTableProperties(table)
          val partColumnNames = getPartitionColumnsFromTableProperties(table)
          val reorderedSchema = reorderSchema(schema = schemaFromTableProps, partColumnNames)
          val isHiveCompatible  = DataType.equalsIgnoreCaseAndNullability(reorderedSchema, table.schema)
    ```

