Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2018/09/21 06:25:07 UTC

[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/22514

    [SPARK-25271][SQL] Hive ctas commands should use data source if it is convertible

    ## What changes were proposed in this pull request?
    
    We have a [regression](https://github.com/apache/spark/pull/20521/files#r217254430) since 2.3.1: Hive CTAS commands now only use the Hive SerDe to write data. Previously, a Hive CTAS command would use the Parquet/ORC data source to write the data if the table was convertible.
    
    Because of this, the regression reported by this JIRA appears when writing an empty map into Hive using CTAS: the write hits a known Hive issue and throws an exception.
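    A minimal repro sketch of the reported failure (the table name is made up, a Hive-enabled `SparkSession` is assumed to be in scope as `spark`, and the empty `map()` literal is assumed to infer a concrete key/value type as it does in this era of Spark):
    
        // Since 2.3.1 this CTAS always goes through the Hive Parquet SerDe, which is
        // known to choke on empty maps; on the data source write path it succeeds.
        spark.sql(
          """CREATE TABLE spark_25271_repro STORED AS PARQUET
            |AS SELECT 1 AS id, map() AS m""".stripMargin)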
    
    ## How was this patch tested?
    
    Added test.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-25271-2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22514.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22514
    
----
commit 5debc6096ae6e505d3386fd7eb569d154f158d55
Author: Liang-Chi Hsieh <vi...@...>
Date:   2018-09-12T10:33:53Z

    Hive ctas commands should use data source format if it is convertible.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #99637 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99637/testReport)** for PR 22514 at commit [`e04812d`](https://github.com/apache/spark/commit/e04812d6aff0270ab4c0e6c3b2e204a682739e23).


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5494/
    Test PASSed.


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96784/
    Test FAILed.


---



[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r238902415
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
    @@ -181,62 +180,39 @@ case class RelationConversions(
         conf: SQLConf,
         sessionCatalog: HiveSessionCatalog) extends Rule[LogicalPlan] {
       private def isConvertible(relation: HiveTableRelation): Boolean = {
    -    val serde = relation.tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    -    serde.contains("parquet") && conf.getConf(HiveUtils.CONVERT_METASTORE_PARQUET) ||
    -      serde.contains("orc") && conf.getConf(HiveUtils.CONVERT_METASTORE_ORC)
    +    isConvertible(relation.tableMeta)
       }
     
    -  // Return true for Apache ORC and Hive ORC-related configuration names.
    -  // Note that Spark doesn't support configurations like `hive.merge.orcfile.stripe.level`.
    -  private def isOrcProperty(key: String) =
    -    key.startsWith("orc.") || key.contains(".orc.")
    -
    -  private def isParquetProperty(key: String) =
    -    key.startsWith("parquet.") || key.contains(".parquet.")
    -
    -  private def convert(relation: HiveTableRelation): LogicalRelation = {
    -    val serde = relation.tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    -
    -    // Consider table and storage properties. For properties existing in both sides, storage
    -    // properties will supersede table properties.
    -    if (serde.contains("parquet")) {
    -      val options = relation.tableMeta.properties.filterKeys(isParquetProperty) ++
    -        relation.tableMeta.storage.properties + (ParquetOptions.MERGE_SCHEMA ->
    -        conf.getConf(HiveUtils.CONVERT_METASTORE_PARQUET_WITH_SCHEMA_MERGING).toString)
    -      sessionCatalog.metastoreCatalog
    -        .convertToLogicalRelation(relation, options, classOf[ParquetFileFormat], "parquet")
    -    } else {
    -      val options = relation.tableMeta.properties.filterKeys(isOrcProperty) ++
    -        relation.tableMeta.storage.properties
    -      if (conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native") {
    -        sessionCatalog.metastoreCatalog.convertToLogicalRelation(
    -          relation,
    -          options,
    -          classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat],
    -          "orc")
    -      } else {
    -        sessionCatalog.metastoreCatalog.convertToLogicalRelation(
    -          relation,
    -          options,
    -          classOf[org.apache.spark.sql.hive.orc.OrcFileFormat],
    -          "orc")
    -      }
    -    }
    +  private def isConvertible(tableMeta: CatalogTable): Boolean = {
    +    val serde = tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    +    serde.contains("parquet") && SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_PARQUET) ||
    +      serde.contains("orc") && SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_ORC)
       }
     
    +  private val metastoreCatalog = sessionCatalog.metastoreCatalog
    +
       override def apply(plan: LogicalPlan): LogicalPlan = {
         plan resolveOperators {
           // Write path
           case InsertIntoTable(r: HiveTableRelation, partition, query, overwrite, ifPartitionNotExists)
             // Inserting into partitioned table is not supported in Parquet/Orc data source (yet).
               if query.resolved && DDLUtils.isHiveTable(r.tableMeta) &&
                 !r.isPartitioned && isConvertible(r) =>
    -        InsertIntoTable(convert(r), partition, query, overwrite, ifPartitionNotExists)
    +        InsertIntoTable(metastoreCatalog.convert(r), partition,
    +          query, overwrite, ifPartitionNotExists)
     
           // Read path
           case relation: HiveTableRelation
               if DDLUtils.isHiveTable(relation.tableMeta) && isConvertible(relation) =>
    -        convert(relation)
    +        metastoreCatalog.convert(relation)
    +
    +      // CTAS
    +      case CreateTable(tableDesc, mode, Some(query))
    +          if DDLUtils.isHiveTable(tableDesc) && tableDesc.partitionColumnNames.isEmpty &&
    +            isConvertible(tableDesc) =>
    --- End diff --
    
    hmm, the optimization is already controlled by configs like `HiveUtils.CONVERT_METASTORE_ORC` and `HiveUtils.CONVERT_METASTORE_PARQUET`. Do we need another config for it?
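    For reference, these are the public keys of the two existing flags mentioned above; a minimal sketch of how a user toggles them:
    
        // Disabling either flag keeps reads/writes of that format on the Hive SerDe
        // path instead of Spark's built-in Parquet/ORC data sources.
        spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
        spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")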


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    retest this please.


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Oh, PySpark UT failures look weird.


---



[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r237756394
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -95,9 +77,98 @@ case class CreateHiveTableAsSelectCommand(
         Seq.empty[Row]
       }
     
    +  def getDataWritingCommand(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable,
    +    tableExists: Boolean): DataWritingCommand
    +
       override def argString: String = {
         s"[Database:${tableDesc.database}, " +
         s"TableName: ${tableDesc.identifier.table}, " +
         s"InsertIntoHiveTable]"
       }
     }
    +
    +/**
    + * Create table and insert the query result into it.
    + *
    + * @param tableDesc the Table Describe, which may contain serde, storage handler etc.
    + * @param query the query whose result will be insert into the new relation
    + * @param mode SaveMode
    + */
    +case class CreateHiveTableAsSelectCommand(
    +    tableDesc: CatalogTable,
    +    query: LogicalPlan,
    +    outputColumnNames: Seq[String],
    +    mode: SaveMode)
    +  extends CreateHiveTableAsSelectBase {
    +
    +  override def getDataWritingCommand(
    +      catalog: SessionCatalog,
    +      tableDesc: CatalogTable,
    +      tableExists: Boolean): DataWritingCommand = {
    +    if (tableExists) {
    +      InsertIntoHiveTable(
    +        tableDesc,
    +        Map.empty,
    +        query,
    +        overwrite = false,
    +        ifPartitionNotExists = false,
    +        outputColumnNames = outputColumnNames)
    +    } else {
    +      // For CTAS, there is no static partition values to insert.
    +      val partition = tableDesc.partitionColumnNames.map(_ -> None).toMap
    +      InsertIntoHiveTable(
    +        tableDesc,
    +        partition,
    +        query,
    +        overwrite = true,
    +        ifPartitionNotExists = false,
    +        outputColumnNames = outputColumnNames)
    +    }
    +  }
    +}
    +
    +/**
    + * Create table and insert the query result into it. This creates Hive table but inserts
    + * the query result into it by using data source.
    + *
    + * @param tableDesc the Table Describe, which may contain serde, storage handler etc.
    --- End diff --
    
    ditto


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3460/
    Test PASSed.


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r238899698
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
    @@ -181,62 +180,39 @@ case class RelationConversions(
         conf: SQLConf,
         sessionCatalog: HiveSessionCatalog) extends Rule[LogicalPlan] {
       private def isConvertible(relation: HiveTableRelation): Boolean = {
    -    val serde = relation.tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    -    serde.contains("parquet") && conf.getConf(HiveUtils.CONVERT_METASTORE_PARQUET) ||
    -      serde.contains("orc") && conf.getConf(HiveUtils.CONVERT_METASTORE_ORC)
    +    isConvertible(relation.tableMeta)
       }
     
    -  // Return true for Apache ORC and Hive ORC-related configuration names.
    -  // Note that Spark doesn't support configurations like `hive.merge.orcfile.stripe.level`.
    -  private def isOrcProperty(key: String) =
    -    key.startsWith("orc.") || key.contains(".orc.")
    -
    -  private def isParquetProperty(key: String) =
    -    key.startsWith("parquet.") || key.contains(".parquet.")
    -
    -  private def convert(relation: HiveTableRelation): LogicalRelation = {
    -    val serde = relation.tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    -
    -    // Consider table and storage properties. For properties existing in both sides, storage
    -    // properties will supersede table properties.
    -    if (serde.contains("parquet")) {
    -      val options = relation.tableMeta.properties.filterKeys(isParquetProperty) ++
    -        relation.tableMeta.storage.properties + (ParquetOptions.MERGE_SCHEMA ->
    -        conf.getConf(HiveUtils.CONVERT_METASTORE_PARQUET_WITH_SCHEMA_MERGING).toString)
    -      sessionCatalog.metastoreCatalog
    -        .convertToLogicalRelation(relation, options, classOf[ParquetFileFormat], "parquet")
    -    } else {
    -      val options = relation.tableMeta.properties.filterKeys(isOrcProperty) ++
    -        relation.tableMeta.storage.properties
    -      if (conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native") {
    -        sessionCatalog.metastoreCatalog.convertToLogicalRelation(
    -          relation,
    -          options,
    -          classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat],
    -          "orc")
    -      } else {
    -        sessionCatalog.metastoreCatalog.convertToLogicalRelation(
    -          relation,
    -          options,
    -          classOf[org.apache.spark.sql.hive.orc.OrcFileFormat],
    -          "orc")
    -      }
    -    }
    +  private def isConvertible(tableMeta: CatalogTable): Boolean = {
    +    val serde = tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    +    serde.contains("parquet") && SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_PARQUET) ||
    +      serde.contains("orc") && SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_ORC)
       }
     
    +  private val metastoreCatalog = sessionCatalog.metastoreCatalog
    +
       override def apply(plan: LogicalPlan): LogicalPlan = {
         plan resolveOperators {
           // Write path
           case InsertIntoTable(r: HiveTableRelation, partition, query, overwrite, ifPartitionNotExists)
             // Inserting into partitioned table is not supported in Parquet/Orc data source (yet).
               if query.resolved && DDLUtils.isHiveTable(r.tableMeta) &&
                 !r.isPartitioned && isConvertible(r) =>
    -        InsertIntoTable(convert(r), partition, query, overwrite, ifPartitionNotExists)
    +        InsertIntoTable(metastoreCatalog.convert(r), partition,
    +          query, overwrite, ifPartitionNotExists)
     
           // Read path
           case relation: HiveTableRelation
               if DDLUtils.isHiveTable(relation.tableMeta) && isConvertible(relation) =>
    -        convert(relation)
    +        metastoreCatalog.convert(relation)
    +
    +      // CTAS
    +      case CreateTable(tableDesc, mode, Some(query))
    +          if DDLUtils.isHiveTable(tableDesc) && tableDesc.partitionColumnNames.isEmpty &&
    +            isConvertible(tableDesc) =>
    --- End diff --
    
    It's not a new optimization... It's an optimization we dropped in 2.3 by mistake.
    
    I'm fine with adding a config with a default value of true.
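    A minimal sketch of what such a flag could look like in `HiveUtils`, following the pattern of the existing `CONVERT_METASTORE_*` entries (the name and doc text are illustrative, not a final definition):
    
        // Illustrative only: an on-by-default switch for the CTAS conversion,
        // declared with the same ConfigBuilder DSL as the existing entries.
        val CONVERT_METASTORE_CTAS = buildConf("spark.sql.hive.convertMetastoreCtas")
          .doc("When true, write the output of CTAS on a Hive serde table with the " +
            "built-in Parquet/ORC writer, subject to convertMetastoreParquet/Orc.")
          .booleanConf
          .createWithDefault(true)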


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    @cloud-fan I've updated the PR description. Thanks.


---



[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r239733875
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -95,9 +77,116 @@ case class CreateHiveTableAsSelectCommand(
         Seq.empty[Row]
       }
     
    +  // Returns `DataWritingCommand` used to write data when the table exists.
    +  def writingCommandForExistingTable(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable): DataWritingCommand
    +
    +  // Returns `DataWritingCommand` used to write data when the table doesn't exist.
    +  def writingCommandForNewTable(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable): DataWritingCommand
    +
       override def argString: String = {
         s"[Database:${tableDesc.database}, " +
         s"TableName: ${tableDesc.identifier.table}, " +
         s"InsertIntoHiveTable]"
       }
     }
    +
    +/**
    + * Create table and insert the query result into it.
    + *
    + * @param tableDesc the table description, which may contain serde, storage handler etc.
    + * @param query the query whose result will be insert into the new relation
    + * @param mode SaveMode
    + */
    +case class CreateHiveTableAsSelectCommand(
    +    tableDesc: CatalogTable,
    +    query: LogicalPlan,
    +    outputColumnNames: Seq[String],
    +    mode: SaveMode)
    +  extends CreateHiveTableAsSelectBase {
    +
    +  override def writingCommandForExistingTable(
    +      catalog: SessionCatalog,
    +      tableDesc: CatalogTable): DataWritingCommand = {
    +    InsertIntoHiveTable(
    +      tableDesc,
    +      Map.empty,
    --- End diff --
    
    Or open a new PR to allow CTAS for partitioned Hive tables first?


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99378/
    Test FAILed.


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    retest this please.


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    This requires more careful review. I do not think we can make it into RC5.


---



[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r238949362
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
    @@ -181,62 +180,39 @@ case class RelationConversions(
         conf: SQLConf,
         sessionCatalog: HiveSessionCatalog) extends Rule[LogicalPlan] {
       private def isConvertible(relation: HiveTableRelation): Boolean = {
    -    val serde = relation.tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    -    serde.contains("parquet") && conf.getConf(HiveUtils.CONVERT_METASTORE_PARQUET) ||
    -      serde.contains("orc") && conf.getConf(HiveUtils.CONVERT_METASTORE_ORC)
    +    isConvertible(relation.tableMeta)
       }
     
    -  // Return true for Apache ORC and Hive ORC-related configuration names.
    -  // Note that Spark doesn't support configurations like `hive.merge.orcfile.stripe.level`.
    -  private def isOrcProperty(key: String) =
    -    key.startsWith("orc.") || key.contains(".orc.")
    -
    -  private def isParquetProperty(key: String) =
    -    key.startsWith("parquet.") || key.contains(".parquet.")
    -
    -  private def convert(relation: HiveTableRelation): LogicalRelation = {
    -    val serde = relation.tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    -
    -    // Consider table and storage properties. For properties existing in both sides, storage
    -    // properties will supersede table properties.
    -    if (serde.contains("parquet")) {
    -      val options = relation.tableMeta.properties.filterKeys(isParquetProperty) ++
    -        relation.tableMeta.storage.properties + (ParquetOptions.MERGE_SCHEMA ->
    -        conf.getConf(HiveUtils.CONVERT_METASTORE_PARQUET_WITH_SCHEMA_MERGING).toString)
    -      sessionCatalog.metastoreCatalog
    -        .convertToLogicalRelation(relation, options, classOf[ParquetFileFormat], "parquet")
    -    } else {
    -      val options = relation.tableMeta.properties.filterKeys(isOrcProperty) ++
    -        relation.tableMeta.storage.properties
    -      if (conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native") {
    -        sessionCatalog.metastoreCatalog.convertToLogicalRelation(
    -          relation,
    -          options,
    -          classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat],
    -          "orc")
    -      } else {
    -        sessionCatalog.metastoreCatalog.convertToLogicalRelation(
    -          relation,
    -          options,
    -          classOf[org.apache.spark.sql.hive.orc.OrcFileFormat],
    -          "orc")
    -      }
    -    }
    +  private def isConvertible(tableMeta: CatalogTable): Boolean = {
    +    val serde = tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    +    serde.contains("parquet") && SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_PARQUET) ||
    +      serde.contains("orc") && SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_ORC)
       }
     
    +  private val metastoreCatalog = sessionCatalog.metastoreCatalog
    +
       override def apply(plan: LogicalPlan): LogicalPlan = {
         plan resolveOperators {
           // Write path
           case InsertIntoTable(r: HiveTableRelation, partition, query, overwrite, ifPartitionNotExists)
             // Inserting into partitioned table is not supported in Parquet/Orc data source (yet).
               if query.resolved && DDLUtils.isHiveTable(r.tableMeta) &&
                 !r.isPartitioned && isConvertible(r) =>
    -        InsertIntoTable(convert(r), partition, query, overwrite, ifPartitionNotExists)
    +        InsertIntoTable(metastoreCatalog.convert(r), partition,
    +          query, overwrite, ifPartitionNotExists)
     
           // Read path
           case relation: HiveTableRelation
               if DDLUtils.isHiveTable(relation.tableMeta) && isConvertible(relation) =>
    -        convert(relation)
    +        metastoreCatalog.convert(relation)
    +
    +      // CTAS
    +      case CreateTable(tableDesc, mode, Some(query))
    +          if DDLUtils.isHiveTable(tableDesc) && tableDesc.partitionColumnNames.isEmpty &&
    +            isConvertible(tableDesc) =>
    --- End diff --
    
    We usually don't write a migration guide for perf optimizations. Otherwise it's annoying to write one for each optimization and ask users to turn it off if something goes wrong. I think we only do that when there are known issues.


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    retest this please.


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5568/
    Test PASSed.


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #99415 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99415/testReport)** for PR 22514 at commit [`e42a846`](https://github.com/apache/spark/commit/e42a846f625f2119c6539bf91bf337f0497402ae).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    can we try a query and see what the SQL UI looks like?


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #99370 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99370/testReport)** for PR 22514 at commit [`6e2a31b`](https://github.com/apache/spark/commit/6e2a31b0db71ab3dcc8a81cb5ca585f784793736).


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99501/
    Test PASSed.


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99665/
    Test FAILed.


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #99415 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99415/testReport)** for PR 22514 at commit [`e42a846`](https://github.com/apache/spark/commit/e42a846f625f2119c6539bf91bf337f0497402ae).


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    This PR resolves a regression that throws `java.lang.RuntimeException` for Parquet tables. I'm wondering if we can consider it for 2.4.0 RC5.


---



[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r228776103
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
    @@ -2648,7 +2648,7 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
               "transform_values(" +
                 "z,(k, v) -> map_from_arrays(ARRAY(1, 2, 3), " +
                 "ARRAY('one', 'two', 'three'))[k] || '_' || CAST(v AS String))"),
    -        Seq(Row(Map(1 -> "one_1.0", 2 -> "two_1.4", 3 ->"three_1.7"))))
    +        Seq(Row(Map(1 -> "one_1.0", 2 -> "two_1.4", 3 -> "three_1.7"))))
    --- End diff --
    
    OK. I will put those changes into a minor PR instead.


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5447/
    Test PASSed.


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r228096896
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala ---
    @@ -92,4 +92,18 @@ class HiveParquetSuite extends QueryTest with ParquetTest with TestHiveSingleton
           }
         }
       }
    +
    +  test("SPARK-25271: write empty map into hive parquet table") {
    +    val testData = hiveContext.getHiveFile("data/files/empty_map.dat").getCanonicalFile()
    +    val sourceTable = "sourceTable"
    +    val targetTable = "targetTable"
    +    withTable(sourceTable, targetTable) {
    +      sql(s"CREATE TABLE $sourceTable (i int,m map<int, string>) ROW FORMAT DELIMITED FIELDS " +
    +        "TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' MAP KEYS TERMINATED BY '$'")
    +      sql(s"LOAD DATA LOCAL INPATH '${testData.toURI}' INTO TABLE $sourceTable")
    --- End diff --
    
    Ok.


---



[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r237052582
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -45,6 +46,11 @@ case class CreateHiveTableAsSelectCommand(
     
       override def run(sparkSession: SparkSession, child: SparkPlan): Seq[Row] = {
    --- End diff --
    
    I created a separate Hive CTAS command that writes with the data source.


---



[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r228817044
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -45,6 +46,11 @@ case class CreateHiveTableAsSelectCommand(
     
       override def run(sparkSession: SparkSession, child: SparkPlan): Seq[Row] = {
    --- End diff --
    
    I also thought about that, but then we would have two Hive CTAS commands. Is that OK with you?


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5719/
    Test PASSed.


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #99665 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99665/testReport)** for PR 22514 at commit [`3c07d74`](https://github.com/apache/spark/commit/3c07d74ec94cf85f9aad79cde6c42ea667cb2090).


---



[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r228772854
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
    @@ -2648,7 +2648,7 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
               "transform_values(" +
                 "z,(k, v) -> map_from_arrays(ARRAY(1, 2, 3), " +
                 "ARRAY('one', 'two', 'three'))[k] || '_' || CAST(v AS String))"),
    -        Seq(Row(Map(1 -> "one_1.0", 2 -> "two_1.4", 3 ->"three_1.7"))))
    +        Seq(Row(Map(1 -> "one_1.0", 2 -> "two_1.4", 3 -> "three_1.7"))))
    --- End diff --
    
    This one, too. Let's not touch this because we didn't change anything in this file. It would be helpful for backporting.


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #99501 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99501/testReport)** for PR 22514 at commit [`9629175`](https://github.com/apache/spark/commit/96291751c5a4992325f37bcb794ea5fd3f31593b).


---



[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r237753433
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -95,9 +77,98 @@ case class CreateHiveTableAsSelectCommand(
         Seq.empty[Row]
       }
     
    +  def getDataWritingCommand(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable,
    +    tableExists: Boolean): DataWritingCommand
    +
       override def argString: String = {
         s"[Database:${tableDesc.database}, " +
         s"TableName: ${tableDesc.identifier.table}, " +
         s"InsertIntoHiveTable]"
       }
     }
    +
    +/**
    + * Create table and insert the query result into it.
    + *
    + * @param tableDesc the Table Describe, which may contain serde, storage handler etc.
    + * @param query the query whose result will be insert into the new relation
    + * @param mode SaveMode
    + */
    +case class CreateHiveTableAsSelectCommand(
    +    tableDesc: CatalogTable,
    +    query: LogicalPlan,
    +    outputColumnNames: Seq[String],
    +    mode: SaveMode)
    +  extends CreateHiveTableAsSelectBase {
    +
    +  override def getDataWritingCommand(
    +      catalog: SessionCatalog,
    +      tableDesc: CatalogTable,
    +      tableExists: Boolean): DataWritingCommand = {
    +    if (tableExists) {
    +      InsertIntoHiveTable(
    +        tableDesc,
    +        Map.empty,
    +        query,
    +        overwrite = false,
    +        ifPartitionNotExists = false,
    +        outputColumnNames = outputColumnNames)
    +    } else {
    +      // For CTAS, there is no static partition values to insert.
    +      val partition = tableDesc.partitionColumnNames.map(_ -> None).toMap
    +      InsertIntoHiveTable(
    +        tableDesc,
    +        partition,
    +        query,
    +        overwrite = true,
    +        ifPartitionNotExists = false,
    +        outputColumnNames = outputColumnNames)
    +    }
    +  }
    +}
    +
    +/**
    + * Create table and insert the query result into it. This creates Hive table but inserts
    + * the query result into it by using data source.
    + *
    + * @param tableDesc the Table Describe, which may contain serde, storage handler etc.
    + * @param query the query whose result will be insert into the new relation
    + * @param mode SaveMode
    + */
    +case class CreateHiveTableAsSelectWithDataSourceCommand(
    --- End diff --
    
    `OptimizedCreateHiveTableAsSelectCommand`?


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r239300131
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -95,9 +77,116 @@ case class CreateHiveTableAsSelectCommand(
         Seq.empty[Row]
       }
     
    +  // Returns `DataWritingCommand` used to write data when the table exists.
    +  def writingCommandForExistingTable(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable): DataWritingCommand
    +
    +  // Returns `DataWritingCommand` used to write data when the table doesn't exist.
    +  def writingCommandForNewTable(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable): DataWritingCommand
    +
       override def argString: String = {
         s"[Database:${tableDesc.database}, " +
         s"TableName: ${tableDesc.identifier.table}, " +
         s"InsertIntoHiveTable]"
       }
     }
    +
    +/**
    + * Create table and insert the query result into it.
    + *
    + * @param tableDesc the table description, which may contain serde, storage handler etc.
    + * @param query the query whose result will be insert into the new relation
    + * @param mode SaveMode
    + */
    +case class CreateHiveTableAsSelectCommand(
    +    tableDesc: CatalogTable,
    +    query: LogicalPlan,
    +    outputColumnNames: Seq[String],
    +    mode: SaveMode)
    +  extends CreateHiveTableAsSelectBase {
    +
    +  override def writingCommandForExistingTable(
    +      catalog: SessionCatalog,
    +      tableDesc: CatalogTable): DataWritingCommand = {
    +    InsertIntoHiveTable(
    +      tableDesc,
    +      Map.empty,
    --- End diff --
    
    I think it is because we don't allow a Hive CTAS statement to create a partitioned table. There is a test for that in the Hive module's SQLQuerySuite.
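    A hedged illustration of that restriction (table name made up; the exact error text depends on the Spark version), which is why the CTAS branch never needs static partition values:
    
        // Expected to be rejected at parse/analysis time: a Hive-format CTAS may not
        // declare partition columns, so CreateHiveTableAsSelectCommand only ever
        // sees unpartitioned target tables.
        spark.sql(
          """CREATE TABLE ctas_part_demo PARTITIONED BY (p STRING) STORED AS PARQUET
            |AS SELECT 1 AS id, 'x' AS p""".stripMargin)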


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r237747152
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
    @@ -181,62 +180,39 @@ case class RelationConversions(
         conf: SQLConf,
         sessionCatalog: HiveSessionCatalog) extends Rule[LogicalPlan] {
       private def isConvertible(relation: HiveTableRelation): Boolean = {
    -    val serde = relation.tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    -    serde.contains("parquet") && conf.getConf(HiveUtils.CONVERT_METASTORE_PARQUET) ||
    -      serde.contains("orc") && conf.getConf(HiveUtils.CONVERT_METASTORE_ORC)
    +    isConvertible(relation.tableMeta)
       }
     
    -  // Return true for Apache ORC and Hive ORC-related configuration names.
    -  // Note that Spark doesn't support configurations like `hive.merge.orcfile.stripe.level`.
    -  private def isOrcProperty(key: String) =
    -    key.startsWith("orc.") || key.contains(".orc.")
    -
    -  private def isParquetProperty(key: String) =
    -    key.startsWith("parquet.") || key.contains(".parquet.")
    -
    -  private def convert(relation: HiveTableRelation): LogicalRelation = {
    -    val serde = relation.tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    -
    -    // Consider table and storage properties. For properties existing in both sides, storage
    -    // properties will supersede table properties.
    -    if (serde.contains("parquet")) {
    -      val options = relation.tableMeta.properties.filterKeys(isParquetProperty) ++
    -        relation.tableMeta.storage.properties + (ParquetOptions.MERGE_SCHEMA ->
    -        conf.getConf(HiveUtils.CONVERT_METASTORE_PARQUET_WITH_SCHEMA_MERGING).toString)
    -      sessionCatalog.metastoreCatalog
    -        .convertToLogicalRelation(relation, options, classOf[ParquetFileFormat], "parquet")
    -    } else {
    -      val options = relation.tableMeta.properties.filterKeys(isOrcProperty) ++
    -        relation.tableMeta.storage.properties
    -      if (conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native") {
    -        sessionCatalog.metastoreCatalog.convertToLogicalRelation(
    -          relation,
    -          options,
    -          classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat],
    -          "orc")
    -      } else {
    -        sessionCatalog.metastoreCatalog.convertToLogicalRelation(
    -          relation,
    -          options,
    -          classOf[org.apache.spark.sql.hive.orc.OrcFileFormat],
    -          "orc")
    -      }
    -    }
    +  private def isConvertible(tableMeta: CatalogTable): Boolean = {
    +    val serde = tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    +    serde.contains("parquet") && SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_PARQUET) ||
    +      serde.contains("orc") && SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_ORC)
       }
     
    +  private val metastoreCatalog = sessionCatalog.metastoreCatalog
    +
       override def apply(plan: LogicalPlan): LogicalPlan = {
         plan resolveOperators {
           // Write path
           case InsertIntoTable(r: HiveTableRelation, partition, query, overwrite, ifPartitionNotExists)
             // Inserting into partitioned table is not supported in Parquet/Orc data source (yet).
               if query.resolved && DDLUtils.isHiveTable(r.tableMeta) &&
                 !r.isPartitioned && isConvertible(r) =>
    -        InsertIntoTable(convert(r), partition, query, overwrite, ifPartitionNotExists)
    +        InsertIntoTable(metastoreCatalog.convert(r), partition,
    +          query, overwrite, ifPartitionNotExists)
     
           // Read path
           case relation: HiveTableRelation
               if DDLUtils.isHiveTable(relation.tableMeta) && isConvertible(relation) =>
    -        convert(relation)
    +        metastoreCatalog.convert(relation)
    +
    +      // CTAS
    +      case CreateTable(tableDesc, mode, Some(query))
    +          if DDLUtils.isHiveTable(tableDesc) && tableDesc.partitionColumnNames.isEmpty &&
    +            isConvertible(tableDesc) =>
    +        DDLUtils.checkDataColNames(tableDesc)
    --- End diff --
    
    HiveAnalysis does this too when transforming CreateTable into CreateHiveTableAsSelectCommand. checkDataColNames checks whether any invalid character is used in a column name.
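    A hedged example of what that check catches (table and column names are made up): aliases containing characters that Spark's Parquet support rejects in field names, such as ',', fail analysis up front, so the converted CTAS keeps the same guard as the plain Hive CTAS path:
    
        // Expected to fail analysis: the alias contains ',', which checkDataColNames
        // rejects for Parquet-backed tables before any data is written.
        spark.sql("CREATE TABLE bad_col_demo STORED AS PARQUET AS SELECT 1 AS `a,b`")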


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #99495 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99495/testReport)** for PR 22514 at commit [`9629175`](https://github.com/apache/spark/commit/96291751c5a4992325f37bcb794ea5fd3f31593b).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    <img width="531" alt="screen shot 2018-11-30 at 12 22 29 pm" src="https://user-images.githubusercontent.com/68855/49268483-aaa6d000-f49a-11e8-92c3-5ee78012fe9e.png">



---



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96614/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r227653464
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala ---
    @@ -92,4 +92,18 @@ class HiveParquetSuite extends QueryTest with ParquetTest with TestHiveSingleton
           }
         }
       }
    +
    +  test("SPARK-25271: write empty map into hive parquet table") {
    +    val testData = hiveContext.getHiveFile("data/files/empty_map.dat").getCanonicalFile()
    +    val sourceTable = "sourceTable"
    +    val targetTable = "targetTable"
    +    withTable(sourceTable, targetTable) {
    +      sql(s"CREATE TABLE $sourceTable (i int,m map<int, string>) ROW FORMAT DELIMITED FIELDS " +
    +        "TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' MAP KEYS TERMINATED BY '$'")
    +      sql(s"LOAD DATA LOCAL INPATH '${testData.toURI}' INTO TABLE $sourceTable")
    --- End diff --
    
    can we generate the input data with a temp view? e.g. create a DataFrame with literals and register a temp view.
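
    For illustration, a minimal sketch of that suggestion inside the test (assuming the suite's usual `spark`, `sql`, `withTable` and `checkAnswer` helpers; the table and view names are placeholders):

        import org.apache.spark.sql.Row
        import spark.implicits._

        withTable("targetTable") {
          // Build the empty-map input in memory instead of loading a data file.
          Seq((1, Map.empty[Int, String])).toDF("i", "m").createOrReplaceTempView("sourceView")
          sql("CREATE TABLE targetTable STORED AS PARQUET AS SELECT i, m FROM sourceView")
          checkAnswer(sql("SELECT * FROM targetTable"), Row(1, Map.empty[Int, String]))
        }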


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r237753623
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -95,9 +77,98 @@ case class CreateHiveTableAsSelectCommand(
         Seq.empty[Row]
       }
     
    +  def getDataWritingCommand(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable,
    +    tableExists: Boolean): DataWritingCommand
    +
       override def argString: String = {
         s"[Database:${tableDesc.database}, " +
         s"TableName: ${tableDesc.identifier.table}, " +
         s"InsertIntoHiveTable]"
       }
     }
    +
    +/**
    + * Create table and insert the query result into it.
    + *
    + * @param tableDesc the Table Describe, which may contain serde, storage handler etc.
    --- End diff --
    
    `table description`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #99378 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99378/testReport)** for PR 22514 at commit [`e42a846`](https://github.com/apache/spark/commit/e42a846f625f2119c6539bf91bf337f0497402ae).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r239264207
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -95,9 +77,116 @@ case class CreateHiveTableAsSelectCommand(
         Seq.empty[Row]
       }
     
    +  // Returns `DataWritingCommand` used to write data when the table exists.
    +  def writingCommandForExistingTable(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable): DataWritingCommand
    +
    +  // Returns `DataWritingCommand` used to write data when the table doesn't exist.
    +  def writingCommandForNewTable(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable): DataWritingCommand
    +
       override def argString: String = {
         s"[Database:${tableDesc.database}, " +
         s"TableName: ${tableDesc.identifier.table}, " +
         s"InsertIntoHiveTable]"
       }
     }
    +
    +/**
    + * Create table and insert the query result into it.
    + *
    + * @param tableDesc the table description, which may contain serde, storage handler etc.
    + * @param query the query whose result will be insert into the new relation
    + * @param mode SaveMode
    + */
    +case class CreateHiveTableAsSelectCommand(
    +    tableDesc: CatalogTable,
    +    query: LogicalPlan,
    +    outputColumnNames: Seq[String],
    +    mode: SaveMode)
    +  extends CreateHiveTableAsSelectBase {
    +
    +  override def writingCommandForExistingTable(
    +      catalog: SessionCatalog,
    +      tableDesc: CatalogTable): DataWritingCommand = {
    +    InsertIntoHiveTable(
    +      tableDesc,
    +      Map.empty,
    --- End diff --
    
    Why is this empty? Do we have a test case for partitioned tables?
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r239323943
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -95,9 +77,116 @@ case class CreateHiveTableAsSelectCommand(
         Seq.empty[Row]
       }
     
    +  // Returns `DataWritingCommand` used to write data when the table exists.
    +  def writingCommandForExistingTable(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable): DataWritingCommand
    +
    +  // Returns `DataWritingCommand` used to write data when the table doesn't exist.
    +  def writingCommandForNewTable(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable): DataWritingCommand
    +
       override def argString: String = {
         s"[Database:${tableDesc.database}, " +
         s"TableName: ${tableDesc.identifier.table}, " +
         s"InsertIntoHiveTable]"
       }
     }
    +
    +/**
    + * Create table and insert the query result into it.
    + *
    + * @param tableDesc the table description, which may contain serde, storage handler etc.
    + * @param query the query whose result will be insert into the new relation
    + * @param mode SaveMode
    + */
    +case class CreateHiveTableAsSelectCommand(
    +    tableDesc: CatalogTable,
    +    query: LogicalPlan,
    +    outputColumnNames: Seq[String],
    +    mode: SaveMode)
    +  extends CreateHiveTableAsSelectBase {
    +
    +  override def writingCommandForExistingTable(
    +      catalog: SessionCatalog,
    +      tableDesc: CatalogTable): DataWritingCommand = {
    +    InsertIntoHiveTable(
    +      tableDesc,
    +      Map.empty,
    --- End diff --
    
    https://github.com/apache/spark/blob/8534d753ecb21ea64ffbaefb5eaca38ba0464c6d/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala#L686-L697


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r238706120
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -95,9 +77,127 @@ case class CreateHiveTableAsSelectCommand(
         Seq.empty[Row]
       }
     
    +  // Returns `DataWritingCommand` used to write data when the table exists.
    +  def writingCommandForExistingTable(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable): DataWritingCommand
    +
    +  // Returns `DataWritingCommand` used to write data when the table doesn't exist.
    +  def writingCommandForNewTable(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable): DataWritingCommand
    +
       override def argString: String = {
         s"[Database:${tableDesc.database}, " +
         s"TableName: ${tableDesc.identifier.table}, " +
         s"InsertIntoHiveTable]"
       }
     }
    +
    +/**
    + * Create table and insert the query result into it.
    + *
    + * @param tableDesc the table description, which may contain serde, storage handler etc.
    + * @param query the query whose result will be insert into the new relation
    + * @param mode SaveMode
    + */
    +case class CreateHiveTableAsSelectCommand(
    +    tableDesc: CatalogTable,
    +    query: LogicalPlan,
    +    outputColumnNames: Seq[String],
    +    mode: SaveMode)
    +  extends CreateHiveTableAsSelectBase {
    +
    +  override def writingCommandForExistingTable(
    +      catalog: SessionCatalog,
    +      tableDesc: CatalogTable): DataWritingCommand = {
    +    InsertIntoHiveTable(
    +      tableDesc,
    +      Map.empty,
    +      query,
    +      overwrite = false,
    +      ifPartitionNotExists = false,
    +      outputColumnNames = outputColumnNames)
    +  }
    +
    +  override def writingCommandForNewTable(
    +      catalog: SessionCatalog,
    +      tableDesc: CatalogTable): DataWritingCommand = {
    +    // For CTAS, there is no static partition values to insert.
    +    val partition = tableDesc.partitionColumnNames.map(_ -> None).toMap
    +    InsertIntoHiveTable(
    +      tableDesc,
    +      partition,
    +      query,
    +      overwrite = true,
    +      ifPartitionNotExists = false,
    +      outputColumnNames = outputColumnNames)
    +  }
    +}
    +
    +/**
    + * Create table and insert the query result into it. This creates Hive table but inserts
    + * the query result into it by using data source.
    + *
    + * @param tableDesc the table description, which may contain serde, storage handler etc.
    + * @param query the query whose result will be insert into the new relation
    + * @param mode SaveMode
    + */
    +case class OptimizedCreateHiveTableAsSelectCommand(
    +    tableDesc: CatalogTable,
    +    query: LogicalPlan,
    +    outputColumnNames: Seq[String],
    +    mode: SaveMode)
    +  extends CreateHiveTableAsSelectBase {
    +
    +  private def getHadoopRelation(
    +      catalog: SessionCatalog,
    +      tableDesc: CatalogTable): HadoopFsRelation = {
    +    val metastoreCatalog = catalog.asInstanceOf[HiveSessionCatalog].metastoreCatalog
    +    val hiveTable = DDLUtils.readHiveTable(tableDesc)
    +
    +    metastoreCatalog.convert(hiveTable) match {
    +      case LogicalRelation(t: HadoopFsRelation, _, _, _) => t
    +      case _ => throw new AnalysisException(s"$tableIdentifier should be converted to " +
    +        "HadoopFsRelation.")
    +    }
    +  }
    +
    +  override def writingCommandForExistingTable(
    +      catalog: SessionCatalog,
    +      tableDesc: CatalogTable): DataWritingCommand = {
    +    val hadoopRelation = getHadoopRelation(catalog, tableDesc)
    +    InsertIntoHadoopFsRelationCommand(
    +      hadoopRelation.location.rootPaths.head,
    +      Map.empty, // We don't support to convert partitioned table.
    +      false,
    +      Seq.empty, // We don't support to convert partitioned table.
    +      hadoopRelation.bucketSpec,
    +      hadoopRelation.fileFormat,
    +      hadoopRelation.options,
    +      query,
    +      mode,
    +      Some(tableDesc),
    +      Some(hadoopRelation.location),
    +      query.output.map(_.name))
    +  }
    +
    +  override def writingCommandForNewTable(
    +      catalog: SessionCatalog,
    +      tableDesc: CatalogTable): DataWritingCommand = {
    +    val hadoopRelation = getHadoopRelation(catalog, tableDesc)
    +    InsertIntoHadoopFsRelationCommand(
    +      hadoopRelation.location.rootPaths.head,
    +      Map.empty, // We don't support to convert partitioned table.
    +      false,
    +      Seq.empty, // We don't support to convert partitioned table.
    +      hadoopRelation.bucketSpec,
    +      hadoopRelation.fileFormat,
    +      hadoopRelation.options,
    +      query,
    +      SaveMode.Overwrite,
    --- End diff --
    
    if the only difference is this `mode`, maybe we can further deduplicate the code. 
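
    For illustration, one possible shape of that deduplication, keeping the names from the diff above (a sketch, not the merged code):

        // A single helper builds the insert command; the two overrides only
        // differ in the SaveMode they pass.
        private def insertCommand(
            tableDesc: CatalogTable,
            hadoopRelation: HadoopFsRelation,
            insertMode: SaveMode): DataWritingCommand = {
          InsertIntoHadoopFsRelationCommand(
            hadoopRelation.location.rootPaths.head,
            Map.empty,   // converting partitioned tables is not supported
            false,
            Seq.empty,   // converting partitioned tables is not supported
            hadoopRelation.bucketSpec,
            hadoopRelation.fileFormat,
            hadoopRelation.options,
            query,
            insertMode,
            Some(tableDesc),
            Some(hadoopRelation.location),
            query.output.map(_.name))
        }

        override def writingCommandForExistingTable(
            catalog: SessionCatalog,
            tableDesc: CatalogTable): DataWritingCommand =
          insertCommand(tableDesc, getHadoopRelation(catalog, tableDesc), mode)

        override def writingCommandForNewTable(
            catalog: SessionCatalog,
            tableDesc: CatalogTable): DataWritingCommand =
          insertCommand(tableDesc, getHadoopRelation(catalog, tableDesc), SaveMode.Overwrite)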


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    cc @cloud-fan @HyukjinKwon 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99370/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Also, cc @gatorsmile .


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #96625 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96625/testReport)** for PR 22514 at commit [`5debc60`](https://github.com/apache/spark/commit/5debc6096ae6e505d3386fd7eb569d154f158d55).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r228825355
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -45,6 +46,11 @@ case class CreateHiveTableAsSelectCommand(
     
       override def run(sparkSession: SparkSession, child: SparkPlan): Seq[Row] = {
    --- End diff --
    
    I'm OK with that, since we do have 2 different ways to do Hive CTAS.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99415/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #97352 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97352/testReport)** for PR 22514 at commit [`1223178`](https://github.com/apache/spark/commit/122317891cb2794d67b5e11157a43afaa25edbba).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96625/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    > can we try a query and see what the SQL UI looks like?
    
    Yes. I will try a query and post the SQL UI.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #99398 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99398/testReport)** for PR 22514 at commit [`e42a846`](https://github.com/apache/spark/commit/e42a846f625f2119c6539bf91bf337f0497402ae).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r238871523
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
    @@ -181,62 +180,39 @@ case class RelationConversions(
         conf: SQLConf,
         sessionCatalog: HiveSessionCatalog) extends Rule[LogicalPlan] {
       private def isConvertible(relation: HiveTableRelation): Boolean = {
    -    val serde = relation.tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    -    serde.contains("parquet") && conf.getConf(HiveUtils.CONVERT_METASTORE_PARQUET) ||
    -      serde.contains("orc") && conf.getConf(HiveUtils.CONVERT_METASTORE_ORC)
    +    isConvertible(relation.tableMeta)
       }
     
    -  // Return true for Apache ORC and Hive ORC-related configuration names.
    -  // Note that Spark doesn't support configurations like `hive.merge.orcfile.stripe.level`.
    -  private def isOrcProperty(key: String) =
    -    key.startsWith("orc.") || key.contains(".orc.")
    -
    -  private def isParquetProperty(key: String) =
    -    key.startsWith("parquet.") || key.contains(".parquet.")
    -
    -  private def convert(relation: HiveTableRelation): LogicalRelation = {
    -    val serde = relation.tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    -
    -    // Consider table and storage properties. For properties existing in both sides, storage
    -    // properties will supersede table properties.
    -    if (serde.contains("parquet")) {
    -      val options = relation.tableMeta.properties.filterKeys(isParquetProperty) ++
    -        relation.tableMeta.storage.properties + (ParquetOptions.MERGE_SCHEMA ->
    -        conf.getConf(HiveUtils.CONVERT_METASTORE_PARQUET_WITH_SCHEMA_MERGING).toString)
    -      sessionCatalog.metastoreCatalog
    -        .convertToLogicalRelation(relation, options, classOf[ParquetFileFormat], "parquet")
    -    } else {
    -      val options = relation.tableMeta.properties.filterKeys(isOrcProperty) ++
    -        relation.tableMeta.storage.properties
    -      if (conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native") {
    -        sessionCatalog.metastoreCatalog.convertToLogicalRelation(
    -          relation,
    -          options,
    -          classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat],
    -          "orc")
    -      } else {
    -        sessionCatalog.metastoreCatalog.convertToLogicalRelation(
    -          relation,
    -          options,
    -          classOf[org.apache.spark.sql.hive.orc.OrcFileFormat],
    -          "orc")
    -      }
    -    }
    +  private def isConvertible(tableMeta: CatalogTable): Boolean = {
    +    val serde = tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    +    serde.contains("parquet") && SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_PARQUET) ||
    +      serde.contains("orc") && SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_ORC)
       }
     
    +  private val metastoreCatalog = sessionCatalog.metastoreCatalog
    +
       override def apply(plan: LogicalPlan): LogicalPlan = {
         plan resolveOperators {
           // Write path
           case InsertIntoTable(r: HiveTableRelation, partition, query, overwrite, ifPartitionNotExists)
             // Inserting into partitioned table is not supported in Parquet/Orc data source (yet).
               if query.resolved && DDLUtils.isHiveTable(r.tableMeta) &&
                 !r.isPartitioned && isConvertible(r) =>
    -        InsertIntoTable(convert(r), partition, query, overwrite, ifPartitionNotExists)
    +        InsertIntoTable(metastoreCatalog.convert(r), partition,
    +          query, overwrite, ifPartitionNotExists)
     
           // Read path
           case relation: HiveTableRelation
               if DDLUtils.isHiveTable(relation.tableMeta) && isConvertible(relation) =>
    -        convert(relation)
    +        metastoreCatalog.convert(relation)
    +
    +      // CTAS
    +      case CreateTable(tableDesc, mode, Some(query))
    +          if DDLUtils.isHiveTable(tableDesc) && tableDesc.partitionColumnNames.isEmpty &&
    +            isConvertible(tableDesc) =>
    --- End diff --
    
    Add an internal SQL conf here? The perf impact could be huge, for better or worse.
    
    Also add it to the migration guide and explain the behavior changes.
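
    For illustration, such a switch might sit next to the existing CONVERT_METASTORE_PARQUET/ORC flags in HiveUtils; the name, wording and default below are placeholders, not necessarily what ends up merged:

        val CONVERT_METASTORE_CTAS = buildConf("spark.sql.hive.convertMetastoreCtas")
          .doc("When true, try to write Hive CTAS on Parquet/ORC serde tables with " +
            "Spark's built-in data source writers instead of the Hive serde.")
          .internal()
          .booleanConf
          .createWithDefault(true)

        // The CTAS case in RelationConversions would then additionally check
        // SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_CTAS) in its guard.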


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #99423 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99423/testReport)** for PR 22514 at commit [`e42a846`](https://github.com/apache/spark/commit/e42a846f625f2119c6539bf91bf337f0497402ae).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99423/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #96792 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96792/testReport)** for PR 22514 at commit [`1223178`](https://github.com/apache/spark/commit/122317891cb2794d67b5e11157a43afaa25edbba).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #98010 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98010/testReport)** for PR 22514 at commit [`0b0a900`](https://github.com/apache/spark/commit/0b0a90059331a19f160de35d12634cad1b98f944).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r228772799
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
    @@ -21,7 +21,6 @@ import java.net.URI
     
     import org.apache.spark.sql._
     import org.apache.spark.sql.catalyst.catalog._
    -import org.apache.spark.sql.catalyst.expressions.Attribute
    --- End diff --
    
    If possible, let's not touch this because we didn't change anything in this file. It would be helpful for backporting.
    SPARK-25271 is reported as a regression in 2.3.x. I assume that we need to backport this for 2.4.1 and 2.3.3 at least.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r227654755
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -45,6 +46,11 @@ case class CreateHiveTableAsSelectCommand(
     
       override def run(sparkSession: SparkSession, child: SparkPlan): Seq[Row] = {
         val catalog = sparkSession.sessionState.catalog
    +    val metastoreCatalog = catalog.asInstanceOf[HiveSessionCatalog].metastoreCatalog
    +
    +    // Whether this table is convertible to data source relation.
    +    val isConvertible = metastoreCatalog.isConvertible(tableDesc)
    --- End diff --
    
    I feel `CreateHiveTableAsSelectCommand` is not useful. It simply creates the table first and then call `InsertIntoHiveTable.run`. Maybe we should just remove it and implement hive table CTAS as `Union(CreateTable, InsertIntoTable)`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #96792 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96792/testReport)** for PR 22514 at commit [`1223178`](https://github.com/apache/spark/commit/122317891cb2794d67b5e11157a43afaa25edbba).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Yea, let's see if the retest works well.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r239264673
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
    @@ -181,62 +180,39 @@ case class RelationConversions(
         conf: SQLConf,
         sessionCatalog: HiveSessionCatalog) extends Rule[LogicalPlan] {
       private def isConvertible(relation: HiveTableRelation): Boolean = {
    -    val serde = relation.tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    -    serde.contains("parquet") && conf.getConf(HiveUtils.CONVERT_METASTORE_PARQUET) ||
    -      serde.contains("orc") && conf.getConf(HiveUtils.CONVERT_METASTORE_ORC)
    +    isConvertible(relation.tableMeta)
       }
     
    -  // Return true for Apache ORC and Hive ORC-related configuration names.
    -  // Note that Spark doesn't support configurations like `hive.merge.orcfile.stripe.level`.
    -  private def isOrcProperty(key: String) =
    -    key.startsWith("orc.") || key.contains(".orc.")
    -
    -  private def isParquetProperty(key: String) =
    -    key.startsWith("parquet.") || key.contains(".parquet.")
    -
    -  private def convert(relation: HiveTableRelation): LogicalRelation = {
    -    val serde = relation.tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    -
    -    // Consider table and storage properties. For properties existing in both sides, storage
    -    // properties will supersede table properties.
    -    if (serde.contains("parquet")) {
    -      val options = relation.tableMeta.properties.filterKeys(isParquetProperty) ++
    -        relation.tableMeta.storage.properties + (ParquetOptions.MERGE_SCHEMA ->
    -        conf.getConf(HiveUtils.CONVERT_METASTORE_PARQUET_WITH_SCHEMA_MERGING).toString)
    -      sessionCatalog.metastoreCatalog
    -        .convertToLogicalRelation(relation, options, classOf[ParquetFileFormat], "parquet")
    -    } else {
    -      val options = relation.tableMeta.properties.filterKeys(isOrcProperty) ++
    -        relation.tableMeta.storage.properties
    -      if (conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native") {
    -        sessionCatalog.metastoreCatalog.convertToLogicalRelation(
    -          relation,
    -          options,
    -          classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat],
    -          "orc")
    -      } else {
    -        sessionCatalog.metastoreCatalog.convertToLogicalRelation(
    -          relation,
    -          options,
    -          classOf[org.apache.spark.sql.hive.orc.OrcFileFormat],
    -          "orc")
    -      }
    -    }
    +  private def isConvertible(tableMeta: CatalogTable): Boolean = {
    +    val serde = tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    +    serde.contains("parquet") && SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_PARQUET) ||
    +      serde.contains("orc") && SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_ORC)
       }
     
    +  private val metastoreCatalog = sessionCatalog.metastoreCatalog
    +
       override def apply(plan: LogicalPlan): LogicalPlan = {
         plan resolveOperators {
           // Write path
           case InsertIntoTable(r: HiveTableRelation, partition, query, overwrite, ifPartitionNotExists)
             // Inserting into partitioned table is not supported in Parquet/Orc data source (yet).
               if query.resolved && DDLUtils.isHiveTable(r.tableMeta) &&
                 !r.isPartitioned && isConvertible(r) =>
    -        InsertIntoTable(convert(r), partition, query, overwrite, ifPartitionNotExists)
    +        InsertIntoTable(metastoreCatalog.convert(r), partition,
    +          query, overwrite, ifPartitionNotExists)
     
           // Read path
           case relation: HiveTableRelation
               if DDLUtils.isHiveTable(relation.tableMeta) && isConvertible(relation) =>
    -        convert(relation)
    +        metastoreCatalog.convert(relation)
    +
    +      // CTAS
    +      case CreateTable(tableDesc, mode, Some(query))
    +          if DDLUtils.isHiveTable(tableDesc) && tableDesc.partitionColumnNames.isEmpty &&
    +            isConvertible(tableDesc) =>
    --- End diff --
    
    This is not only about perf optimization; it also switches to a different write path. Thus, different limits/bugs might be exposed after this code change. We should just make the community aware of this change.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99694/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r228780430
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -45,6 +46,11 @@ case class CreateHiveTableAsSelectCommand(
     
       override def run(sparkSession: SparkSession, child: SparkPlan): Seq[Row] = {
    --- End diff --
    
    Some more thoughts:
    
    `CreateHiveTableAsSelectCommand` just runs another command, so we will not get any metrics for this plan node. That's OK if we use the Hive writer, since we indeed can't get any metrics (the writing is done by Hive). However, if we can convert and use Spark's native writer, we do have metrics. I think a better fix is to replace Hive CTAS with data source CTAS during optimization.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    @cloud-fan The high-level idea is not to expose conversion details to `CreateTable`, but to let `CreateHiveTableAsSelectCommand` decide whether to do the conversion. So nothing changes in the `HiveAnalysis` rule: we still transform `CreateTable` to `CreateHiveTableAsSelectCommand` if it is a Hive table.
    
    In `CreateHiveTableAsSelectCommand`, it checks whether the relation is convertible. If yes, it performs the conversion and writes into the data source relation.
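
    For illustration, that decision point inside the command's run method looks roughly like this (a sketch based on the diff earlier in this thread, not the final patch):

        override def run(sparkSession: SparkSession, child: SparkPlan): Seq[Row] = {
          val catalog = sparkSession.sessionState.catalog
          val metastoreCatalog = catalog.asInstanceOf[HiveSessionCatalog].metastoreCatalog

          if (metastoreCatalog.isConvertible(tableDesc)) {
            // Convert the relation and write through Spark's native Parquet/ORC
            // data source path.
          } else {
            // Fall back to the Hive serde write path (InsertIntoHiveTable).
          }
          Seq.empty[Row]
        }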


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5487/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #99495 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99495/testReport)** for PR 22514 at commit [`9629175`](https://github.com/apache/spark/commit/96291751c5a4992325f37bcb794ea5fd3f31593b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #96614 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96614/testReport)** for PR 22514 at commit [`5debc60`](https://github.com/apache/spark/commit/5debc6096ae6e505d3386fd7eb569d154f158d55).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r228009289
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -45,6 +46,11 @@ case class CreateHiveTableAsSelectCommand(
     
       override def run(sparkSession: SparkSession, child: SparkPlan): Seq[Row] = {
         val catalog = sparkSession.sessionState.catalog
    +    val metastoreCatalog = catalog.asInstanceOf[HiveSessionCatalog].metastoreCatalog
    +
    +    // Whether this table is convertible to data source relation.
    +    val isConvertible = metastoreCatalog.isConvertible(tableDesc)
    --- End diff --
    
    Made a try on this idea.
    
    There is an issue: `convertToLogicalRelation` requires that the `HiveTableRelation` is an existing relation. That is fine for the `InsertIntoTable` case.
    
    For CTAS, this relation doesn't exist yet. Even if we use a `Union` and run `CreateTable` first, the conversion happens during the analysis stage, when the table has not been created.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r228816243
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -45,6 +46,11 @@ case class CreateHiveTableAsSelectCommand(
     
       override def run(sparkSession: SparkSession, child: SparkPlan): Seq[Row] = {
    --- End diff --
    
    Then how about we create a special Hive CTAS command that follows the data source CTAS command but creates a Hive table?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r228013784
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -45,6 +46,11 @@ case class CreateHiveTableAsSelectCommand(
     
       override def run(sparkSession: SparkSession, child: SparkPlan): Seq[Row] = {
         val catalog = sparkSession.sessionState.catalog
    +    val metastoreCatalog = catalog.asInstanceOf[HiveSessionCatalog].metastoreCatalog
    +
    +    // Whether this table is convertible to data source relation.
    +    val isConvertible = metastoreCatalog.isConvertible(tableDesc)
    --- End diff --
    
    ah makes sense, thanks for trying!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #97957 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97957/testReport)** for PR 22514 at commit [`5780a5e`](https://github.com/apache/spark/commit/5780a5ecaf671e4a7475cc7ac8fc345308368fcf).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r238933039
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
    @@ -181,62 +180,39 @@ case class RelationConversions(
         conf: SQLConf,
         sessionCatalog: HiveSessionCatalog) extends Rule[LogicalPlan] {
       private def isConvertible(relation: HiveTableRelation): Boolean = {
    -    val serde = relation.tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    -    serde.contains("parquet") && conf.getConf(HiveUtils.CONVERT_METASTORE_PARQUET) ||
    -      serde.contains("orc") && conf.getConf(HiveUtils.CONVERT_METASTORE_ORC)
    +    isConvertible(relation.tableMeta)
       }
     
    -  // Return true for Apache ORC and Hive ORC-related configuration names.
    -  // Note that Spark doesn't support configurations like `hive.merge.orcfile.stripe.level`.
    -  private def isOrcProperty(key: String) =
    -    key.startsWith("orc.") || key.contains(".orc.")
    -
    -  private def isParquetProperty(key: String) =
    -    key.startsWith("parquet.") || key.contains(".parquet.")
    -
    -  private def convert(relation: HiveTableRelation): LogicalRelation = {
    -    val serde = relation.tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    -
    -    // Consider table and storage properties. For properties existing in both sides, storage
    -    // properties will supersede table properties.
    -    if (serde.contains("parquet")) {
    -      val options = relation.tableMeta.properties.filterKeys(isParquetProperty) ++
    -        relation.tableMeta.storage.properties + (ParquetOptions.MERGE_SCHEMA ->
    -        conf.getConf(HiveUtils.CONVERT_METASTORE_PARQUET_WITH_SCHEMA_MERGING).toString)
    -      sessionCatalog.metastoreCatalog
    -        .convertToLogicalRelation(relation, options, classOf[ParquetFileFormat], "parquet")
    -    } else {
    -      val options = relation.tableMeta.properties.filterKeys(isOrcProperty) ++
    -        relation.tableMeta.storage.properties
    -      if (conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native") {
    -        sessionCatalog.metastoreCatalog.convertToLogicalRelation(
    -          relation,
    -          options,
    -          classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat],
    -          "orc")
    -      } else {
    -        sessionCatalog.metastoreCatalog.convertToLogicalRelation(
    -          relation,
    -          options,
    -          classOf[org.apache.spark.sql.hive.orc.OrcFileFormat],
    -          "orc")
    -      }
    -    }
    +  private def isConvertible(tableMeta: CatalogTable): Boolean = {
    +    val serde = tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    +    serde.contains("parquet") && SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_PARQUET) ||
    +      serde.contains("orc") && SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_ORC)
       }
     
    +  private val metastoreCatalog = sessionCatalog.metastoreCatalog
    +
       override def apply(plan: LogicalPlan): LogicalPlan = {
         plan resolveOperators {
           // Write path
           case InsertIntoTable(r: HiveTableRelation, partition, query, overwrite, ifPartitionNotExists)
             // Inserting into partitioned table is not supported in Parquet/Orc data source (yet).
               if query.resolved && DDLUtils.isHiveTable(r.tableMeta) &&
                 !r.isPartitioned && isConvertible(r) =>
    -        InsertIntoTable(convert(r), partition, query, overwrite, ifPartitionNotExists)
    +        InsertIntoTable(metastoreCatalog.convert(r), partition,
    +          query, overwrite, ifPartitionNotExists)
     
           // Read path
           case relation: HiveTableRelation
               if DDLUtils.isHiveTable(relation.tableMeta) && isConvertible(relation) =>
    -        convert(relation)
    +        metastoreCatalog.convert(relation)
    +
    +      // CTAS
    +      case CreateTable(tableDesc, mode, Some(query))
    +          if DDLUtils.isHiveTable(tableDesc) && tableDesc.partitionColumnNames.isEmpty &&
    +            isConvertible(tableDesc) =>
    --- End diff --
    
    Since the regression was already introduced, we need to add a conf and migration guide. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r239725957
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -95,9 +77,116 @@ case class CreateHiveTableAsSelectCommand(
         Seq.empty[Row]
       }
     
    +  // Returns `DataWritingCommand` used to write data when the table exists.
    +  def writingCommandForExistingTable(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable): DataWritingCommand
    +
    +  // Returns `DataWritingCommand` used to write data when the table doesn't exist.
    +  def writingCommandForNewTable(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable): DataWritingCommand
    +
       override def argString: String = {
         s"[Database:${tableDesc.database}, " +
         s"TableName: ${tableDesc.identifier.table}, " +
         s"InsertIntoHiveTable]"
       }
     }
    +
    +/**
    + * Create table and insert the query result into it.
    + *
    + * @param tableDesc the table description, which may contain serde, storage handler etc.
    + * @param query the query whose result will be insert into the new relation
    + * @param mode SaveMode
    + */
    +case class CreateHiveTableAsSelectCommand(
    +    tableDesc: CatalogTable,
    +    query: LogicalPlan,
    +    outputColumnNames: Seq[String],
    +    mode: SaveMode)
    +  extends CreateHiveTableAsSelectBase {
    +
    +  override def writingCommandForExistingTable(
    +      catalog: SessionCatalog,
    +      tableDesc: CatalogTable): DataWritingCommand = {
    +    InsertIntoHiveTable(
    +      tableDesc,
    +      Map.empty,
    --- End diff --
    
    Can we fix it in this PR?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3476/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97995/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #99369 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99369/testReport)** for PR 22514 at commit [`796e964`](https://github.com/apache/spark/commit/796e9640726c6f5ed4be7dd356245ff121f4dea6).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97352/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #96614 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96614/testReport)** for PR 22514 at commit [`5debc60`](https://github.com/apache/spark/commit/5debc6096ae6e505d3386fd7eb569d154f158d55).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #97995 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97995/testReport)** for PR 22514 at commit [`5780a5e`](https://github.com/apache/spark/commit/5780a5ecaf671e4a7475cc7ac8fc345308368fcf).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #99423 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99423/testReport)** for PR 22514 at commit [`e42a846`](https://github.com/apache/spark/commit/e42a846f625f2119c6539bf91bf337f0497402ae).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99495/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #97957 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97957/testReport)** for PR 22514 at commit [`5780a5e`](https://github.com/apache/spark/commit/5780a5ecaf671e4a7475cc7ac8fc345308368fcf).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3329/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5473/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99637/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5454/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    I see. Thank you for the confirmation, @gatorsmile and @cloud-fan.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #96625 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96625/testReport)** for PR 22514 at commit [`5debc60`](https://github.com/apache/spark/commit/5debc6096ae6e505d3386fd7eb569d154f158d55).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r238707304
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -95,9 +77,127 @@ case class CreateHiveTableAsSelectCommand(
         Seq.empty[Row]
       }
     
    +  // Returns `DataWritingCommand` used to write data when the table exists.
    +  def writingCommandForExistingTable(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable): DataWritingCommand
    +
    +  // Returns `DataWritingCommand` used to write data when the table doesn't exist.
    +  def writingCommandForNewTable(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable): DataWritingCommand
    +
       override def argString: String = {
         s"[Database:${tableDesc.database}, " +
         s"TableName: ${tableDesc.identifier.table}, " +
         s"InsertIntoHiveTable]"
       }
     }
    +
    +/**
    + * Create table and insert the query result into it.
    + *
    + * @param tableDesc the table description, which may contain serde, storage handler etc.
    + * @param query the query whose result will be insert into the new relation
    + * @param mode SaveMode
    + */
    +case class CreateHiveTableAsSelectCommand(
    +    tableDesc: CatalogTable,
    +    query: LogicalPlan,
    +    outputColumnNames: Seq[String],
    +    mode: SaveMode)
    +  extends CreateHiveTableAsSelectBase {
    +
    +  override def writingCommandForExistingTable(
    +      catalog: SessionCatalog,
    +      tableDesc: CatalogTable): DataWritingCommand = {
    +    InsertIntoHiveTable(
    +      tableDesc,
    +      Map.empty,
    +      query,
    +      overwrite = false,
    +      ifPartitionNotExists = false,
    +      outputColumnNames = outputColumnNames)
    +  }
    +
    +  override def writingCommandForNewTable(
    +      catalog: SessionCatalog,
    +      tableDesc: CatalogTable): DataWritingCommand = {
    +    // For CTAS, there is no static partition values to insert.
    +    val partition = tableDesc.partitionColumnNames.map(_ -> None).toMap
    +    InsertIntoHiveTable(
    +      tableDesc,
    +      partition,
    +      query,
    +      overwrite = true,
    +      ifPartitionNotExists = false,
    +      outputColumnNames = outputColumnNames)
    +  }
    +}
    +
    +/**
    + * Create table and insert the query result into it. This creates Hive table but inserts
    + * the query result into it by using data source.
    + *
    + * @param tableDesc the table description, which may contain serde, storage handler etc.
    + * @param query the query whose result will be insert into the new relation
    + * @param mode SaveMode
    + */
    +case class OptimizedCreateHiveTableAsSelectCommand(
    +    tableDesc: CatalogTable,
    +    query: LogicalPlan,
    +    outputColumnNames: Seq[String],
    +    mode: SaveMode)
    +  extends CreateHiveTableAsSelectBase {
    +
    +  private def getHadoopRelation(
    +      catalog: SessionCatalog,
    +      tableDesc: CatalogTable): HadoopFsRelation = {
    +    val metastoreCatalog = catalog.asInstanceOf[HiveSessionCatalog].metastoreCatalog
    +    val hiveTable = DDLUtils.readHiveTable(tableDesc)
    +
    +    metastoreCatalog.convert(hiveTable) match {
    +      case LogicalRelation(t: HadoopFsRelation, _, _, _) => t
    +      case _ => throw new AnalysisException(s"$tableIdentifier should be converted to " +
    +        "HadoopFsRelation.")
    +    }
    +  }
    +
    +  override def writingCommandForExistingTable(
    +      catalog: SessionCatalog,
    +      tableDesc: CatalogTable): DataWritingCommand = {
    +    val hadoopRelation = getHadoopRelation(catalog, tableDesc)
    +    InsertIntoHadoopFsRelationCommand(
    +      hadoopRelation.location.rootPaths.head,
    +      Map.empty, // We don't support to convert partitioned table.
    +      false,
    +      Seq.empty, // We don't support to convert partitioned table.
    +      hadoopRelation.bucketSpec,
    +      hadoopRelation.fileFormat,
    +      hadoopRelation.options,
    +      query,
    +      mode,
    +      Some(tableDesc),
    +      Some(hadoopRelation.location),
    +      query.output.map(_.name))
    +  }
    +
    +  override def writingCommandForNewTable(
    +      catalog: SessionCatalog,
    +      tableDesc: CatalogTable): DataWritingCommand = {
    +    val hadoopRelation = getHadoopRelation(catalog, tableDesc)
    +    InsertIntoHadoopFsRelationCommand(
    +      hadoopRelation.location.rootPaths.head,
    +      Map.empty, // We don't support to convert partitioned table.
    +      false,
    +      Seq.empty, // We don't support to convert partitioned table.
    +      hadoopRelation.bucketSpec,
    +      hadoopRelation.fileFormat,
    +      hadoopRelation.options,
    +      query,
    +      SaveMode.Overwrite,
    --- End diff --
    
    ok :)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3953/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r237361155
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
    @@ -181,62 +180,39 @@ case class RelationConversions(
         conf: SQLConf,
         sessionCatalog: HiveSessionCatalog) extends Rule[LogicalPlan] {
       private def isConvertible(relation: HiveTableRelation): Boolean = {
    -    val serde = relation.tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    -    serde.contains("parquet") && conf.getConf(HiveUtils.CONVERT_METASTORE_PARQUET) ||
    -      serde.contains("orc") && conf.getConf(HiveUtils.CONVERT_METASTORE_ORC)
    +    isConvertible(relation.tableMeta)
       }
     
    -  // Return true for Apache ORC and Hive ORC-related configuration names.
    -  // Note that Spark doesn't support configurations like `hive.merge.orcfile.stripe.level`.
    -  private def isOrcProperty(key: String) =
    -    key.startsWith("orc.") || key.contains(".orc.")
    -
    -  private def isParquetProperty(key: String) =
    -    key.startsWith("parquet.") || key.contains(".parquet.")
    -
    -  private def convert(relation: HiveTableRelation): LogicalRelation = {
    -    val serde = relation.tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    -
    -    // Consider table and storage properties. For properties existing in both sides, storage
    -    // properties will supersede table properties.
    -    if (serde.contains("parquet")) {
    -      val options = relation.tableMeta.properties.filterKeys(isParquetProperty) ++
    -        relation.tableMeta.storage.properties + (ParquetOptions.MERGE_SCHEMA ->
    -        conf.getConf(HiveUtils.CONVERT_METASTORE_PARQUET_WITH_SCHEMA_MERGING).toString)
    -      sessionCatalog.metastoreCatalog
    -        .convertToLogicalRelation(relation, options, classOf[ParquetFileFormat], "parquet")
    -    } else {
    -      val options = relation.tableMeta.properties.filterKeys(isOrcProperty) ++
    -        relation.tableMeta.storage.properties
    -      if (conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native") {
    -        sessionCatalog.metastoreCatalog.convertToLogicalRelation(
    -          relation,
    -          options,
    -          classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat],
    -          "orc")
    -      } else {
    -        sessionCatalog.metastoreCatalog.convertToLogicalRelation(
    -          relation,
    -          options,
    -          classOf[org.apache.spark.sql.hive.orc.OrcFileFormat],
    -          "orc")
    -      }
    -    }
    +  private def isConvertible(tableMeta: CatalogTable): Boolean = {
    +    val serde = tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    +    serde.contains("parquet") && SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_PARQUET) ||
    +      serde.contains("orc") && SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_ORC)
       }
     
    +  private val metastoreCatalog = sessionCatalog.metastoreCatalog
    +
       override def apply(plan: LogicalPlan): LogicalPlan = {
         plan resolveOperators {
           // Write path
           case InsertIntoTable(r: HiveTableRelation, partition, query, overwrite, ifPartitionNotExists)
             // Inserting into partitioned table is not supported in Parquet/Orc data source (yet).
               if query.resolved && DDLUtils.isHiveTable(r.tableMeta) &&
                 !r.isPartitioned && isConvertible(r) =>
    -        InsertIntoTable(convert(r), partition, query, overwrite, ifPartitionNotExists)
    +        InsertIntoTable(metastoreCatalog.convert(r), partition,
    +          query, overwrite, ifPartitionNotExists)
     
           // Read path
           case relation: HiveTableRelation
               if DDLUtils.isHiveTable(relation.tableMeta) && isConvertible(relation) =>
    -        convert(relation)
    +        metastoreCatalog.convert(relation)
    +
    +      // CTAS
    +      case CreateTable(tableDesc, mode, Some(query))
    +          if DDLUtils.isHiveTable(tableDesc) && tableDesc.partitionColumnNames.isEmpty &&
    +            isConvertible(tableDesc) =>
    +        DDLUtils.checkDataColNames(tableDesc)
    --- End diff --
    
    why do we need this?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #99378 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99378/testReport)** for PR 22514 at commit [`e42a846`](https://github.com/apache/spark/commit/e42a846f625f2119c6539bf91bf337f0497402ae).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5495/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #96401 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96401/testReport)** for PR 22514 at commit [`5debc60`](https://github.com/apache/spark/commit/5debc6096ae6e505d3386fd7eb569d154f158d55).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97957/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    retest this please...


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r237749421
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala ---
    @@ -92,4 +92,18 @@ class HiveParquetSuite extends QueryTest with ParquetTest with TestHiveSingleton
           }
         }
       }
    +
    +  test("SPARK-25271: write empty map into hive parquet table") {
    --- End diff --
    
    Added a new test for that.
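    
    The test body could look roughly like the sketch below (the data and assertion are illustrative, not necessarily the exact test added here; `sql`, `withTable`, `checkAnswer`, and `Row` come from the suite's standard test helpers):
    
        test("SPARK-25271: write empty map into hive parquet table") {
          withTable("t") {
            // CTAS into a Hive parquet table whose single row contains an empty map.
            // With the optimized CTAS path this goes through the built-in parquet
            // writer, avoiding the Hive serde's known problem with empty maps.
            sql("CREATE TABLE t STORED AS PARQUET AS SELECT map() AS m")
            checkAnswer(sql("SELECT m FROM t"), Row(Map.empty[String, String]))
          }
        }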


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r237363826
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala ---
    @@ -92,4 +92,18 @@ class HiveParquetSuite extends QueryTest with ParquetTest with TestHiveSingleton
           }
         }
       }
    +
    +  test("SPARK-25271: write empty map into hive parquet table") {
    --- End diff --
    
    +1


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3484/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r227602783
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala ---
    @@ -34,11 +34,16 @@ import org.apache.spark.sql.types._
      * @param tableDesc the metadata of the table to be created.
      * @param mode the data writing mode
      * @param query an optional logical plan representing data to write into the created table.
    + * @param useExternalSerde whether to use external serde to write data, e.g., Hive Serde. Currently
    --- End diff --
    
    I don't have a clear idea yet, but `CreateTable` is a general logical plan for CREATE TABLE; we may even make it public in the data source/catalog APIs in the future, so we should not put a Hive-specific concept here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r228813738
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -45,6 +46,11 @@ case class CreateHiveTableAsSelectCommand(
     
       override def run(sparkSession: SparkSession, child: SparkPlan): Seq[Row] = {
    --- End diff --
    
    I think the table metadata created by data source CTAS and by Hive CTAS is different?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r237364946
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala ---
    @@ -92,4 +92,18 @@ class HiveParquetSuite extends QueryTest with ParquetTest with TestHiveSingleton
           }
         }
       }
    +
    +  test("SPARK-25271: write empty map into hive parquet table") {
    --- End diff --
    
    Agreed. Now that we have two Hive CTAS commands, it is easier to test. I will add tests later.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96618/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r238909363
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
    @@ -181,62 +180,39 @@ case class RelationConversions(
         conf: SQLConf,
         sessionCatalog: HiveSessionCatalog) extends Rule[LogicalPlan] {
       private def isConvertible(relation: HiveTableRelation): Boolean = {
    -    val serde = relation.tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    -    serde.contains("parquet") && conf.getConf(HiveUtils.CONVERT_METASTORE_PARQUET) ||
    -      serde.contains("orc") && conf.getConf(HiveUtils.CONVERT_METASTORE_ORC)
    +    isConvertible(relation.tableMeta)
       }
     
    -  // Return true for Apache ORC and Hive ORC-related configuration names.
    -  // Note that Spark doesn't support configurations like `hive.merge.orcfile.stripe.level`.
    -  private def isOrcProperty(key: String) =
    -    key.startsWith("orc.") || key.contains(".orc.")
    -
    -  private def isParquetProperty(key: String) =
    -    key.startsWith("parquet.") || key.contains(".parquet.")
    -
    -  private def convert(relation: HiveTableRelation): LogicalRelation = {
    -    val serde = relation.tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    -
    -    // Consider table and storage properties. For properties existing in both sides, storage
    -    // properties will supersede table properties.
    -    if (serde.contains("parquet")) {
    -      val options = relation.tableMeta.properties.filterKeys(isParquetProperty) ++
    -        relation.tableMeta.storage.properties + (ParquetOptions.MERGE_SCHEMA ->
    -        conf.getConf(HiveUtils.CONVERT_METASTORE_PARQUET_WITH_SCHEMA_MERGING).toString)
    -      sessionCatalog.metastoreCatalog
    -        .convertToLogicalRelation(relation, options, classOf[ParquetFileFormat], "parquet")
    -    } else {
    -      val options = relation.tableMeta.properties.filterKeys(isOrcProperty) ++
    -        relation.tableMeta.storage.properties
    -      if (conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native") {
    -        sessionCatalog.metastoreCatalog.convertToLogicalRelation(
    -          relation,
    -          options,
    -          classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat],
    -          "orc")
    -      } else {
    -        sessionCatalog.metastoreCatalog.convertToLogicalRelation(
    -          relation,
    -          options,
    -          classOf[org.apache.spark.sql.hive.orc.OrcFileFormat],
    -          "orc")
    -      }
    -    }
    +  private def isConvertible(tableMeta: CatalogTable): Boolean = {
    +    val serde = tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    +    serde.contains("parquet") && SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_PARQUET) ||
    +      serde.contains("orc") && SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_ORC)
       }
     
    +  private val metastoreCatalog = sessionCatalog.metastoreCatalog
    +
       override def apply(plan: LogicalPlan): LogicalPlan = {
         plan resolveOperators {
           // Write path
           case InsertIntoTable(r: HiveTableRelation, partition, query, overwrite, ifPartitionNotExists)
             // Inserting into partitioned table is not supported in Parquet/Orc data source (yet).
               if query.resolved && DDLUtils.isHiveTable(r.tableMeta) &&
                 !r.isPartitioned && isConvertible(r) =>
    -        InsertIntoTable(convert(r), partition, query, overwrite, ifPartitionNotExists)
    +        InsertIntoTable(metastoreCatalog.convert(r), partition,
    +          query, overwrite, ifPartitionNotExists)
     
           // Read path
           case relation: HiveTableRelation
               if DDLUtils.isHiveTable(relation.tableMeta) && isConvertible(relation) =>
    -        convert(relation)
    +        metastoreCatalog.convert(relation)
    +
    +      // CTAS
    +      case CreateTable(tableDesc, mode, Some(query))
    +          if DDLUtils.isHiveTable(tableDesc) && tableDesc.partitionColumnNames.isEmpty &&
    +            isConvertible(tableDesc) =>
    --- End diff --
    
    ok.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3336/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r239744338
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -95,9 +77,116 @@ case class CreateHiveTableAsSelectCommand(
         Seq.empty[Row]
       }
     
    +  // Returns `DataWritingCommand` used to write data when the table exists.
    +  def writingCommandForExistingTable(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable): DataWritingCommand
    +
    +  // Returns `DataWritingCommand` used to write data when the table doesn't exist.
    +  def writingCommandForNewTable(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable): DataWritingCommand
    +
       override def argString: String = {
         s"[Database:${tableDesc.database}, " +
         s"TableName: ${tableDesc.identifier.table}, " +
         s"InsertIntoHiveTable]"
       }
     }
    +
    +/**
    + * Create table and insert the query result into it.
    + *
    + * @param tableDesc the table description, which may contain serde, storage handler etc.
    + * @param query the query whose result will be insert into the new relation
    + * @param mode SaveMode
    + */
    +case class CreateHiveTableAsSelectCommand(
    +    tableDesc: CatalogTable,
    +    query: LogicalPlan,
    +    outputColumnNames: Seq[String],
    +    mode: SaveMode)
    +  extends CreateHiveTableAsSelectBase {
    +
    +  override def writingCommandForExistingTable(
    +      catalog: SessionCatalog,
    +      tableDesc: CatalogTable): DataWritingCommand = {
    +    InsertIntoHiveTable(
    +      tableDesc,
    +      Map.empty,
    --- End diff --
    
    That is a different issue. I think it is better to address it in a separate PR.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Sounds like a clean solution. Please go ahead, thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #97352 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97352/testReport)** for PR 22514 at commit [`1223178`](https://github.com/apache/spark/commit/122317891cb2794d67b5e11157a43afaa25edbba).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r239887300
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -95,9 +77,116 @@ case class CreateHiveTableAsSelectCommand(
         Seq.empty[Row]
       }
     
    +  // Returns `DataWritingCommand` used to write data when the table exists.
    +  def writingCommandForExistingTable(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable): DataWritingCommand
    +
    +  // Returns `DataWritingCommand` used to write data when the table doesn't exist.
    +  def writingCommandForNewTable(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable): DataWritingCommand
    +
       override def argString: String = {
         s"[Database:${tableDesc.database}, " +
         s"TableName: ${tableDesc.identifier.table}, " +
         s"InsertIntoHiveTable]"
       }
     }
    +
    +/**
    + * Create table and insert the query result into it.
    + *
    + * @param tableDesc the table description, which may contain serde, storage handler etc.
    + * @param query the query whose result will be insert into the new relation
    + * @param mode SaveMode
    + */
    +case class CreateHiveTableAsSelectCommand(
    +    tableDesc: CatalogTable,
    +    query: LogicalPlan,
    +    outputColumnNames: Seq[String],
    +    mode: SaveMode)
    +  extends CreateHiveTableAsSelectBase {
    +
    +  override def writingCommandForExistingTable(
    +      catalog: SessionCatalog,
    +      tableDesc: CatalogTable): DataWritingCommand = {
    +    InsertIntoHiveTable(
    +      tableDesc,
    +      Map.empty,
    --- End diff --
    
    Please hold this PR. We need to fix the regression first. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5562/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3592/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96792/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #96401 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96401/testReport)** for PR 22514 at commit [`5debc60`](https://github.com/apache/spark/commit/5debc6096ae6e505d3386fd7eb569d154f158d55).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #99637 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99637/testReport)** for PR 22514 at commit [`e04812d`](https://github.com/apache/spark/commit/e04812d6aff0270ab4c0e6c3b2e204a682739e23).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r238908877
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
    @@ -181,62 +180,39 @@ case class RelationConversions(
         conf: SQLConf,
         sessionCatalog: HiveSessionCatalog) extends Rule[LogicalPlan] {
       private def isConvertible(relation: HiveTableRelation): Boolean = {
    -    val serde = relation.tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    -    serde.contains("parquet") && conf.getConf(HiveUtils.CONVERT_METASTORE_PARQUET) ||
    -      serde.contains("orc") && conf.getConf(HiveUtils.CONVERT_METASTORE_ORC)
    +    isConvertible(relation.tableMeta)
       }
     
    -  // Return true for Apache ORC and Hive ORC-related configuration names.
    -  // Note that Spark doesn't support configurations like `hive.merge.orcfile.stripe.level`.
    -  private def isOrcProperty(key: String) =
    -    key.startsWith("orc.") || key.contains(".orc.")
    -
    -  private def isParquetProperty(key: String) =
    -    key.startsWith("parquet.") || key.contains(".parquet.")
    -
    -  private def convert(relation: HiveTableRelation): LogicalRelation = {
    -    val serde = relation.tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    -
    -    // Consider table and storage properties. For properties existing in both sides, storage
    -    // properties will supersede table properties.
    -    if (serde.contains("parquet")) {
    -      val options = relation.tableMeta.properties.filterKeys(isParquetProperty) ++
    -        relation.tableMeta.storage.properties + (ParquetOptions.MERGE_SCHEMA ->
    -        conf.getConf(HiveUtils.CONVERT_METASTORE_PARQUET_WITH_SCHEMA_MERGING).toString)
    -      sessionCatalog.metastoreCatalog
    -        .convertToLogicalRelation(relation, options, classOf[ParquetFileFormat], "parquet")
    -    } else {
    -      val options = relation.tableMeta.properties.filterKeys(isOrcProperty) ++
    -        relation.tableMeta.storage.properties
    -      if (conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native") {
    -        sessionCatalog.metastoreCatalog.convertToLogicalRelation(
    -          relation,
    -          options,
    -          classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat],
    -          "orc")
    -      } else {
    -        sessionCatalog.metastoreCatalog.convertToLogicalRelation(
    -          relation,
    -          options,
    -          classOf[org.apache.spark.sql.hive.orc.OrcFileFormat],
    -          "orc")
    -      }
    -    }
    +  private def isConvertible(tableMeta: CatalogTable): Boolean = {
    +    val serde = tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    +    serde.contains("parquet") && SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_PARQUET) ||
    +      serde.contains("orc") && SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_ORC)
       }
     
    +  private val metastoreCatalog = sessionCatalog.metastoreCatalog
    +
       override def apply(plan: LogicalPlan): LogicalPlan = {
         plan resolveOperators {
           // Write path
           case InsertIntoTable(r: HiveTableRelation, partition, query, overwrite, ifPartitionNotExists)
             // Inserting into partitioned table is not supported in Parquet/Orc data source (yet).
               if query.resolved && DDLUtils.isHiveTable(r.tableMeta) &&
                 !r.isPartitioned && isConvertible(r) =>
    -        InsertIntoTable(convert(r), partition, query, overwrite, ifPartitionNotExists)
    +        InsertIntoTable(metastoreCatalog.convert(r), partition,
    +          query, overwrite, ifPartitionNotExists)
     
           // Read path
           case relation: HiveTableRelation
               if DDLUtils.isHiveTable(relation.tableMeta) && isConvertible(relation) =>
    -        convert(relation)
    +        metastoreCatalog.convert(relation)
    +
    +      // CTAS
    +      case CreateTable(tableDesc, mode, Some(query))
    +          if DDLUtils.isHiveTable(tableDesc) && tableDesc.partitionColumnNames.isEmpty &&
    +            isConvertible(tableDesc) =>
    --- End diff --
    
    I don't mind adding `HiveUtils.CONVERT_METASTORE_ORC_CTAS`; maybe we can do it in a follow-up?
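    
    If such per-format flags were added, the CTAS guard could mirror the existing `isConvertible` check; a rough sketch, where both conf names are hypothetical follow-up additions to `HiveUtils`:
    
        // Hypothetical: assumes CONVERT_METASTORE_PARQUET_CTAS and
        // CONVERT_METASTORE_ORC_CTAS were added to HiveUtils in a follow-up.
        private def isConvertibleCtas(tableMeta: CatalogTable): Boolean = {
          val serde = tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
          serde.contains("parquet") &&
            SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_PARQUET_CTAS) ||
          serde.contains("orc") &&
            SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_ORC_CTAS)
        }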


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3480/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    @viirya can you explain the high-level idea of how to fix it? It seems hard to fix, and we should reach a consensus on the approach first.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    It's definitely not a blocker, and we don't need to hold RC5 because of it.
    
    I think it needs a little more review, and I'm going to cut RC5 today (2.4.0 has already been delayed quite a bit), so it's very likely we can't get it into 2.4.0.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99398/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #99665 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99665/testReport)** for PR 22514 at commit [`3c07d74`](https://github.com/apache/spark/commit/3c07d74ec94cf85f9aad79cde6c42ea667cb2090).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class OptimizedCreateHiveTableAsSelectCommand(`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Yes, this is a performance regression for users who run CTAS on Hive serde tables. It has been a regression since Spark 2.3.0.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r239668492
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -95,9 +77,116 @@ case class CreateHiveTableAsSelectCommand(
         Seq.empty[Row]
       }
     
    +  // Returns `DataWritingCommand` used to write data when the table exists.
    +  def writingCommandForExistingTable(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable): DataWritingCommand
    +
    +  // Returns `DataWritingCommand` used to write data when the table doesn't exist.
    +  def writingCommandForNewTable(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable): DataWritingCommand
    +
       override def argString: String = {
         s"[Database:${tableDesc.database}, " +
         s"TableName: ${tableDesc.identifier.table}, " +
         s"InsertIntoHiveTable]"
       }
     }
    +
    +/**
    + * Create table and insert the query result into it.
    + *
    + * @param tableDesc the table description, which may contain serde, storage handler etc.
    + * @param query the query whose result will be insert into the new relation
    + * @param mode SaveMode
    + */
    +case class CreateHiveTableAsSelectCommand(
    +    tableDesc: CatalogTable,
    +    query: LogicalPlan,
    +    outputColumnNames: Seq[String],
    +    mode: SaveMode)
    +  extends CreateHiveTableAsSelectBase {
    +
    +  override def writingCommandForExistingTable(
    +      catalog: SessionCatalog,
    +      tableDesc: CatalogTable): DataWritingCommand = {
    +    InsertIntoHiveTable(
    +      tableDesc,
    +      Map.empty,
    --- End diff --
    
    Yes, I've discussed this with @cloud-fan. DataFrameWriter allows dynamic partitioning on Hive CTAS; for now this syntax is not allowed by SparkSqlParser.
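    
    To make that concrete, a rough sketch of the two entry points, assuming a Hive-enabled `SparkSession` named `spark`; the table name is made up, and the snippet may also need `hive.exec.dynamic.partition.mode=nonstrict`, as in the test quoted elsewhere in this thread:
    ```scala
    // DataFrameWriter-based Hive CTAS with a partition column is accepted:
    import spark.implicits._
    Seq(("a", 100)).toDF("part", "id")
      .write.format("hive").partitionBy("part")
      .mode("overwrite").saveAsTable("hive_ctas_partition_demo")

    // The equivalent SQL syntax is currently rejected by SparkSqlParser, e.g.:
    //   CREATE TABLE hive_ctas_partition_demo2 PARTITIONED BY (part STRING)
    //   STORED AS PARQUET AS SELECT ...
    ```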


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5745/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r225079174
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala ---
    @@ -34,11 +34,16 @@ import org.apache.spark.sql.types._
      * @param tableDesc the metadata of the table to be created.
      * @param mode the data writing mode
      * @param query an optional logical plan representing data to write into the created table.
    + * @param useExternalSerde whether to use external serde to write data, e.g., Hive Serde. Currently
    --- End diff --
    
    This is too hacky. We should not leak Hive-specific knowledge into general logical plans.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    @cloud-fan, is this a performance regression that affects users of Hive serde tables as well?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r237756348
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -95,9 +77,98 @@ case class CreateHiveTableAsSelectCommand(
         Seq.empty[Row]
       }
     
    +  def getDataWritingCommand(
    --- End diff --
    
    I feel it's better to have two methods: `writingCommandForExistingTable` and `writingCommandForNewTable`.
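    
    For illustration, a much-simplified sketch of how a shared base could dispatch between those two methods; the shape below is an assumption for discussion, not the code in this PR (SaveMode handling, schema fix-up and failure cleanup are omitted):
    ```scala
    // Simplified sketch: choose the writing command based on whether the target
    // table already exists; the real command must also honor SaveMode and create
    // the table with the query schema.
    override def run(sparkSession: SparkSession, child: SparkPlan): Seq[Row] = {
      val catalog = sparkSession.sessionState.catalog
      val writingCommand = if (catalog.tableExists(tableDesc.identifier)) {
        writingCommandForExistingTable(catalog, tableDesc)
      } else {
        catalog.createTable(tableDesc, ignoreIfExists = false)
        writingCommandForNewTable(catalog, catalog.getTableMetadata(tableDesc.identifier))
      }
      writingCommand.run(sparkSession, child)
      Seq.empty[Row]
    }
    ```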


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #96595 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96595/testReport)** for PR 22514 at commit [`5debc60`](https://github.com/apache/spark/commit/5debc6096ae6e505d3386fd7eb569d154f158d55).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #99370 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99370/testReport)** for PR 22514 at commit [`6e2a31b`](https://github.com/apache/spark/commit/6e2a31b0db71ab3dcc8a81cb5ca585f784793736).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #99694 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99694/testReport)** for PR 22514 at commit [`57fc943`](https://github.com/apache/spark/commit/57fc94383ad3c66e5b93f40378d8c94aaa726e7a).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4471/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #97995 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97995/testReport)** for PR 22514 at commit [`5780a5e`](https://github.com/apache/spark/commit/5780a5ecaf671e4a7475cc7ac8fc345308368fcf).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r225082659
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala ---
    @@ -34,11 +34,16 @@ import org.apache.spark.sql.types._
      * @param tableDesc the metadata of the table to be created.
      * @param mode the data writing mode
      * @param query an optional logical plan representing data to write into the created table.
    + * @param useExternalSerde whether to use external serde to write data, e.g., Hive Serde. Currently
    --- End diff --
    
    This is because all the rules related to data source conversion are located in `RelationConversions`.
    
    If we loosen this requirement, we can avoid this flag and let `CreateHiveTableAsSelectCommand` decide whether to convert to a data source or not.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r239319889
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -95,9 +77,116 @@ case class CreateHiveTableAsSelectCommand(
         Seq.empty[Row]
       }
     
    +  // Returns `DataWritingCommand` used to write data when the table exists.
    +  def writingCommandForExistingTable(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable): DataWritingCommand
    +
    +  // Returns `DataWritingCommand` used to write data when the table doesn't exist.
    +  def writingCommandForNewTable(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable): DataWritingCommand
    +
       override def argString: String = {
         s"[Database:${tableDesc.database}, " +
         s"TableName: ${tableDesc.identifier.table}, " +
         s"InsertIntoHiveTable]"
       }
     }
    +
    +/**
    + * Create table and insert the query result into it.
    + *
    + * @param tableDesc the table description, which may contain serde, storage handler etc.
    + * @param query the query whose result will be insert into the new relation
    + * @param mode SaveMode
    + */
    +case class CreateHiveTableAsSelectCommand(
    +    tableDesc: CatalogTable,
    +    query: LogicalPlan,
    +    outputColumnNames: Seq[String],
    +    mode: SaveMode)
    +  extends CreateHiveTableAsSelectBase {
    +
    +  override def writingCommandForExistingTable(
    +      catalog: SessionCatalog,
    +      tableDesc: CatalogTable): DataWritingCommand = {
    +    InsertIntoHiveTable(
    +      tableDesc,
    +      Map.empty,
    --- End diff --
    
    Could you point it out? I want to ensure it is covered.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4458/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r239895028
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -95,9 +77,116 @@ case class CreateHiveTableAsSelectCommand(
         Seq.empty[Row]
       }
     
    +  // Returns `DataWritingCommand` used to write data when the table exists.
    +  def writingCommandForExistingTable(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable): DataWritingCommand
    +
    +  // Returns `DataWritingCommand` used to write data when the table doesn't exist.
    +  def writingCommandForNewTable(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable): DataWritingCommand
    +
       override def argString: String = {
         s"[Database:${tableDesc.database}, " +
         s"TableName: ${tableDesc.identifier.table}, " +
         s"InsertIntoHiveTable]"
       }
     }
    +
    +/**
    + * Create table and insert the query result into it.
    + *
    + * @param tableDesc the table description, which may contain serde, storage handler etc.
    + * @param query the query whose result will be insert into the new relation
    + * @param mode SaveMode
    + */
    +case class CreateHiveTableAsSelectCommand(
    +    tableDesc: CatalogTable,
    +    query: LogicalPlan,
    +    outputColumnNames: Seq[String],
    +    mode: SaveMode)
    +  extends CreateHiveTableAsSelectBase {
    +
    +  override def writingCommandForExistingTable(
    +      catalog: SessionCatalog,
    +      tableDesc: CatalogTable): DataWritingCommand = {
    +    InsertIntoHiveTable(
    +      tableDesc,
    +      Map.empty,
    --- End diff --
    
    https://github.com/apache/spark/pull/23255


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #99694 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99694/testReport)** for PR 22514 at commit [`57fc943`](https://github.com/apache/spark/commit/57fc94383ad3c66e5b93f40378d8c94aaa726e7a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3585/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98010/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #98010 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98010/testReport)** for PR 22514 at commit [`0b0a900`](https://github.com/apache/spark/commit/0b0a90059331a19f160de35d12634cad1b98f944).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r239539992
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -95,9 +77,116 @@ case class CreateHiveTableAsSelectCommand(
         Seq.empty[Row]
       }
     
    +  // Returns `DataWritingCommand` used to write data when the table exists.
    +  def writingCommandForExistingTable(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable): DataWritingCommand
    +
    +  // Returns `DataWritingCommand` used to write data when the table doesn't exist.
    +  def writingCommandForNewTable(
    +    catalog: SessionCatalog,
    +    tableDesc: CatalogTable): DataWritingCommand
    +
       override def argString: String = {
         s"[Database:${tableDesc.database}, " +
         s"TableName: ${tableDesc.identifier.table}, " +
         s"InsertIntoHiveTable]"
       }
     }
    +
    +/**
    + * Create table and insert the query result into it.
    + *
    + * @param tableDesc the table description, which may contain serde, storage handler etc.
    + * @param query the query whose result will be insert into the new relation
    + * @param mode SaveMode
    + */
    +case class CreateHiveTableAsSelectCommand(
    +    tableDesc: CatalogTable,
    +    query: LogicalPlan,
    +    outputColumnNames: Seq[String],
    +    mode: SaveMode)
    +  extends CreateHiveTableAsSelectBase {
    +
    +  override def writingCommandForExistingTable(
    +      catalog: SessionCatalog,
    +      tableDesc: CatalogTable): DataWritingCommand = {
    +    InsertIntoHiveTable(
    +      tableDesc,
    +      Map.empty,
    --- End diff --
    
    Does this work?
    ```
        withTable("hive_test") {
          withSQLConf(
              "hive.exec.dynamic.partition.mode" -> "nonstrict") {
            val df = Seq(("a", 100)).toDF("part", "id")
            df.write.format("hive").partitionBy("part")
              .mode("overwrite").saveAsTable("hive_test")
            df.write.format("hive").partitionBy("part")
              .mode("append").saveAsTable("hive_test")
          }
        }
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #96618 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96618/testReport)** for PR 22514 at commit [`5debc60`](https://github.com/apache/spark/commit/5debc6096ae6e505d3386fd7eb569d154f158d55).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96595/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #99501 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99501/testReport)** for PR 22514 at commit [`9629175`](https://github.com/apache/spark/commit/96291751c5a4992325f37bcb794ea5fd3f31593b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5446/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #96784 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96784/testReport)** for PR 22514 at commit [`1223178`](https://github.com/apache/spark/commit/122317891cb2794d67b5e11157a43afaa25edbba).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99369/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #99369 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99369/testReport)** for PR 22514 at commit [`796e964`](https://github.com/apache/spark/commit/796e9640726c6f5ed4be7dd356245ff121f4dea6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #96410 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96410/testReport)** for PR 22514 at commit [`5debc60`](https://github.com/apache/spark/commit/5debc6096ae6e505d3386fd7eb569d154f158d55).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #99398 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99398/testReport)** for PR 22514 at commit [`e42a846`](https://github.com/apache/spark/commit/e42a846f625f2119c6539bf91bf337f0497402ae).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96410/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r225152625
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala ---
    @@ -34,11 +34,16 @@ import org.apache.spark.sql.types._
      * @param tableDesc the metadata of the table to be created.
      * @param mode the data writing mode
      * @param query an optional logical plan representing data to write into the created table.
    + * @param useExternalSerde whether to use external serde to write data, e.g., Hive Serde. Currently
    --- End diff --
    
    Do you think it is better to put all of this Hive CTAS conversion logic into `CreateHiveTableAsSelectCommand`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r227654240
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -45,6 +46,11 @@ case class CreateHiveTableAsSelectCommand(
     
       override def run(sparkSession: SparkSession, child: SparkPlan): Seq[Row] = {
         val catalog = sparkSession.sessionState.catalog
    +    val metastoreCatalog = catalog.asInstanceOf[HiveSessionCatalog].metastoreCatalog
    +
    +    // Whether this table is convertible to data source relation.
    +    val isConvertible = metastoreCatalog.isConvertible(tableDesc)
    --- End diff --
    
    Another idea: can we move this logic into the `RelationConversions` rule? For example:
    ```
    case CreateTable(tbl, mode, Some(query)) if DDLUtils.isHiveTable(tbl) && isConvertible(tbl) =>
      Union(CreateTable(tbl, mode, None), InsertIntoTable ...)
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #96595 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96595/testReport)** for PR 22514 at commit [`5debc60`](https://github.com/apache/spark/commit/5debc6096ae6e505d3386fd7eb569d154f158d55).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5695/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4428/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #96618 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96618/testReport)** for PR 22514 at commit [`5debc60`](https://github.com/apache/spark/commit/5debc6096ae6e505d3386fd7eb569d154f158d55).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r228772719
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala ---
    @@ -92,4 +92,18 @@ class HiveParquetSuite extends QueryTest with ParquetTest with TestHiveSingleton
           }
         }
       }
    +
    +  test("SPARK-25271: write empty map into hive parquet table") {
    --- End diff --
    
    Although this is the test case originally reported in SPARK-25271 (`Creating parquet table with all the column null throws exception`), this PR aims to resolve the root cause of the more general issue, `Hive ctas commands should use data source if it is convertible`. Can we have additional test cases outside `HiveParquetSuite`?
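    
    For example, a rough sketch of an ORC counterpart (the suite placement, names and assertions below are illustrative assumptions, not code from this PR):
    ```scala
    // Hypothetical ORC variant of the reported case; assumes a Hive test suite that
    // mixes in SQLTestUtils and TestHiveSingleton (so sql/withTable/checkAnswer exist).
    test("SPARK-25271: CTAS into hive orc table with an empty map column") {
      import testImplicits._
      withTempView("src") {
        Seq(Map(1 -> "a"), Map.empty[Int, String]).toDF("m").createOrReplaceTempView("src")
        withTable("t") {
          sql("CREATE TABLE t STORED AS ORC AS SELECT m FROM src")
          checkAnswer(sql("SELECT m FROM t"),
            Row(Map(1 -> "a")) :: Row(Map.empty[Int, String]) :: Nil)
        }
      }
    }
    ```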


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22514#discussion_r227655503
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
    @@ -45,6 +46,11 @@ case class CreateHiveTableAsSelectCommand(
     
       override def run(sparkSession: SparkSession, child: SparkPlan): Seq[Row] = {
         val catalog = sparkSession.sessionState.catalog
    +    val metastoreCatalog = catalog.asInstanceOf[HiveSessionCatalog].metastoreCatalog
    +
    +    // Whether this table is convertible to data source relation.
    +    val isConvertible = metastoreCatalog.isConvertible(tableDesc)
    --- End diff --
    
    That is an interesting idea. Let me try it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #96784 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96784/testReport)** for PR 22514 at commit [`1223178`](https://github.com/apache/spark/commit/122317891cb2794d67b5e11157a43afaa25edbba).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    **[Test build #96410 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96410/testReport)** for PR 22514 at commit [`5debc60`](https://github.com/apache/spark/commit/5debc6096ae6e505d3386fd7eb569d154f158d55).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22514
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96401/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org