Posted to reviews@spark.apache.org by rdblue <gi...@git.apache.org> on 2016/04/12 02:23:11 UTC

[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

GitHub user rdblue opened a pull request:

    https://github.com/apache/spark/pull/12313

    [SPARK-14543] [SQL] Fix InsertIntoTable column resolution. (WIP)

    WIP: this depends on #12239 and includes its commits for SPARK-14459.
    
    ## What changes were proposed in this pull request?
    
    1. This updates the logic that resolves the output table's columns against the incoming `LogicalPlan`. It catches cases where there are too many data columns and throws an `AnalysisException` rather than silently dropping the extra data. It also improves the error message when there are too few columns and warns when the output columns appear to be out of order.
    2. This combines the pre-insert casts for Hive's `MetastoreRelation` with the pre-insert cast and rename for `LogicalRelation`s. Both are now handled by a single `ResolveOutputColumns` rule in the analyzer that implements the improvements above. Casts are now `UpCast`s to avoid silently adding incorrect casts when columns are misaligned.
    3. This adds a by-name column resolution strategy that matches output columns to the incoming data by name. This is exposed on `DataFrameWriter`:
    
    ```scala
    sqlContext.table("source").write.byName.insertInto("destination")
    ```
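    
    For example, a hypothetical session (table and column names are illustrative, not taken from the patch's tests; assumes `spark.implicits._` is in scope):
    
    ```scala
    // destination has columns (id bigint, data string)
    val df = Seq((1L, "a", "junk")).toDF("id", "data", "extra")
    
    // by-position resolution now fails instead of silently dropping "extra"
    df.write.insertInto("destination")         // throws AnalysisException
    
    // by-name resolution selects only the columns the table expects
    df.write.byName.insertInto("destination")  // writes (id, data)
    ```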
    
    ## How was this patch tested?
    
    This patch includes unit tests that exercise the cases outlined above.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rdblue/spark SPARK-14543-fix-hive-write-cast-and-rename

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12313.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12313
    
----
commit 792be85b9c694b9967dcd863075c3230abd70225
Author: Ryan Blue <bl...@apache.org>
Date:   2016-04-01T19:02:59Z

    SPARK-14459: Detect relation partitioning and adjust the logical plan to match.
    
    This detects a relation's partitioning and adds checks to the analyzer.
    If an InsertIntoTable node has no partitioning, the relation's partition
    scheme is filled in and the input columns are adjusted to match, placing
    the partition columns at the end in partition order. If an
    InsertIntoTable node specifies partitioning, it is checked against the
    table's reported partitions.
    
    These changes required adding a PartitionedRelation trait to the catalog
    interface because Hive's MetastoreRelation doesn't extend
    CatalogRelation.
    
    This commit also includes a fix to InsertIntoTable's resolved logic,
    which now verifies that all expected columns are present, including
    dynamic partition columns. Previously, the number of expected columns
    was not checked, so resolved could be true even when columns were
    missing.
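    
    As a sketch of the adjustment (hypothetical table and names, not code
    from this commit):
    
    ```scala
    // partitioned has columns (id bigint, data string)
    // and is PARTITIONED BY (part string)
    // data: a DataFrame with columns (id, data, part)
    data.write.insertInto("partitioned")
    // the analyzer fills in the table's partition scheme, as if the user
    // had written PARTITION (part), so the input must supply the partition
    // column(s) last, in partition order
    ```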

commit 247f0566588259cf8d4e2dec77cb06a739c7b86f
Author: Ryan Blue <bl...@apache.org>
Date:   2016-04-07T17:17:59Z

    SPARK-14459: Add test for InsertIntoTable resolution.
    
    This tests the bug in InsertIntoTable's resolve method, where zip
    silently drops expected output columns that have no corresponding
    input column.
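    
    For reference, a minimal illustration of the zip behavior behind the
    bug (standalone Scala, not code from the patch):
    
    ```scala
    // zip truncates to the shorter collection, so expected output columns
    // with no matching input column used to disappear, letting resolved
    // return true despite missing columns
    val expected = Seq("id", "data", "part")
    val incoming = Seq("id", "data")
    expected.zip(incoming)  // Seq((id,id), (data,data)) -- "part" is gone
    ```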

commit 2a807a97cf793a480dbaf172789a315624db44b9
Author: Ryan Blue <bl...@apache.org>
Date:   2016-04-08T22:23:34Z

    SPARK-14459: Update partition spec validation test.
    
    This test expected to fail the strict partition check, but with support
    for table partitioning in the analyzer the problem is caught sooner and
    gets a better error message. The message now complains that the
    partitioning doesn't match, rather than about strict mode, which
    wouldn't help.

commit 6491788a3419b091dfd6ca2a14eebb9b0ec33635
Author: Ryan Blue <bl...@apache.org>
Date:   2016-04-11T19:51:02Z

    SPARK-14459: Fix error message broken by partition checks.
    
    SPARK-6941 added a test that verifies a reasonable error message when
    trying to write to a OneRowRelation. OneRowRelation's output is always
    Nil, so the additional checks were catching a different problem. This
    commit adds a OneRowRelation class and uses it to set the correct
    expected columns in the test.

commit cf82d952c731af8ca7188bf1dc99b453ff00aa4b
Author: Ryan Blue <bl...@apache.org>
Date:   2016-04-05T20:35:50Z

    Move pre-insertion casts into the analyzer and fix edge cases.
    
    This combines Hive's pre-insertion casts (without renames) that handle
    partitioning with the pre-insertion casts/renames in core. The combined
    rule, ResolveOutputColumns, resolves columns either by name or by
    position. Resolving by position detects cases where the number of
    columns is incorrect or where the input columns are a permutation of
    the output columns, and fails. When resolving by name, each output
    column is located by name in the child plan. This handles cases where
    a subset of a data frame is written out.
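    
    A sketch of the by-name subset case (hypothetical table and column
    names):
    
    ```scala
    // destination has columns (id bigint, data string)
    val df = spark.table("source")  // columns: id, data, audit
    
    // each table column is located by name in the child plan, so the
    // unreferenced audit column is simply not written
    df.write.byName.insertInto("destination")
    ```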

----




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-221966605
  
    @rdblue Thank you for this PR. Those improvements sound good. I chatted with others, and here are two questions.
    1. `sqlContext.table("source").write.byName.insertInto("destination")` will change the behavior of insertInto and make it different from SQL's insertInto. It looks like it would be better to add another command instead of changing the behavior of insertInto in this way.
    2. When we use resolution by name, it seems we will complain if the data has fewer fields. What will happen if the data has more fields?
    
    How about we first improve the position-based resolution part? Then we can discuss the name-based part separately.




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-213205224
  
    **[Test build #56605 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56605/consoleFull)** for PR 12313 at commit [`7516ffd`](https://github.com/apache/spark/commit/7516ffd2e6510e6f9805563970637b019a4e5fd9).
     * This patch **fails Spark unit tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.




[GitHub] spark pull request #12313: [SPARK-14543] [SQL] Improve InsertIntoTable colum...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12313#discussion_r66688732
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -505,6 +506,117 @@ class Analyzer(
         }
       }
     
    +  object ResolveOutputColumns extends Rule[LogicalPlan] {
    +    def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
    +      case ins @ InsertIntoTable(relation: LogicalPlan, partition, _, _, _, _)
    +          if relation.resolved && !ins.resolved =>
    +        resolveOutputColumns(ins, expectedColumns(relation, partition), relation.toString)
    +    }
    +
    +    private def resolveOutputColumns(
    +        insertInto: InsertIntoTable,
    +        columns: Seq[Attribute],
    +        relation: String) = {
    +      val resolved = if (insertInto.isMatchByName) {
    +        projectAndCastOutputColumns(columns, insertInto.child, relation)
    +      } else {
    +        castAndRenameOutputColumns(columns, insertInto.child, relation)
    +      }
    +
    +      if (resolved == insertInto.child.output) {
    +        insertInto
    +      } else {
    +        insertInto.copy(child = Project(resolved, insertInto.child))
    +      }
    +    }
    +
    +    /**
    +     * Resolves output columns by input column name, adding casts if necessary.
    +     */
    +    private def projectAndCastOutputColumns(
    +        output: Seq[Attribute],
    +        data: LogicalPlan,
    +        relation: String): Seq[NamedExpression] = {
    +      output.map { col =>
    +        data.resolveQuoted(col.name, resolver) match {
    +          case Some(inCol) if col.dataType != inCol.dataType =>
    +            Alias(UpCast(inCol, col.dataType, Seq()), col.name)()
    +          case Some(inCol) => inCol
    +          case None =>
    +            throw new AnalysisException(
    +              s"Cannot resolve ${col.name} in ${data.output.mkString(",")}")
    +        }
    +      }
    +    }
    +
    +    private def castAndRenameOutputColumns(
    +        output: Seq[Attribute],
    +        data: LogicalPlan,
    +        relation: String): Seq[NamedExpression] = {
    +      val outputNames = output.map(_.name)
    +      // incoming expressions may not have names
    +      val inputNames = data.output.flatMap(col => Option(col.name))
    +      if (output.size > data.output.size) {
    +        // always a problem
    +        throw new AnalysisException(
    +          s"""Not enough data columns to write into $relation:
    +             |Data columns: ${data.output.mkString(",")}
    +             |Table columns: ${outputNames.mkString(",")}""".stripMargin)
    +      } else if (output.size < data.output.size) {
    +        if (outputNames.toSet.subsetOf(inputNames.toSet)) {
    --- End diff --
    
    There's a slight difference if we're exposing the write-by-name feature: if the names are a subset, then the user probably intended to write by name and we should suggest how to do that. Given that we're not exposing write-by-name, I'll remove it and we can add it back later.




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59850/
    Test FAILed.




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-222225637
  
    @rdblue Thank you for your reply. For #2, yea, I feel it is better to be strict right now. I checked with yesterday's master, and it seems we already require the data and the table to have the same number of fields for write.insertInto (see https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/52316283059651/545869019913238/4814681571895601/latest.html).
    
    I do agree that in the DataFrame API it is not obvious that `insertInto` follows SQL's behavior. But I am not sure changing its behavior is the best solution. My main concern with adding `byName` is that it makes the behavior of `insertInto` different from SQL's `insertInto`. Also, I feel it would be good to have a holistic solution that handles self-describing data well (e.g. adding new columns/inner fields and omitting existing columns/inner fields).
    
    (btw, right now, `write.mode("append").saveAsTable(....)` does name-based resolution on top-level columns.)
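    
    A minimal sketch of that contrast (hypothetical table `t`; behavior as described above, not verified against a particular build; assumes `spark.implicits._` is in scope):
    
    ```scala
    // given CREATE TABLE t (a string, b string)
    val df = Seq(("bee", "ay")).toDF("b", "a")
    
    df.write.insertInto("t")                  // by position: t.a <- "bee", t.b <- "ay"
    df.write.mode("append").saveAsTable("t")  // by name:     t.a <- "ay",  t.b <- "bee"
    ```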




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59790/
    Test PASSed.




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-208998364
  
    **[Test build #55616 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55616/consoleFull)** for PR 12313 at commit [`f2186e5`](https://github.com/apache/spark/commit/f2186e5b2e0d569c31f1bf05072949fb0f5a4358).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  class ResolveOutputColumns extends Rule[LogicalPlan] `




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-212644405
  
    **[Test build #56411 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56411/consoleFull)** for PR 12313 at commit [`6c107ff`](https://github.com/apache/spark/commit/6c107ff6307b279c507307e369cba1545e0c1468).
     * This patch **fails Spark unit tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.




[GitHub] spark pull request #12313: [SPARK-14543] [SQL] Improve InsertIntoTable colum...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12313#discussion_r65758947
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala ---
    @@ -284,8 +284,128 @@ class InsertIntoHiveTableSuite extends QueryTest with TestHiveSingleton with Bef
           val data = (1 to 10).map(i => (i.toLong, s"data-$i")).toDF("id", "data")
     
           val logical = InsertIntoTable(spark.table("partitioned").logicalPlan,
    -        Map("part" -> None), data.logicalPlan, overwrite = false, ifNotExists = false)
    +        Map("part" -> None), data.logicalPlan, overwrite = false, ifNotExists = false,
    +        Map("matchByName" -> "true"))
           assert(!logical.resolved, "Should not resolve: missing partition data")
         }
       }
    +
    +  test("Insert unnamed expressions by position") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, part string)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      val expected = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" else "odd"))
    +          .toDF("id", "data", "part")
    +      val data = expected.select("id", "part")
    +
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      // should be able to insert an expression when NOT mapping columns by name
    +      spark.table("source").selectExpr("id", "part", "CONCAT('data-', id)")
    +          .write.insertInto("partitioned")
    +      checkAnswer(sql("SELECT * FROM partitioned"), expected.collect().toSeq)
    +    }
    +  }
    +
    +  test("Insert expression by name") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, part string)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      val expected = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" else "odd"))
    +          .toDF("id", "data", "part")
    +      val data = expected.select("id", "part")
    +
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      intercept[AnalysisException] {
    +        // also a problem when mapping by name
    +        spark.table("source").selectExpr("id", "part", "CONCAT('data-', id)")
    +            .write.option("matchByName", true).insertInto("partitioned")
    +      }
    +
    +      // should be able to insert an expression using AS when mapping columns by name
    +      spark.table("source").selectExpr("id", "part", "CONCAT('data-', id) as data")
    +          .write.option("matchByName", true).insertInto("partitioned")
    +      checkAnswer(sql("SELECT * FROM partitioned"), expected.collect().toSeq)
    +    }
    +  }
    +
    +  test("Reject missing columns") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, part string)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      intercept[AnalysisException] {
    +        spark.table("source").write.insertInto("partitioned")
    +      }
    +
    +      intercept[AnalysisException] {
    +        // also a problem when mapping by name
    +        spark.table("source").write.option("matchByName", true).insertInto("partitioned")
    +      }
    +    }
    +  }
    +
    +  test("Reject extra columns") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, data string, extra string, part string)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      intercept[AnalysisException] {
    +        spark.table("source").write.insertInto("partitioned")
    +      }
    +
    +      val data = (1 to 10)
    +          .map(i => (i, s"data-$i", s"${i * i}", if ((i % 2) == 0) "even" else "odd"))
    +          .toDF("id", "data", "extra", "part")
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      spark.table("source").write.option("matchByName", true).insertInto("partitioned")
    +
    +      val expected = data.select("id", "data", "part")
    +      checkAnswer(sql("SELECT * FROM partitioned"), expected.collect().toSeq)
    +    }
    +  }
    +
    +  test("Ignore names when writing by position") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, part string, data string)") // part, data transposed
    +      sql("CREATE TABLE destination (id bigint, data string, part string)")
    +
    +      val data = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" else "odd"))
    +          .toDF("id", "data", "part")
    +
    +      // write into the reordered table by name
    +      data.write.option("matchByName", true).insertInto("source")
    +      checkAnswer(sql("SELECT id, data, part FROM source"), data.collect().toSeq)
    +
    +      val expected = data.select($"id", $"part" as "data", $"data" as "part")
    +
    +      // this produces a warning, but writes src.part -> dest.data and src.data -> dest.part
    +      spark.table("source").write.insertInto("destination")
    +      checkAnswer(sql("SELECT id, data, part FROM destination"), expected.collect().toSeq)
    +    }
    +  }
    +
    +  test("Reorder columns by name") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (data string, part string, id bigint)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      val data = (1 to 10).map(i => (s"data-$i", if ((i % 2) == 0) "even" else "odd", i))
    +          .toDF("data", "part", "id")
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      spark.table("source").write.option("matchByName", true).insertInto("partitioned")
    --- End diff --
    
    Yes.
    
    Users call `insertInto` when they need to insert data into an existing table and `saveAsTable` when they need to create and insert. Since the latter is backed by `CreateTableAsSelect`, it looks like there's a straightforward mapping from Hive SQL to writer commands.
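    
    Roughly, as an illustrative mapping (not an exact translation):
    
    ```scala
    df.write.insertInto("t")     // ~ INSERT INTO TABLE t SELECT ...
    df.write.saveAsTable("t")    // ~ CREATE TABLE t AS SELECT ...
    ```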




[GitHub] spark pull request #12313: [SPARK-14543] [SQL] Improve InsertIntoTable colum...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12313#discussion_r65776253
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala ---
    @@ -284,8 +284,128 @@ class InsertIntoHiveTableSuite extends QueryTest with TestHiveSingleton with Bef
    +      spark.table("source").write.option("matchByName", true).insertInto("partitioned")
    --- End diff --
    
    I agree that we should take some time to think through the semantics of the public API, but I think it is clear that `InsertIntoTable` should support both by-name and by-position resolution. Since that's already implemented but not publicly accessible, I think it's fine to continue with this PR rather than removing support only to add it back later along with the public API. Does that sound reasonable?




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    **[Test build #75047 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75047/consoleFull)** for PR 12313 at commit [`906e68d`](https://github.com/apache/spark/commit/906e68d071daf4e2f15b0f2017b248b872bb6285).
     * This patch passes all tests.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    **[Test build #59794 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59794/consoleFull)** for PR 12313 at commit [`ecaf95f`](https://github.com/apache/spark/commit/ecaf95fae2135e368d6d10d2bb1783ffd7599bec).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-212572463
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56386/
    Test FAILed.




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-212608804
  
    **[Test build #56411 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56411/consoleFull)** for PR 12313 at commit [`6c107ff`](https://github.com/apache/spark/commit/6c107ff6307b279c507307e369cba1545e0c1468).




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    We were trying to get this in just before the 2.0 release, which was a bad time. We've just been maintaining it in our version, but I'm going to be rebasing it onto 2.1 soon, so I'll see what needs to be done to get it in upstream.




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    **[Test build #59894 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59894/consoleFull)** for PR 12313 at commit [`302464b`](https://github.com/apache/spark/commit/302464b1b7f0eb8432e7816757b393cec9230b78).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #12313: [SPARK-14543] [SQL] Improve InsertIntoTable colum...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12313#discussion_r65758393
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala ---
    @@ -284,8 +284,128 @@ class InsertIntoHiveTableSuite extends QueryTest with TestHiveSingleton with Bef
    +      spark.table("source").write.option("matchByName", true).insertInto("partitioned")
    --- End diff --
    
    Yea, they have different plans, but do users feel this difference? AFAIK the major difference is that `insertInto` uses by-position resolution and `saveAsTable` uses by-name resolution. Please correct me if I'm wrong.




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-213150375
  
    Build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-213634441
  
    **[Test build #56757 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56757/consoleFull)** for PR 12313 at commit [`c443b32`](https://github.com/apache/spark/commit/c443b324099ce4937a4fbf38408e3b0be034658f).




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-217947831
  
    **[Test build #58156 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58156/consoleFull)** for PR 12313 at commit [`ab23c9c`](https://github.com/apache/spark/commit/ab23c9ce22ed09c5a228b32cf8e6ca1f0e7a56b1).




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    I see. Thank you.




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-213559055
  
    Build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-220390952
  
    **[Test build #58876 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58876/consoleFull)** for PR 12313 at commit [`3a24e36`](https://github.com/apache/spark/commit/3a24e36ceb9b815e8c933723e22dbdbfed35c840).




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-212572199
  
    **[Test build #56386 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56386/consoleFull)** for PR 12313 at commit [`0a6ec77`](https://github.com/apache/spark/commit/0a6ec77ae3d3ec932c37bdd63ae20d6b71180fd1).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12313#discussion_r63912066
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -498,6 +499,117 @@ class Analyzer(
         }
       }
     
    +  val ResolveOutputColumns = new ResolveOutputColumns
    +  class ResolveOutputColumns extends Rule[LogicalPlan] {
    +    def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
    +      case ins @ InsertIntoTable(relation: LogicalPlan, partition, _, _, _, _)
    +          if ins.childrenResolved && !ins.resolved =>
    +        resolveOutputColumns(ins, expectedColumns(relation, partition), relation.toString)
    +    }
    +
    +    protected def resolveOutputColumns(insertInto: InsertIntoTable,
    +                                       columns: Seq[Attribute],
    +                                       relation: String) = {
    +      val resolved = if (insertInto.isMatchByName) {
    +        projectAndCastOutputColumns(columns, insertInto.child, relation)
    +      } else {
    +        castAndRenameOutputColumns(columns, insertInto.child, relation)
    +      }
    +
    +      if (resolved == insertInto.child.output) {
    +        insertInto
    +      } else {
    +        insertInto.copy(child = Project(resolved, insertInto.child))
    +      }
    +    }
    +
    +    /**
    +     * Resolves output columns by input column name, adding casts if necessary.
    +     */
    +    private def projectAndCastOutputColumns(output: Seq[Attribute],
    +                                            data: LogicalPlan,
    +                                            relation: String): Seq[NamedExpression] = {
    +      output.map { col =>
    +        data.resolveQuoted(col.name, resolver) match {
    +          case Some(inCol) if col.dataType != inCol.dataType =>
    +            Alias(UpCast(inCol, col.dataType, Seq()), col.name)()
    +          case Some(inCol) => inCol
    +          case None =>
    +            throw new AnalysisException(
    +              s"Cannot resolve ${col.name} in ${data.output.mkString(",")}")
    +        }
    +      }
    +    }
    +
    +    private def castAndRenameOutputColumns(output: Seq[Attribute],
    +                                           data: LogicalPlan,
    +                                           relation: String): Seq[NamedExpression] = {
    +      val outputNames = output.map(_.name)
    +      // incoming expressions may not have names
    +      val inputNames = data.output.flatMap(col => Option(col.name))
    +      if (output.size > data.output.size) {
    +        // always a problem
    +        throw new AnalysisException(
    +          s"""Not enough data columns to write into $relation:
    +             |Data columns: ${data.output.mkString(",")}
    +             |Table columns: ${outputNames.mkString(",")}""".stripMargin)
    +      } else if (output.size < data.output.size) {
    +        if (outputNames.toSet.subsetOf(inputNames.toSet)) {
    +          throw new AnalysisException(
    +            s"""Table column names are a subset of the input data columns:
    +               |Data columns: ${inputNames.mkString(",")}
    +               |Table columns: ${outputNames.mkString(",")}
    +               |To write a subset of the columns by name, use df.write.byName.insertInto(...)"""
    +                .stripMargin)
    +        } else {
    +          // be conservative and fail if there are too many columns
    +          throw new AnalysisException(
    +            s"""Extra data columns to write into $relation:
    +               |Data columns: ${data.output.mkString(",")}
    +               |Table columns: ${outputNames.mkString(",")}""".stripMargin)
    +        }
    +      } else {
    +        // check for reordered names and warn. this may be on purpose, so it isn't an error.
    +        if (outputNames.toSet == inputNames.toSet && outputNames != inputNames) {
    +          logWarning(
    +            s"""Data column names match the table in a different order:
    +                |Data columns: ${inputNames.mkString(",")}
    +                |Table columns: ${outputNames.mkString(",")}
    +                |To map columns by name, use df.write.byName.insertInto(...)""".stripMargin)
    +        }
    +      }
    +
    +      data.output.zip(output).map {
    +        case (in, out) if !in.dataType.sameType(out.dataType) =>
    +          Alias(Cast(in, out.dataType), out.name)()
    +        case (in, out) if in.name != out.name =>
    +          Alias(in, out.name)()
    +        case (in, _) => in
    +      }
    +    }
    +
    +    private def expectedColumns(data: LogicalPlan,
    +                                partitionData: Map[String, Option[String]]): Seq[Attribute] = {
    --- End diff --
    
    Same as above.
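
    To illustrate how the two strategies in the rule above surface through the writer, here is a minimal sketch (the `byName` call is the option proposed in this PR; table names are hypothetical):

    ```scala
    // By position (the default): castAndRenameOutputColumns matches input
    // columns to table columns left to right, adding casts and renames.
    spark.table("source").write.insertInto("destination")

    // By name (proposed): projectAndCastOutputColumns resolves each table
    // column against the input by name, so unmatched input columns are not
    // projected and a missing name raises an AnalysisException.
    spark.table("source").write.byName.insertInto("destination")
    ```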




[GitHub] spark pull request #12313: [SPARK-14543] [SQL] Improve InsertIntoTable colum...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12313#discussion_r65583506
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala ---
    @@ -284,8 +284,128 @@ class InsertIntoHiveTableSuite extends QueryTest with TestHiveSingleton with Bef
           val data = (1 to 10).map(i => (i.toLong, s"data-$i")).toDF("id", "data")
     
           val logical = InsertIntoTable(spark.table("partitioned").logicalPlan,
    -        Map("part" -> None), data.logicalPlan, overwrite = false, ifNotExists = false)
    +        Map("part" -> None), data.logicalPlan, overwrite = false, ifNotExists = false,
    +        Map("matchByName" -> "true"))
           assert(!logical.resolved, "Should not resolve: missing partition data")
         }
       }
    +
    +  test("Insert unnamed expressions by position") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, part string)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      val expected = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" else "odd"))
    +          .toDF("id", "data", "part")
    +      val data = expected.select("id", "part")
    +
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      // should be able to insert an expression when NOT mapping columns by name
    +      spark.table("source").selectExpr("id", "part", "CONCAT('data-', id)")
    +          .write.insertInto("partitioned")
    +      checkAnswer(sql("SELECT * FROM partitioned"), expected.collect().toSeq)
    +    }
    +  }
    +
    +  test("Insert expression by name") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, part string)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      val expected = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" else "odd"))
    +          .toDF("id", "data", "part")
    +      val data = expected.select("id", "part")
    +
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      intercept[AnalysisException] {
    +        // also a problem when mapping by name
    +        spark.table("source").selectExpr("id", "part", "CONCAT('data-', id)")
    +            .write.option("matchByName", true).insertInto("partitioned")
    +      }
    +
    +      // should be able to insert an expression using AS when mapping columns by name
    +      spark.table("source").selectExpr("id", "part", "CONCAT('data-', id) as data")
    +          .write.option("matchByName", true).insertInto("partitioned")
    +      checkAnswer(sql("SELECT * FROM partitioned"), expected.collect().toSeq)
    +    }
    +  }
    +
    +  test("Reject missing columns") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, part string)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      intercept[AnalysisException] {
    +        spark.table("source").write.insertInto("partitioned")
    +      }
    +
    +      intercept[AnalysisException] {
    +        // also a problem when mapping by name
    +        spark.table("source").write.option("matchByName", true).insertInto("partitioned")
    +      }
    +    }
    +  }
    +
    +  test("Reject extra columns") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, data string, extra string, part string)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      intercept[AnalysisException] {
    +        spark.table("source").write.insertInto("partitioned")
    +      }
    +
    +      val data = (1 to 10)
    +          .map(i => (i, s"data-$i", s"${i * i}", if ((i % 2) == 0) "even" else "odd"))
    +          .toDF("id", "data", "extra", "part")
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      spark.table("source").write.option("matchByName", true).insertInto("partitioned")
    +
    +      val expected = data.select("id", "data", "part")
    +      checkAnswer(sql("SELECT * FROM partitioned"), expected.collect().toSeq)
    +    }
    +  }
    +
    +  test("Ignore names when writing by position") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, part string, data string)") // part, data transposed
    +      sql("CREATE TABLE destination (id bigint, data string, part string)")
    +
    +      val data = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" else "odd"))
    +          .toDF("id", "data", "part")
    +
    +      // write into the reordered table by name
    +      data.write.option("matchByName", true).insertInto("source")
    +      checkAnswer(sql("SELECT id, data, part FROM source"), data.collect().toSeq)
    +
    +      val expected = data.select($"id", $"part" as "data", $"data" as "part")
    +
    +      // this produces a warning, but writes src.part -> dest.data and src.data -> dest.part
    +      spark.table("source").write.insertInto("destination")
    +      checkAnswer(sql("SELECT id, data, part FROM destination"), expected.collect().toSeq)
    +    }
    +  }
    +
    +  test("Reorder columns by name") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (data string, part string, id bigint)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      val data = (1 to 10).map(i => (s"data-$i", if ((i % 2) == 0) "even" else "odd", i))
    +          .toDF("data", "part", "id")
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      spark.table("source").write.option("matchByName", true).insertInto("partitioned")
    --- End diff --
    
    I'm not sure I understand the previous discussion between you and @yhuai, but it looks like @yhuai suggested that we should only allow position-based resolution for `insertInto` and add name-based resolution in a follow-up PR?




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-212082957
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56239/
    Test FAILed.




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-212082374
  
    **[Test build #56239 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56239/consoleFull)** for PR 12313 at commit [`c3e8561`](https://github.com/apache/spark/commit/c3e8561b501cf97b654a06ed917a08a1dcd84ab6).
     * This patch **fails Spark unit tests**.
     * This patch **does not merge cleanly**.
     * This patch adds the following public classes _(experimental)_:
      * `  class ResolveOutputColumns extends Rule[LogicalPlan] `




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-220417058
  
    **[Test build #58876 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58876/consoleFull)** for PR 12313 at commit [`3a24e36`](https://github.com/apache/spark/commit/3a24e36ceb9b815e8c933723e22dbdbfed35c840).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59794/
    Test FAILed.




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    @yhuai, whatever release you want to target is fine with me, but I don't think we should block this on a design doc for cleaning up the `DataFrameWriter`. I'm all for writing one, and I plan to participate in that design, but the scope of that work is quite a bit larger than this PR's.

    Since this PR doesn't change the public API, I don't think we should wait until we have a plan for the public API (design doc goal 1) to commit it. Similarly for the second goal: missing columns, extra columns, and metastore updates are beyond the scope here, since this PR can simply require that the number of columns matches, which is the most conservative strategy (and what is now implemented).

    The remaining issue is whether it is a good idea to include the by-name code. I think the concern is that it may change based on the API design doc, but that's really unlikely. For example, adding the option to `InsertIntoTable` will be required unless we duplicate that logical node, which I think is an unlikely choice.

    I'm just trying to avoid dragging this out much longer, or needlessly splitting the work up and having to keep rebasing the changes on master. I could be wrong, so if I'm missing something here, please let me know (and thanks for being patient).

    Also, I addressed @cloud-fan's comments and rebased on master.




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    This is addressed by #21305.




[GitHub] spark pull request #12313: [SPARK-14543] [SQL] Improve InsertIntoTable colum...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12313#discussion_r65582703
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -348,28 +348,41 @@ case class InsertIntoTable(
         partition: Map[String, Option[String]],
         child: LogicalPlan,
         overwrite: Boolean,
    -    ifNotExists: Boolean)
    +    ifNotExists: Boolean,
    +    options: Map[String, String])
    --- End diff --
    
    As this PR won't do the `byName` resolution, should we remove this map parameter?




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    **[Test build #59790 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59790/consoleFull)** for PR 12313 at commit [`fbfdad1`](https://github.com/apache/spark/commit/fbfdad1aeb3da64562304ceedc04cc3a2825008b).
     * This patch passes all tests.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    **[Test build #59894 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59894/consoleFull)** for PR 12313 at commit [`302464b`](https://github.com/apache/spark/commit/302464b1b7f0eb8432e7816757b393cec9230b78).




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-213634735
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-213205812
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56605/
    Test FAILed.




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-213559057
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56708/
    Test PASSed.




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-213107828
  
    **[Test build #56586 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56586/consoleFull)** for PR 12313 at commit [`fcacba5`](https://github.com/apache/spark/commit/fcacba51fd3d765ab4085146629c4366d333a038).




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-221151120
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59159/
    Test PASSed.




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-208656926
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82245/
    Test FAILed.




[GitHub] spark pull request #12313: [SPARK-14543] [SQL] Improve InsertIntoTable colum...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12313#discussion_r65767128
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala ---
    +  test("Reorder columns by name") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (data string, part string, id bigint)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      val data = (1 to 10).map(i => (s"data-$i", if ((i % 2) == 0) "even" else "odd", i))
    +          .toDF("data", "part", "id")
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      spark.table("source").write.option("matchByName", true).insertInto("partitioned")
    --- End diff --
    
    It sounds like you're suggesting that `saveAsTable` should be used instead of `insertInto`?
    
    It may be more powerful, but that wouldn't address the problem. When I need to insert into a table, I naturally call `insertInto`; that call implies I'm writing into an existing table, while `saveAsTable` does not. That makes sense, considering the plans that implement them: `insertInto` should fail if it can't find the table, while `saveAsTable` clearly should not.

    Also, what about the expectations that come with `saveAsTable`? If the table could always be created, that would require extra permissions. Configuration is also a problem: the user would need to configure the writer in case it creates the table, and would always need to specify partitioning.
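
    To make the distinction concrete, a sketch of the two call patterns (`df` and the table names are placeholders; the behavior is as described above, with `insertInto` requiring an existing table and `saveAsTable` possibly creating one):

    ```scala
    // insertInto: the destination must already exist; analysis should
    // fail if the table cannot be found.
    df.write.insertInto("db.existing_table")

    // saveAsTable: may create the table, which requires create permissions
    // and writer-side configuration up front.
    df.write
      .format("parquet")      // hypothetical format choice
      .partitionBy("part")    // partitioning must be supplied by the writer
      .saveAsTable("db.new_table")
    ```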




[GitHub] spark pull request #12313: [SPARK-14543] [SQL] Improve InsertIntoTable colum...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12313#discussion_r65773964
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala ---
    +  test("Reorder columns by name") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (data string, part string, id bigint)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      val data = (1 to 10).map(i => (s"data-$i", if ((i % 2) == 0) "even" else "odd", i))
    +          .toDF("data", "part", "id")
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      spark.table("source").write.option("matchByName", true).insertInto("partitioned")
    --- End diff --
    
    `saveAsTable` is really not a good name, given that it supports `append`, `overwrite`, and `create` modes.

    I think we should have `saveAsTable` always create a new table and `insertInto` always insert into an existing table, and both of them should support by-position and by-name resolution.

    But this needs a lot more refactoring/abstraction; maybe we should do it later and only focus on improving the by-position resolution of `insertInto` in this PR?
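
    For reference, a sketch of the modes `saveAsTable` accepts today, which is what drives the naming concern above (`df` is a placeholder; the modes are the standard `SaveMode` values):

    ```scala
    import org.apache.spark.sql.SaveMode

    df.write.mode(SaveMode.Append).saveAsTable("t")          // insert-like
    df.write.mode(SaveMode.Overwrite).saveAsTable("t")       // replace contents
    df.write.mode(SaveMode.ErrorIfExists).saveAsTable("t")   // pure create
    ```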




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    @rdblue Thank you for updating the patch. I was out of town late last week and was busy at Spark Summit early this week. Sorry for my late reply. Having name-based resolution is very useful! Since this is a major improvement and we are pretty late in the 2.0 QA period, how about we re-target the work to 2.1?

    Also, I feel it would be good to have a short design doc to discuss the following:
    1. The `DataFrameWriter` and SQL interfaces, and how to switch between matching by ordinals and matching by names.
    2. Approaches for handling missing fields and new fields (both top-level and inner fields), including the semantics in each case and how to maintain the schema in the metastore.

    Based on the design, we can break the work into multiple pieces and get them merged for the 2.1 release.

    What do you think?




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-222566966
  
    @rdblue How about we use a separate PR for the work of adding `insertByNameInto`? It would be easier to review, and the discussion on the API name/semantics would not block the other improvements in this PR.
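
    If it helps the discussion, the two spellings side by side (the first is purely hypothetical and not implemented in this PR; the second is what the current tests exercise):

    ```scala
    // Hypothetical dedicated API:
    df.write.insertByNameInto("destination")

    // Option-based spelling used in the tests:
    df.write.option("matchByName", true).insertInto("destination")
    ```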




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    **[Test build #82245 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82245/consoleFull)** for PR 12313 at commit [`906e68d`](https://github.com/apache/spark/commit/906e68d071daf4e2f15b0f2017b248b872bb6285).




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-208656931
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55558/
    Test FAILed.




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-217942335
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58152/
    Test FAILed.




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-217973164
  
    **[Test build #58156 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58156/consoleFull)** for PR 12313 at commit [`ab23c9c`](https://github.com/apache/spark/commit/ab23c9ce22ed09c5a228b32cf8e6ca1f0e7a56b1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #12313: [SPARK-14543] [SQL] Improve InsertIntoTable colum...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12313#discussion_r65757163
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala ---
    +  test("Reorder columns by name") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (data string, part string, id bigint)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      val data = (1 to 10).map(i => (s"data-$i", if ((i % 2) == 0) "even" else "odd", i))
    +          .toDF("data", "part", "id")
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      spark.table("source").write.option("matchByName", true).insertInto("partitioned")
    --- End diff --
    
    `saveAsTable` uses a `CreateTableUsingAsSelect`, while `insertInto` uses `InsertIntoTable`.
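
    In code terms (a sketch; `df` and `t` are placeholders):

    ```scala
    df.write.insertInto("t")    // plans an InsertIntoTable node
    df.write.saveAsTable("t")   // plans a CreateTableUsingAsSelect node
    ```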




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-221151119
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-213150245
  
    **[Test build #56586 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56586/consoleFull)** for PR 12313 at commit [`fcacba5`](https://github.com/apache/spark/commit/fcacba51fd3d765ab4085146629c4366d333a038).
     * This patch **fails Spark unit tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-213634733
  
    **[Test build #56757 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56757/consoleFull)** for PR 12313 at commit [`c443b32`](https://github.com/apache/spark/commit/c443b324099ce4937a4fbf38408e3b0be034658f).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-212644576
  
    Build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-208638398
  
    **[Test build #55558 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55558/consoleFull)** for PR 12313 at commit [`c4c6820`](https://github.com/apache/spark/commit/c4c6820e2fa4bfeb178834d67dffdc0a26b1dda4).




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-208968933
  
    **[Test build #55616 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55616/consoleFull)** for PR 12313 at commit [`f2186e5`](https://github.com/apache/spark/commit/f2186e5b2e0d569c31f1bf05072949fb0f5a4358).




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-222205450
  
    @yhuai, I'll answer #2 first since it's quick: the column names are used to create a projection of the incoming data frame so any extra columns aren't selected and are ignored. If we think that's a bad assumption, I'm happy to add an analysis exception or an option to turn on/off strict checking for that case. What do you think is the right thing to do?
    
    For #1, can you explain why you think it is better to separate the by-name column resolution into a separate command? The problem I'm trying to fix is that the SQL-based behavior of `insertInto` is not obvious when using data frames because users expect the API to work like normal serialization libraries, where names matter rather than order. If we were to leave `insertInto` as it is and add another command, then I think it would be more confusing, not less. Adding another command won't make it obvious that `insertInto` doesn't do what a user expects.
    
    The data frame API doesn't make the ordering explicit, unlike SQL. For example, `df.groupBy("c").agg(max("d") as "m")` results in a dataframe with `c` and `m`. Adding `c` automatically isn't carried over from SQL and it isn't obvious whether it will be missing, at the beginning, or at the end. I think this flexibility is a strength of data frames, but it means that we should handle some situations, like writing, in the way that users expect. Users are commonly referring to columns by name so I think it makes sense to have an option for this in the writer API.
    
    Say I have one table with columns `a`, `b`, `c` and second with columns `a`, `c` that's partitioned by `b`. Using `spark.table("one").write.insertInto("two")` works as expected. But if I go the other way, `spark.table("two").write.insertInto("one")` then it silently maps `c -> b` and `b -> c` even if `b` is a string and `c` is an int. I don't think this aligns with user expectations because the user didn't select columns in either case. In SQL, the column order would be explicit; even selecting `*` implies that there is an ordering to worry about.
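
    To make the swap concrete, a minimal sketch of the example above (the table definitions are assumed; `matchByName` is the option this PR adds):

    ```scala
    spark.sql("CREATE TABLE one (a int, b string, c int)")
    spark.sql("CREATE TABLE two (a int, c int) PARTITIONED BY (b string)")

    // Reading `two` yields columns in the order (a, c, b) because partition
    // columns come last, so a by-position write into `one` (a, b, c) maps
    // two.c -> one.b and two.b -> one.c even though the types differ.
    spark.table("two").write.insertInto("one")

    // With by-name resolution the columns line up regardless of order.
    spark.table("two").write.option("matchByName", true).insertInto("one")
    ```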




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    Hi all, I'm just wondering where we are on this.




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-221120531
  
    **[Test build #59159 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59159/consoleFull)** for PR 12313 at commit [`06030c2`](https://github.com/apache/spark/commit/06030c2cd8947680c0127c4b09ab99ee3371214b).




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12313#discussion_r63896316
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -348,28 +348,41 @@ case class InsertIntoTable(
         partition: Map[String, Option[String]],
         child: LogicalPlan,
         overwrite: Boolean,
    -    ifNotExists: Boolean)
    +    ifNotExists: Boolean,
    +    options: Map[String, String])
    --- End diff --
    
    Does this options map only contain `matchByName`?




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    **[Test build #82245 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82245/consoleFull)** for PR 12313 at commit [`906e68d`](https://github.com/apache/spark/commit/906e68d071daf4e2f15b0f2017b248b872bb6285).
     * This patch **fails PySpark unit tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.




[GitHub] spark pull request #12313: [SPARK-14543] [SQL] Improve InsertIntoTable colum...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12313#discussion_r65743060
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala ---
    @@ -284,8 +284,128 @@ class InsertIntoHiveTableSuite extends QueryTest with TestHiveSingleton with Bef
           val data = (1 to 10).map(i => (i.toLong, s"data-$i")).toDF("id", "data")
     
           val logical = InsertIntoTable(spark.table("partitioned").logicalPlan,
    -        Map("part" -> None), data.logicalPlan, overwrite = false, ifNotExists = false)
    +        Map("part" -> None), data.logicalPlan, overwrite = false, ifNotExists = false,
    +        Map("matchByName" -> "true"))
           assert(!logical.resolved, "Should not resolve: missing partition data")
         }
       }
    +
    +  test("Insert unnamed expressions by position") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, part string)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      val expected = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" else "odd"))
    +          .toDF("id", "data", "part")
    +      val data = expected.select("id", "part")
    +
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      // should be able to insert an expression when NOT mapping columns by name
    +      spark.table("source").selectExpr("id", "part", "CONCAT('data-', id)")
    +          .write.insertInto("partitioned")
    +      checkAnswer(sql("SELECT * FROM partitioned"), expected.collect().toSeq)
    +    }
    +  }
    +
    +  test("Insert expression by name") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, part string)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      val expected = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" else "odd"))
    +          .toDF("id", "data", "part")
    +      val data = expected.select("id", "part")
    +
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      intercept[AnalysisException] {
    +        // also a problem when mapping by name
    +        spark.table("source").selectExpr("id", "part", "CONCAT('data-', id)")
    +            .write.option("matchByName", true).insertInto("partitioned")
    +      }
    +
    +      // should be able to insert an expression using AS when mapping columns by name
    +      spark.table("source").selectExpr("id", "part", "CONCAT('data-', id) as data")
    +          .write.option("matchByName", true).insertInto("partitioned")
    +      checkAnswer(sql("SELECT * FROM partitioned"), expected.collect().toSeq)
    +    }
    +  }
    +
    +  test("Reject missing columns") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, part string)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      intercept[AnalysisException] {
    +        spark.table("source").write.insertInto("partitioned")
    +      }
    +
    +      intercept[AnalysisException] {
    +        // also a problem when mapping by name
    +        spark.table("source").write.option("matchByName", true).insertInto("partitioned")
    +      }
    +    }
    +  }
    +
    +  test("Reject extra columns") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, data string, extra string, part string)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      intercept[AnalysisException] {
    +        spark.table("source").write.insertInto("partitioned")
    +      }
    +
    +      val data = (1 to 10)
    +          .map(i => (i, s"data-$i", s"${i * i}", if ((i % 2) == 0) "even" else "odd"))
    +          .toDF("id", "data", "extra", "part")
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      spark.table("source").write.option("matchByName", true).insertInto("partitioned")
    +
    +      val expected = data.select("id", "data", "part")
    +      checkAnswer(sql("SELECT * FROM partitioned"), expected.collect().toSeq)
    +    }
    +  }
    +
    +  test("Ignore names when writing by position") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, part string, data string)") // part, data transposed
    +      sql("CREATE TABLE destination (id bigint, data string, part string)")
    +
    +      val data = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" else "odd"))
    +          .toDF("id", "data", "part")
    +
    +      // write into the reordered table by name
    +      data.write.option("matchByName", true).insertInto("source")
    +      checkAnswer(sql("SELECT id, data, part FROM source"), data.collect().toSeq)
    +
    +      val expected = data.select($"id", $"part" as "data", $"data" as "part")
    +
    +      // this produces a warning, but writes src.part -> dest.data and src.data -> dest.part
    +      spark.table("source").write.insertInto("destination")
    +      checkAnswer(sql("SELECT id, data, part FROM destination"), expected.collect().toSeq)
    +    }
    +  }
    +
    +  test("Reorder columns by name") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (data string, part string, id bigint)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      val data = (1 to 10).map(i => (s"data-$i", if ((i % 2) == 0) "even" else "odd", i))
    +          .toDF("data", "part", "id")
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      spark.table("source").write.option("matchByName", true).insertInto("partitioned")
    --- End diff --
    
    I thought the suggestion was to remove the public API so we could discuss that separately. I think we certainly want the insert by name functionality, considering some of the write methods already do it. It's just a question of how to expose it to the user. Considering this already works correctly and is tested, I think the right path forward is to get it in now, right?




[GitHub] spark pull request #12313: [SPARK-14543] [SQL] Improve InsertIntoTable colum...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12313#discussion_r65822892
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -505,6 +506,117 @@ class Analyzer(
         }
       }
     
    +  object ResolveOutputColumns extends Rule[LogicalPlan] {
    +    def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
    +      case ins @ InsertIntoTable(relation: LogicalPlan, partition, _, _, _, _)
    +          if relation.resolved && !ins.resolved =>
    +        resolveOutputColumns(ins, expectedColumns(relation, partition), relation.toString)
    +    }
    +
    +    private def resolveOutputColumns(
    +        insertInto: InsertIntoTable,
    +        columns: Seq[Attribute],
    +        relation: String) = {
    +      val resolved = if (insertInto.isMatchByName) {
    +        projectAndCastOutputColumns(columns, insertInto.child, relation)
    +      } else {
    +        castAndRenameOutputColumns(columns, insertInto.child, relation)
    +      }
    +
    +      if (resolved == insertInto.child.output) {
    +        insertInto
    +      } else {
    +        insertInto.copy(child = Project(resolved, insertInto.child))
    +      }
    +    }
    +
    +    /**
    +     * Resolves output columns by input column name, adding casts if necessary.
    +     */
    +    private def projectAndCastOutputColumns(
    +        output: Seq[Attribute],
    +        data: LogicalPlan,
    +        relation: String): Seq[NamedExpression] = {
    +      output.map { col =>
    +        data.resolveQuoted(col.name, resolver) match {
    +          case Some(inCol) if col.dataType != inCol.dataType =>
    +            Alias(UpCast(inCol, col.dataType, Seq()), col.name)()
    +          case Some(inCol) => inCol
    +          case None =>
    +            throw new AnalysisException(
    +              s"Cannot resolve ${col.name} in ${data.output.mkString(",")}")
    +        }
    +      }
    +    }
    +
    +    private def castAndRenameOutputColumns(
    +        output: Seq[Attribute],
    +        data: LogicalPlan,
    +        relation: String): Seq[NamedExpression] = {
    +      val outputNames = output.map(_.name)
    +      // incoming expressions may not have names
    +      val inputNames = data.output.flatMap(col => Option(col.name))
    +      if (output.size > data.output.size) {
    +        // always a problem
    +        throw new AnalysisException(
    +          s"""Not enough data columns to write into $relation:
    +             |Data columns: ${data.output.mkString(",")}
    +             |Table columns: ${outputNames.mkString(",")}""".stripMargin)
    +      } else if (output.size < data.output.size) {
    +        if (outputNames.toSet.subsetOf(inputNames.toSet)) {
    --- End diff --
    
    Do we really need to distinguish these two cases? How about we just say that the number of columns doesn't match?




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-220417372
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75047/
    Test PASSed.




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-208656669
  
    **[Test build #55558 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55558/consoleFull)** for PR 12313 at commit [`c4c6820`](https://github.com/apache/spark/commit/c4c6820e2fa4bfeb178834d67dffdc0a26b1dda4).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  class ResolveOutputColumns extends Rule[LogicalPlan] `




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-213150377
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56586/
    Test FAILed.




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-213157028
  
    **[Test build #56605 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56605/consoleFull)** for PR 12313 at commit [`7516ffd`](https://github.com/apache/spark/commit/7516ffd2e6510e6f9805563970637b019a4e5fd9).




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60327/
    Test PASSed.




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    **[Test build #59794 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59794/consoleFull)** for PR 12313 at commit [`ecaf95f`](https://github.com/apache/spark/commit/ecaf95fae2135e368d6d10d2bb1783ffd7599bec).




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-217974405
  
    @liancheng, @cloud-fan, this commit is a follow-up to #12239 that fixes column resolution when writing to both Hive MetastoreRelations and HadoopFsRelations. Could you review it? I think it would be a good addition to 2.0.0 as well.




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-217973445
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58156/
    Test PASSed.




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-220389967
  
    @cloud-fan, @liancheng, thanks for reviewing! I've rebased on master and fixed your comments so far.




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-217942331
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request #12313: [SPARK-14543] [SQL] Improve InsertIntoTable colum...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12313#discussion_r65759591
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala ---
    @@ -284,8 +284,128 @@ class InsertIntoHiveTableSuite extends QueryTest with TestHiveSingleton with Bef
           val data = (1 to 10).map(i => (i.toLong, s"data-$i")).toDF("id", "data")
     
           val logical = InsertIntoTable(spark.table("partitioned").logicalPlan,
    -        Map("part" -> None), data.logicalPlan, overwrite = false, ifNotExists = false)
    +        Map("part" -> None), data.logicalPlan, overwrite = false, ifNotExists = false,
    +        Map("matchByName" -> "true"))
           assert(!logical.resolved, "Should not resolve: missing partition data")
         }
       }
    +
    +  test("Insert unnamed expressions by position") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, part string)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      val expected = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" else "odd"))
    +          .toDF("id", "data", "part")
    +      val data = expected.select("id", "part")
    +
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      // should be able to insert an expression when NOT mapping columns by name
    +      spark.table("source").selectExpr("id", "part", "CONCAT('data-', id)")
    +          .write.insertInto("partitioned")
    +      checkAnswer(sql("SELECT * FROM partitioned"), expected.collect().toSeq)
    +    }
    +  }
    +
    +  test("Insert expression by name") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, part string)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      val expected = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" else "odd"))
    +          .toDF("id", "data", "part")
    +      val data = expected.select("id", "part")
    +
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      intercept[AnalysisException] {
    +        // also a problem when mapping by name
    +        spark.table("source").selectExpr("id", "part", "CONCAT('data-', id)")
    +            .write.option("matchByName", true).insertInto("partitioned")
    +      }
    +
    +      // should be able to insert an expression using AS when mapping columns by name
    +      spark.table("source").selectExpr("id", "part", "CONCAT('data-', id) as data")
    +          .write.option("matchByName", true).insertInto("partitioned")
    +      checkAnswer(sql("SELECT * FROM partitioned"), expected.collect().toSeq)
    +    }
    +  }
    +
    +  test("Reject missing columns") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, part string)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      intercept[AnalysisException] {
    +        spark.table("source").write.insertInto("partitioned")
    +      }
    +
    +      intercept[AnalysisException] {
    +        // also a problem when mapping by name
    +        spark.table("source").write.option("matchByName", true).insertInto("partitioned")
    +      }
    +    }
    +  }
    +
    +  test("Reject extra columns") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, data string, extra string, part string)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      intercept[AnalysisException] {
    +        spark.table("source").write.insertInto("partitioned")
    +      }
    +
    +      val data = (1 to 10)
    +          .map(i => (i, s"data-$i", s"${i * i}", if ((i % 2) == 0) "even" else "odd"))
    +          .toDF("id", "data", "extra", "part")
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      spark.table("source").write.option("matchByName", true).insertInto("partitioned")
    +
    +      val expected = data.select("id", "data", "part")
    +      checkAnswer(sql("SELECT * FROM partitioned"), expected.collect().toSeq)
    +    }
    +  }
    +
    +  test("Ignore names when writing by position") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, part string, data string)") // part, data transposed
    +      sql("CREATE TABLE destination (id bigint, data string, part string)")
    +
    +      val data = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" else "odd"))
    +          .toDF("id", "data", "part")
    +
    +      // write into the reordered table by name
    +      data.write.option("matchByName", true).insertInto("source")
    +      checkAnswer(sql("SELECT id, data, part FROM source"), data.collect().toSeq)
    +
    +      val expected = data.select($"id", $"part" as "data", $"data" as "part")
    +
    +      // this produces a warning, but writes src.part -> dest.data and src.data -> dest.part
    +      spark.table("source").write.insertInto("destination")
    +      checkAnswer(sql("SELECT id, data, part FROM destination"), expected.collect().toSeq)
    +    }
    +  }
    +
    +  test("Reorder columns by name") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (data string, part string, id bigint)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      val data = (1 to 10).map(i => (s"data-$i", if ((i % 2) == 0) "even" else "odd", i))
    +          .toDF("data", "part", "id")
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      spark.table("source").write.option("matchByName", true).insertInto("partitioned")
    --- End diff --
    
    `saveAsTable` also supports `overwrite` mode; it looks like `saveAsTable` is more powerful than `insertInto` in everything except by-position resolution?
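
    For reference, a minimal sketch of the `overwrite` behavior being compared, assuming a DataFrame `df` and an existing table `t`:

    ```scala
    import org.apache.spark.sql.SaveMode

    // saveAsTable with Overwrite replaces the table definition and its data.
    df.write.mode(SaveMode.Overwrite).saveAsTable("t")

    // insertInto with Overwrite keeps the table definition and overwrites the
    // existing data (or the matching partitions).
    df.write.mode(SaveMode.Overwrite).insertInto("t")
    ```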




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-208998941
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55616/
    Test FAILed.




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59792/
    Test FAILed.




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-221150973
  
    **[Test build #59159 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59159/consoleFull)** for PR 12313 at commit [`06030c2`](https://github.com/apache/spark/commit/06030c2cd8947680c0127c4b09ab99ee3371214b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12313#discussion_r63895917
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -498,6 +499,117 @@ class Analyzer(
         }
       }
     
    +  val ResolveOutputColumns = new ResolveOutputColumns
    +  class ResolveOutputColumns extends Rule[LogicalPlan] {
    --- End diff --
    
    We can simply use `object ResolveOutputColumns` here
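
    That is, keeping the rule body from the diff:

    ```scala
    object ResolveOutputColumns extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
        case ins @ InsertIntoTable(relation: LogicalPlan, partition, _, _, _, _)
            if ins.childrenResolved && !ins.resolved =>
          resolveOutputColumns(ins, expectedColumns(relation, partition), relation.toString)
      }
    }
    ```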




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-213503342
  
    **[Test build #56708 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56708/consoleFull)** for PR 12313 at commit [`42ff7b7`](https://github.com/apache/spark/commit/42ff7b7e533f6d9620a2c5253f75780bca1805a2).




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12313#discussion_r63971653
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -348,28 +348,41 @@ case class InsertIntoTable(
         partition: Map[String, Option[String]],
         child: LogicalPlan,
         overwrite: Boolean,
    -    ifNotExists: Boolean)
    +    ifNotExists: Boolean,
    +    options: Map[String, String])
    --- End diff --
    
    I just opened another PR that requires this map, #13206. That adds a hint for the number of writers per output partition so the optimizer can add the appropriate repartition and sort operation.
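
    As an illustration only (the option key below is invented; #13206 defines the real one), any such hint travels through the same options map on `InsertIntoTable`:

    ```scala
    // Hypothetical key, shown only to illustrate the options plumbing.
    df.write
        .option("writersPerPartition", "4")
        .insertInto("partitioned_table")
    ```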




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59894/
    Test PASSed.




[GitHub] spark pull request #12313: [SPARK-14543] [SQL] Improve InsertIntoTable colum...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12313#discussion_r66687753
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -505,6 +506,117 @@ class Analyzer(
         }
       }
     
    +  object ResolveOutputColumns extends Rule[LogicalPlan] {
    +    def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
    +      case ins @ InsertIntoTable(relation: LogicalPlan, partition, _, _, _, _)
    +          if relation.resolved && !ins.resolved =>
    +        resolveOutputColumns(ins, expectedColumns(relation, partition), relation.toString)
    +    }
    +
    +    private def resolveOutputColumns(
    +        insertInto: InsertIntoTable,
    +        columns: Seq[Attribute],
    +        relation: String) = {
    +      val resolved = if (insertInto.isMatchByName) {
    +        projectAndCastOutputColumns(columns, insertInto.child, relation)
    +      } else {
    +        castAndRenameOutputColumns(columns, insertInto.child, relation)
    +      }
    +
    +      if (resolved == insertInto.child.output) {
    +        insertInto
    +      } else {
    +        insertInto.copy(child = Project(resolved, insertInto.child))
    +      }
    +    }
    +
    +    /**
    +     * Resolves output columns by input column name, adding casts if necessary.
    +     */
    +    private def projectAndCastOutputColumns(
    +        output: Seq[Attribute],
    +        data: LogicalPlan,
    +        relation: String): Seq[NamedExpression] = {
    +      output.map { col =>
    +        data.resolveQuoted(col.name, resolver) match {
    +          case Some(inCol) if col.dataType != inCol.dataType =>
    +            Alias(UpCast(inCol, col.dataType, Seq()), col.name)()
    +          case Some(inCol) => inCol
    +          case None =>
    +            throw new AnalysisException(
    +              s"Cannot resolve ${col.name} in ${data.output.mkString(",")}")
    +        }
    +      }
    +    }
    +
    +    private def castAndRenameOutputColumns(
    +        output: Seq[Attribute],
    +        data: LogicalPlan,
    +        relation: String): Seq[NamedExpression] = {
    +      val outputNames = output.map(_.name)
    +      // incoming expressions may not have names
    +      val inputNames = data.output.flatMap(col => Option(col.name))
    +      if (output.size > data.output.size) {
    +        // always a problem
    +        throw new AnalysisException(
    +          s"""Not enough data columns to write into $relation:
    +             |Data columns: ${data.output.mkString(",")}
    +             |Table columns: ${outputNames.mkString(",")}""".stripMargin)
    +      } else if (output.size < data.output.size) {
    +        if (outputNames.toSet.subsetOf(inputNames.toSet)) {
    +          throw new AnalysisException(
    +            s"""Table column names are a subset of the input data columns:
    +               |Data columns: ${inputNames.mkString(",")}
    +               |Table columns: ${outputNames.mkString(",")}""".stripMargin)
    +        } else {
    +          // be conservative and fail if there are too many columns
    +          throw new AnalysisException(
    +            s"""Extra data columns to write into $relation:
    +               |Data columns: ${data.output.mkString(",")}
    +               |Table columns: ${outputNames.mkString(",")}""".stripMargin)
    +        }
    +      } else {
    +        // check for reordered names and warn. this may be on purpose, so it isn't an error.
    +        if (outputNames.toSet == inputNames.toSet && outputNames != inputNames) {
    +          logWarning(
    +            s"""Data column names match the table in a different order:
    +               |Data columns: ${inputNames.mkString(",")}
    +               |Table columns: ${outputNames.mkString(",")}""".stripMargin)
    +        }
    +      }
    +
    +      data.output.zip(output).map {
    +        case (in, out) if !in.dataType.sameType(out.dataType) =>
    +          Alias(Cast(in, out.dataType), out.name)()
    +        case (in, out) if in.name != out.name =>
    +          Alias(in, out.name)()
    +        case (in, _) => in
    +      }
    +    }
    +
    +    private def expectedColumns(
    +        data: LogicalPlan,
    +        partitionData: Map[String, Option[String]]): Seq[Attribute] = {
    +      data match {
    +        case partitioned: CatalogRelation =>
    +          val tablePartitionNames = partitioned.catalogTable.partitionColumns.map(_.name)
    +          val (inputPartCols, dataColumns) = data.output.partition { attr =>
    +            tablePartitionNames.contains(attr.name)
    +          }
    +          // Get the dynamic partition columns in partition order
    +          val dynamicNames = tablePartitionNames.filter(
    +            name => partitionData.getOrElse(name, None).isEmpty)
    --- End diff --
    
    Using `contains` would include partitions that are set statically for this query, which have values like `Some("static-val")`. This doesn't happen through the `DataFrameWriter`, but is valid HiveQL.
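
    Concretely, following the `partition: Map[String, Option[String]]` parameter from the diff (the SQL shown is illustrative HiveQL):

    ```scala
    // Dynamic partition:  INSERT INTO TABLE t PARTITION (part) SELECT ...
    //   partition = Map("part" -> None)
    // Static partition:   INSERT INTO TABLE t PARTITION (part = 'static-val') SELECT ...
    //   partition = Map("part" -> Some("static-val"))

    val partition = Map("part" -> Some("static-val"))
    // getOrElse(name, None).isEmpty is true only for dynamic partitions;
    // a plain `contains` check would also match this static case.
    partition.getOrElse("part", None).isEmpty  // false: static, not dynamic
    ```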




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-212546172
  
    **[Test build #56386 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56386/consoleFull)** for PR 12313 at commit [`0a6ec77`](https://github.com/apache/spark/commit/0a6ec77ae3d3ec932c37bdd63ae20d6b71180fd1).




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    **[Test build #59792 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59792/consoleFull)** for PR 12313 at commit [`928e21d`](https://github.com/apache/spark/commit/928e21d962104af6e8f9f1d04251a1512e3e4924).




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-212644577
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56411/
    Test FAILed.




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-212038454
  
    **[Test build #56239 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56239/consoleFull)** for PR 12313 at commit [`c3e8561`](https://github.com/apache/spark/commit/c3e8561b501cf97b654a06ed917a08a1dcd84ab6).




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-208998937
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-213634737
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56757/
    Test FAILed.




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12313#discussion_r63912028
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -498,6 +499,117 @@ class Analyzer(
         }
       }
     
    +  val ResolveOutputColumns = new ResolveOutputColumns
    +  class ResolveOutputColumns extends Rule[LogicalPlan] {
    +    def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
    +      case ins @ InsertIntoTable(relation: LogicalPlan, partition, _, _, _, _)
    +          if ins.childrenResolved && !ins.resolved =>
    +        resolveOutputColumns(ins, expectedColumns(relation, partition), relation.toString)
    +    }
    +
    +    protected def resolveOutputColumns(insertInto: InsertIntoTable,
    +                                       columns: Seq[Attribute],
    +                                       relation: String) = {
    +      val resolved = if (insertInto.isMatchByName) {
    +        projectAndCastOutputColumns(columns, insertInto.child, relation)
    +      } else {
    +        castAndRenameOutputColumns(columns, insertInto.child, relation)
    +      }
    +
    +      if (resolved == insertInto.child.output) {
    +        insertInto
    +      } else {
    +        insertInto.copy(child = Project(resolved, insertInto.child))
    +      }
    +    }
    +
    +    /**
    +     * Resolves output columns by input column name, adding casts if necessary.
    +     */
    +    private def projectAndCastOutputColumns(output: Seq[Attribute],
    +                                            data: LogicalPlan,
    +                                            relation: String): Seq[NamedExpression] = {
    +      output.map { col =>
    +        data.resolveQuoted(col.name, resolver) match {
    +          case Some(inCol) if col.dataType != inCol.dataType =>
    +            Alias(UpCast(inCol, col.dataType, Seq()), col.name)()
    +          case Some(inCol) => inCol
    +          case None =>
    +            throw new AnalysisException(
    +              s"Cannot resolve ${col.name} in ${data.output.mkString(",")}")
    +        }
    +      }
    +    }
    +
    +    private def castAndRenameOutputColumns(output: Seq[Attribute],
    +                                           data: LogicalPlan,
    +                                           relation: String): Seq[NamedExpression] = {
    --- End diff --
    
    Same as above.




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    Build finished. Test PASSed.




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    **[Test build #75047 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75047/consoleFull)** for PR 12313 at commit [`906e68d`](https://github.com/apache/spark/commit/906e68d071daf4e2f15b0f2017b248b872bb6285).




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    @yhuai, I've removed the public API additions so we can get the changes in, as you suggested. I've also rebased on the current master so this can be merged. I'll fix any test failures that come up. Thanks for looking at this!




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    Build finished. Test PASSed.




[GitHub] spark pull request #12313: [SPARK-14543] [SQL] Improve InsertIntoTable colum...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12313#discussion_r65778807
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala ---
    @@ -284,8 +284,128 @@ class InsertIntoHiveTableSuite extends QueryTest with TestHiveSingleton with Bef
           val data = (1 to 10).map(i => (i.toLong, s"data-$i")).toDF("id", "data")
     
           val logical = InsertIntoTable(spark.table("partitioned").logicalPlan,
    -        Map("part" -> None), data.logicalPlan, overwrite = false, ifNotExists = false)
    +        Map("part" -> None), data.logicalPlan, overwrite = false, ifNotExists = false,
    +        Map("matchByName" -> "true"))
           assert(!logical.resolved, "Should not resolve: missing partition data")
         }
       }
    +
    +  test("Insert unnamed expressions by position") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, part string)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      val expected = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" else "odd"))
    +          .toDF("id", "data", "part")
    +      val data = expected.select("id", "part")
    +
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      // should be able to insert an expression when NOT mapping columns by name
    +      spark.table("source").selectExpr("id", "part", "CONCAT('data-', id)")
    +          .write.insertInto("partitioned")
    +      checkAnswer(sql("SELECT * FROM partitioned"), expected.collect().toSeq)
    +    }
    +  }
    +
    +  test("Insert expression by name") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, part string)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      val expected = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" else "odd"))
    +          .toDF("id", "data", "part")
    +      val data = expected.select("id", "part")
    +
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      intercept[AnalysisException] {
    +        // also a problem when mapping by name
    +        spark.table("source").selectExpr("id", "part", "CONCAT('data-', id)")
    +            .write.option("matchByName", true).insertInto("partitioned")
    +      }
    +
    +      // should be able to insert an expression using AS when mapping columns by name
    +      spark.table("source").selectExpr("id", "part", "CONCAT('data-', id) as data")
    +          .write.option("matchByName", true).insertInto("partitioned")
    +      checkAnswer(sql("SELECT * FROM partitioned"), expected.collect().toSeq)
    +    }
    +  }
    +
    +  test("Reject missing columns") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, part string)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      intercept[AnalysisException] {
    +        spark.table("source").write.insertInto("partitioned")
    +      }
    +
    +      intercept[AnalysisException] {
    +        // also a problem when mapping by name
    +        spark.table("source").write.option("matchByName", true).insertInto("partitioned")
    +      }
    +    }
    +  }
    +
    +  test("Reject extra columns") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, data string, extra string, part string)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      intercept[AnalysisException] {
    +        spark.table("source").write.insertInto("partitioned")
    +      }
    +
    +      val data = (1 to 10)
    +          .map(i => (i, s"data-$i", s"${i * i}", if ((i % 2) == 0) "even" else "odd"))
    +          .toDF("id", "data", "extra", "part")
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      spark.table("source").write.option("matchByName", true).insertInto("partitioned")
    +
    +      val expected = data.select("id", "data", "part")
    +      checkAnswer(sql("SELECT * FROM partitioned"), expected.collect().toSeq)
    +    }
    +  }
    +
    +  test("Ignore names when writing by position") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, part string, data string)") // part, data transposed
    +      sql("CREATE TABLE destination (id bigint, data string, part string)")
    +
    +      val data = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" else "odd"))
    +          .toDF("id", "data", "part")
    +
    +      // write into the reordered table by name
    +      data.write.option("matchByName", true).insertInto("source")
    +      checkAnswer(sql("SELECT id, data, part FROM source"), data.collect().toSeq)
    +
    +      val expected = data.select($"id", $"part" as "data", $"data" as "part")
    +
    +      // this produces a warning, but writes src.part -> dest.data and src.data -> dest.part
    +      spark.table("source").write.insertInto("destination")
    +      checkAnswer(sql("SELECT id, data, part FROM destination"), expected.collect().toSeq)
    +    }
    +  }
    +
    +  test("Reorder columns by name") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (data string, part string, id bigint)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      val data = (1 to 10).map(i => (s"data-$i", if ((i % 2) == 0) "even" else "odd", i))
    +          .toDF("data", "part", "id")
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      spark.table("source").write.option("matchByName", true).insertInto("partitioned")
    --- End diff --
    
    I'll review this PR anyway, cc @marmbrus @andrewor14 to discuss if we should add by-name resolution here.




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    **[Test build #60327 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60327/consoleFull)** for PR 12313 at commit [`906e68d`](https://github.com/apache/spark/commit/906e68d071daf4e2f15b0f2017b248b872bb6285).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-220417377
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58876/
    Test PASSed.




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    If this is too large to merge into 2.0, could @rdblue deliver a small fix that catches the illegal user inputs? Thanks!




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12313#discussion_r63912009
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -498,6 +499,117 @@ class Analyzer(
         }
       }
     
    +  val ResolveOutputColumns = new ResolveOutputColumns
    +  class ResolveOutputColumns extends Rule[LogicalPlan] {
    +    def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
    +      case ins @ InsertIntoTable(relation: LogicalPlan, partition, _, _, _, _)
    +          if ins.childrenResolved && !ins.resolved =>
    +        resolveOutputColumns(ins, expectedColumns(relation, partition), relation.toString)
    +    }
    +
    +    protected def resolveOutputColumns(insertInto: InsertIntoTable,
    +                                       columns: Seq[Attribute],
    +                                       relation: String) = {
    +      val resolved = if (insertInto.isMatchByName) {
    +        projectAndCastOutputColumns(columns, insertInto.child, relation)
    +      } else {
    +        castAndRenameOutputColumns(columns, insertInto.child, relation)
    +      }
    +
    +      if (resolved == insertInto.child.output) {
    +        insertInto
    +      } else {
    +        insertInto.copy(child = Project(resolved, insertInto.child))
    +      }
    +    }
    +
    +    /**
    +     * Resolves output columns by input column name, adding casts if necessary.
    +     */
    +    private def projectAndCastOutputColumns(output: Seq[Attribute],
    +                                            data: LogicalPlan,
    +                                            relation: String): Seq[NamedExpression] = {
    --- End diff --
    
    Same code style issue as @cloud-fan mentioned above.




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12313#discussion_r63896053
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -498,6 +499,117 @@ class Analyzer(
         }
       }
     
    +  val ResolveOutputColumns = new ResolveOutputColumns
    +  class ResolveOutputColumns extends Rule[LogicalPlan] {
    +    def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
    +      case ins @ InsertIntoTable(relation: LogicalPlan, partition, _, _, _, _)
    +          if ins.childrenResolved && !ins.resolved =>
    +        resolveOutputColumns(ins, expectedColumns(relation, partition), relation.toString)
    +    }
    +
    +    protected def resolveOutputColumns(insertInto: InsertIntoTable,
    --- End diff --
    
    nit: code style should be
    ```
    def xxx(
        param1: A,
        param2: B): R = {
      ...
    }
    ```




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12313#discussion_r63908465
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -348,28 +348,41 @@ case class InsertIntoTable(
         partition: Map[String, Option[String]],
         child: LogicalPlan,
         overwrite: Boolean,
    -    ifNotExists: Boolean)
    +    ifNotExists: Boolean,
    +    options: Map[String, String])
    --- End diff --
    
    Right now, yes. But I changed from a boolean to a map because there are other options that will be added, like `filesPerPartition`. Changing the case class requires code changes in several places just to add `_` to a match expression or pass the argument on. Using a map means we can extend InsertIntoTable without updating the signature all the time. Patches should be smaller and have fewer conflicts.
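
    For illustration, a minimal standalone sketch of the options-map approach this comment describes. The `LogicalPlan` stub, the `table` field name, and the `filesPerPartition` accessor are illustrative assumptions, not the merged code:
    
    ```scala
    // Stand-in for org.apache.spark.sql.catalyst.plans.logical.LogicalPlan,
    // so the sketch is self-contained.
    trait LogicalPlan
    
    case class InsertIntoTable(
        table: LogicalPlan,
        partition: Map[String, Option[String]],
        child: LogicalPlan,
        overwrite: Boolean,
        ifNotExists: Boolean,
        options: Map[String, String]) {
    
      // New flags are read from the map, so adding one never changes the
      // signature or the match expressions that destructure this node.
      def isMatchByName: Boolean =
        options.get("matchByName").exists(_.toBoolean)
    
      // Hypothetical future option named in the comment above.
      def filesPerPartition: Option[Int] =
        options.get("filesPerPartition").map(_.toInt)
    }
    ```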




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-221300255
  
    @liancheng, @cloud-fan, I rebased this on master since it had conflicts.




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    **[Test build #59850 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59850/consoleFull)** for PR 12313 at commit [`201d900`](https://github.com/apache/spark/commit/201d900a9ccccd0533e60437893837a34308b4e2).




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-212082951
  
    Build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-222231124
  
    @yhuai, I'll add an option for strict checking.
    
    I agree with you that we need a holistic solution. It's also not ideal that some write methods already map by name while others don't. That's a good argument against adding the `byName` modifier, since it would have to apply to everything written through that `DataFrameWriter`.
    
    What do you think about combining the two by adding an `insertByNameInto` method? That way we preserve the behavior of `insertInto` while making it clear, by contrast, that there are other options. We also avoid a `byName` method that doesn't make sense with the current behavior of `saveAsTable`.
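
    A minimal sketch of what the proposed method could look like, assuming the `matchByName` option from this patch; this illustrates the proposal, it is not the merged API:
    
    ```scala
    import org.apache.spark.sql.DataFrameWriter
    
    object ByNameSyntax {
      implicit class ByNameWriter[T](writer: DataFrameWriter[T]) {
        // Delegates to the existing positional entry point with
        // by-name matching enabled.
        def insertByNameInto(tableName: String): Unit =
          writer.option("matchByName", "true").insertInto(tableName)
      }
    }
    
    // usage:
    //   import ByNameSyntax._
    //   df.write.insertByNameInto("destination")
    ```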




[GitHub] spark pull request #12313: [SPARK-14543] [SQL] Improve InsertIntoTable colum...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue closed the pull request at:

    https://github.com/apache/spark/pull/12313




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-217946972
  
    Retest this please.




[GitHub] spark pull request #12313: [SPARK-14543] [SQL] Improve InsertIntoTable colum...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12313#discussion_r65822816
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -505,6 +506,117 @@ class Analyzer(
         }
       }
     
    +  object ResolveOutputColumns extends Rule[LogicalPlan] {
    +    def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
    --- End diff --
    
    should use `plan.resolveOperators`
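
    For context, a sketch of the suggested shape, reusing the helpers from the diff above; `resolveOperators` skips subtrees the analyzer has already marked as analyzed, so the rule is not re-applied to resolved plans on later passes:
    
    ```scala
    object ResolveOutputColumns extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperators {
        // Identical pattern to the diff; only the traversal method changes.
        case ins @ InsertIntoTable(relation: LogicalPlan, partition, _, _, _, _)
            if relation.resolved && !ins.resolved =>
          resolveOutputColumns(ins, expectedColumns(relation, partition), relation.toString)
      }
    }
    ```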




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    **[Test build #59850 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59850/consoleFull)** for PR 12313 at commit [`201d900`](https://github.com/apache/spark/commit/201d900a9ccccd0533e60437893837a34308b4e2).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    Build finished. Test FAILed.




[GitHub] spark pull request #12313: [SPARK-14543] [SQL] Improve InsertIntoTable colum...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12313#discussion_r65748370
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala ---
    @@ -284,8 +284,128 @@ class InsertIntoHiveTableSuite extends QueryTest with TestHiveSingleton with Bef
           val data = (1 to 10).map(i => (i.toLong, s"data-$i")).toDF("id", "data")
     
           val logical = InsertIntoTable(spark.table("partitioned").logicalPlan,
    -        Map("part" -> None), data.logicalPlan, overwrite = false, ifNotExists = false)
    +        Map("part" -> None), data.logicalPlan, overwrite = false, ifNotExists = false,
    +        Map("matchByName" -> "true"))
           assert(!logical.resolved, "Should not resolve: missing partition data")
         }
       }
    +
    +  test("Insert unnamed expressions by position") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, part string)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      val expected = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" else "odd"))
    +          .toDF("id", "data", "part")
    +      val data = expected.select("id", "part")
    +
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      // should be able to insert an expression when NOT mapping columns by name
    +      spark.table("source").selectExpr("id", "part", "CONCAT('data-', id)")
    +          .write.insertInto("partitioned")
    +      checkAnswer(sql("SELECT * FROM partitioned"), expected.collect().toSeq)
    +    }
    +  }
    +
    +  test("Insert expression by name") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, part string)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      val expected = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" else "odd"))
    +          .toDF("id", "data", "part")
    +      val data = expected.select("id", "part")
    +
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      intercept[AnalysisException] {
    +        // also a problem when mapping by name
    +        spark.table("source").selectExpr("id", "part", "CONCAT('data-', id)")
    +            .write.option("matchByName", true).insertInto("partitioned")
    +      }
    +
    +      // should be able to insert an expression using AS when mapping columns by name
    +      spark.table("source").selectExpr("id", "part", "CONCAT('data-', id) as data")
    +          .write.option("matchByName", true).insertInto("partitioned")
    +      checkAnswer(sql("SELECT * FROM partitioned"), expected.collect().toSeq)
    +    }
    +  }
    +
    +  test("Reject missing columns") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, part string)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      intercept[AnalysisException] {
    +        spark.table("source").write.insertInto("partitioned")
    +      }
    +
    +      intercept[AnalysisException] {
    +        // also a problem when mapping by name
    +        spark.table("source").write.option("matchByName", true).insertInto("partitioned")
    +      }
    +    }
    +  }
    +
    +  test("Reject extra columns") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, data string, extra string, part string)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      intercept[AnalysisException] {
    +        spark.table("source").write.insertInto("partitioned")
    +      }
    +
    +      val data = (1 to 10)
    +          .map(i => (i, s"data-$i", s"${i * i}", if ((i % 2) == 0) "even" else "odd"))
    +          .toDF("id", "data", "extra", "part")
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      spark.table("source").write.option("matchByName", true).insertInto("partitioned")
    +
    +      val expected = data.select("id", "data", "part")
    +      checkAnswer(sql("SELECT * FROM partitioned"), expected.collect().toSeq)
    +    }
    +  }
    +
    +  test("Ignore names when writing by position") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (id bigint, part string, data string)") // part, data transposed
    +      sql("CREATE TABLE destination (id bigint, data string, part string)")
    +
    +      val data = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" else "odd"))
    +          .toDF("id", "data", "part")
    +
    +      // write into the reordered table by name
    +      data.write.option("matchByName", true).insertInto("source")
    +      checkAnswer(sql("SELECT id, data, part FROM source"), data.collect().toSeq)
    +
    +      val expected = data.select($"id", $"part" as "data", $"data" as "part")
    +
    +      // this produces a warning, but writes src.part -> dest.data and src.data -> dest.part
    +      spark.table("source").write.insertInto("destination")
    +      checkAnswer(sql("SELECT id, data, part FROM destination"), expected.collect().toSeq)
    +    }
    +  }
    +
    +  test("Reorder columns by name") {
    +    withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) {
    +      sql("CREATE TABLE source (data string, part string, id bigint)")
    +      sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)")
    +
    +      val data = (1 to 10).map(i => (s"data-$i", if ((i % 2) == 0) "even" else "odd", i))
    +          .toDF("data", "part", "id")
    +      data.write.insertInto("source")
    +      checkAnswer(sql("SELECT * FROM source"), data.collect().toSeq)
    +
    +      spark.table("source").write.option("matchByName", true).insertInto("partitioned")
    --- End diff --
    
    what's the difference between by-name `insertInto` and `saveAsTable`?
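
    For context, the by-name mode the question refers to, contrasted with default positional `insertInto`, as a minimal sketch grounded in the tests above (dynamic partition mode enabled, `partitioned` defined as in the test setup):
    
    ```scala
    val df = Seq((1L, "odd", "data-1")).toDF("id", "part", "data")
    
    // Positional: "part" lands in the table's `data` column and "data" in `part`.
    df.write.insertInto("partitioned")
    
    // By name: columns are matched to (id, data, part) regardless of input order.
    df.write.option("matchByName", "true").insertInto("partitioned")
    ```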




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    Really like this PR! It removes one of the Hive-specific rules! :)




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-213558556
  
    **[Test build #56708 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56708/consoleFull)** for PR 12313 at commit [`42ff7b7`](https://github.com/apache/spark/commit/42ff7b7e533f6d9620a2c5253f75780bca1805a2).
     * This patch passes all tests.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    **[Test build #59792 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59792/consoleFull)** for PR 12313 at commit [`928e21d`](https://github.com/apache/spark/commit/928e21d962104af6e8f9f1d04251a1512e3e4924).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request #12313: [SPARK-14543] [SQL] Improve InsertIntoTable colum...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12313#discussion_r65822835
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -505,6 +506,117 @@ class Analyzer(
         }
       }
     
    +  object ResolveOutputColumns extends Rule[LogicalPlan] {
    +    def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
    +      case ins @ InsertIntoTable(relation: LogicalPlan, partition, _, _, _, _)
    +          if relation.resolved && !ins.resolved =>
    +        resolveOutputColumns(ins, expectedColumns(relation, partition), relation.toString)
    +    }
    +
    +    private def resolveOutputColumns(
    +        insertInto: InsertIntoTable,
    +        columns: Seq[Attribute],
    +        relation: String) = {
    +      val resolved = if (insertInto.isMatchByName) {
    +        projectAndCastOutputColumns(columns, insertInto.child, relation)
    +      } else {
    +        castAndRenameOutputColumns(columns, insertInto.child, relation)
    +      }
    +
    +      if (resolved == insertInto.child.output) {
    +        insertInto
    +      } else {
    +        insertInto.copy(child = Project(resolved, insertInto.child))
    +      }
    +    }
    +
    +    /**
    +     * Resolves output columns by input column name, adding casts if necessary.
    +     */
    +    private def projectAndCastOutputColumns(
    +        output: Seq[Attribute],
    +        data: LogicalPlan,
    +        relation: String): Seq[NamedExpression] = {
    +      output.map { col =>
    +        data.resolveQuoted(col.name, resolver) match {
    +          case Some(inCol) if col.dataType != inCol.dataType =>
    +            Alias(UpCast(inCol, col.dataType, Seq()), col.name)()
    +          case Some(inCol) => inCol
    +          case None =>
    +            throw new AnalysisException(
    +              s"Cannot resolve ${col.name} in ${data.output.mkString(",")}")
    +        }
    +      }
    +    }
    +
    +    private def castAndRenameOutputColumns(
    +        output: Seq[Attribute],
    +        data: LogicalPlan,
    +        relation: String): Seq[NamedExpression] = {
    +      val outputNames = output.map(_.name)
    +      // incoming expressions may not have names
    +      val inputNames = data.output.flatMap(col => Option(col.name))
    +      if (output.size > data.output.size) {
    +        // always a problem
    +        throw new AnalysisException(
    +          s"""Not enough data columns to write into $relation:
    +             |Data columns: ${data.output.mkString(",")}
    +             |Table columns: ${outputNames.mkString(",")}""".stripMargin)
    +      } else if (output.size < data.output.size) {
    +        if (outputNames.toSet.subsetOf(inputNames.toSet)) {
    +          throw new AnalysisException(
    +            s"""Table column names are a subset of the input data columns:
    +               |Data columns: ${inputNames.mkString(",")}
    +               |Table columns: ${outputNames.mkString(",")}""".stripMargin)
    +        } else {
    +          // be conservative and fail if there are too many columns
    +          throw new AnalysisException(
    +            s"""Extra data columns to write into $relation:
    +               |Data columns: ${data.output.mkString(",")}
    +               |Table columns: ${outputNames.mkString(",")}""".stripMargin)
    +        }
    +      } else {
    +        // check for reordered names and warn. this may be on purpose, so it isn't an error.
    +        if (outputNames.toSet == inputNames.toSet && outputNames != inputNames) {
    +          logWarning(
    +            s"""Data column names match the table in a different order:
    +               |Data columns: ${inputNames.mkString(",")}
    +               |Table columns: ${outputNames.mkString(",")}""".stripMargin)
    +        }
    +      }
    +
    +      data.output.zip(output).map {
    +        case (in, out) if !in.dataType.sameType(out.dataType) =>
    +          Alias(Cast(in, out.dataType), out.name)()
    +        case (in, out) if in.name != out.name =>
    +          Alias(in, out.name)()
    +        case (in, _) => in
    +      }
    +    }
    +
    +    private def expectedColumns(
    +        data: LogicalPlan,
    +        partitionData: Map[String, Option[String]]): Seq[Attribute] = {
    +      data match {
    +        case partitioned: CatalogRelation =>
    +          val tablePartitionNames = partitioned.catalogTable.partitionColumns.map(_.name)
    +          val (inputPartCols, dataColumns) = data.output.partition { attr =>
    +            tablePartitionNames.contains(attr.name)
    +          }
    +          // Get the dynamic partition columns in partition order
    +          val dynamicNames = tablePartitionNames.filter(
    +            name => partitionData.getOrElse(name, None).isEmpty)
    --- End diff --
    
    `partitionData.contains`?
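
    For context, a small illustration of how the two predicates differ on a partition spec of type `Map[String, Option[String]]`: the check in the diff treats both an explicit `None` and an absent key as dynamic, while `contains` only tests key presence:
    
    ```scala
    val dynamic: Map[String, Option[String]] = Map("part" -> None)
    dynamic.getOrElse("part", None).isEmpty  // true: explicit None marks a dynamic partition
    dynamic.contains("part")                 // true: key present, value ignored
    
    val static: Map[String, Option[String]] = Map("part" -> Some("odd"))
    static.getOrElse("part", None).isEmpty   // false: a static value was supplied
    static.contains("part")                  // true: cannot distinguish static from dynamic
    
    val absent: Map[String, Option[String]] = Map.empty
    absent.getOrElse("part", None).isEmpty   // true: a missing key also counts as dynamic
    absent.contains("part")                  // false
    ```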




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    **[Test build #59790 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59790/consoleFull)** for PR 12313 at commit [`fbfdad1`](https://github.com/apache/spark/commit/fbfdad1aeb3da64562304ceedc04cc3a2825008b).




[GitHub] spark issue #12313: [SPARK-14543] [SQL] Improve InsertIntoTable column resol...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12313
  
    **[Test build #60327 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60327/consoleFull)** for PR 12313 at commit [`906e68d`](https://github.com/apache/spark/commit/906e68d071daf4e2f15b0f2017b248b872bb6285).




[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12313#discussion_r63908615
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -498,6 +499,117 @@ class Analyzer(
         }
       }
     
    +  val ResolveOutputColumns = new ResolveOutputColumns
    +  class ResolveOutputColumns extends Rule[LogicalPlan] {
    --- End diff --
    
    I'll fix this and the `protected` method. I think I was instantiating this rule multiple times at one point, and this is left over.




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-212572461
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-14543] [SQL] Fix InsertIntoTable column...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-213205807
  
    Build finished. Test FAILed.

