You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by fjh100456 <gi...@git.apache.org> on 2018/10/12 09:50:33 UTC

[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...

GitHub user fjh100456 opened a pull request:

    https://github.com/apache/spark/pull/22707

    [SPARK-25717][SQL] Insert overwrite a recreated external and partitioned table may result in incorrect query results

    ## What changes were proposed in this pull request?
    
    Consider the following scenario:
    
    ```
    spark.range(100).createTempView("temp")
    (0 until 3).foreach { _ =>
      spark.sql("drop table if exists tableA")
      spark.sql("create table if not exists tableA(a int) partitioned by (p int) location 'file:/e:/study/warehouse/tableA'")
      spark.sql("insert overwrite table tableA partition(p=1) select * from temp")
      spark.sql("select count(1) from tableA where p=1").show
    }
    ```
    
    We expect the count always be 100, but the actual results are as follows:
    
    ```
    +--------+
    |count(1)|
    +--------+
    |     100|
    +--------+
    
    +--------+
    |count(1)|
    +--------+
    |   200|
    +--------+
    
    +--------+
    |count(1)|
    +--------+
    |   300|
    +--------+
    ```
    
    when spark executes an `insert overwrite` command,  it gets the historical partition first, and then delete it from fileSystem.
    
    But for recreated external and partitioned table, the partitions were all deleted by the `drop  table` command with data unremoved. So the historical data is preserved which lead to the query results incorrect.
    
    ## How was this patch tested?
    Manual test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/fjh100456/spark InsertOverwriteCommit

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22707.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22707
    
----
commit 8e1b1be1ac7f7cc4fe31ccf99fb92dd7d8fc8918
Author: fjh100456 <fu...@...>
Date:   2018-10-12T09:45:39Z

    [SPARK-25717][SQL] Insert overwrite a recreated external and partitioned table may result in incorrect query results.
    
    ## What changes were proposed in this pull request?
    
    when spark executes an `insert overwrite` command,  it gets the historical partition first, and then delete it from fileSystem.
    
    But for recreated external and partitioned table, the partitions were all deleted by the `drop  table` command with data unremoved. So the historical data is preserved which lead to the query results incorrect.
    
    ## How was this patch tested?
    Manual test.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...

Posted by fjh100456 <gi...@git.apache.org>.
Github user fjh100456 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22707#discussion_r240077883
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala ---
    @@ -774,4 +774,23 @@ class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter
           }
         }
       }
    +
    +  test("SPARK-25717: Insert overwrite a recreated external and partitioned table "
    --- End diff --
    
    Ok, thank you.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22707
  
    **[Test build #99860 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99860/testReport)** for PR 22707 at commit [`89f7223`](https://github.com/apache/spark/commit/89f7223593b171133bb716b0004aa13a592dee6a).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...

Posted by fjh100456 <gi...@git.apache.org>.
Github user fjh100456 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22707#discussion_r240139057
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -227,18 +227,22 @@ case class InsertIntoHiveTable(
               // Newer Hive largely improves insert overwrite performance. As Spark uses older Hive
               // version and we may not want to catch up new Hive version every time. We delete the
               // Hive partition first and then load data file into the Hive partition.
    -          if (oldPart.nonEmpty && overwrite) {
    -            oldPart.get.storage.locationUri.foreach { uri =>
    -              val partitionPath = new Path(uri)
    -              val fs = partitionPath.getFileSystem(hadoopConf)
    -              if (fs.exists(partitionPath)) {
    -                if (!fs.delete(partitionPath, true)) {
    -                  throw new RuntimeException(
    -                    "Cannot remove partition directory '" + partitionPath.toString)
    -                }
    -                // Don't let Hive do overwrite operation since it is slower.
    -                doHiveOverwrite = false
    +          if (overwrite) {
    +            val oldPartitionPath = oldPart.flatMap(_.storage.locationUri.map(new Path(_)))
    +              .getOrElse {
    +                ExternalCatalogUtils.generatePartitionPath(
    +                  partitionSpec,
    +                  partitionColumnNames,
    +                  new Path(table.location))
    --- End diff --
    
    For this advince, I had test it manually. If we do not consider `oldPart`,  the `oldPartitionPath ` will be a default partition path under the table directory, as a contrast, the original way will be the real partition path. So it will be different.
    Can we keep it like this?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...

Posted by fjh100456 <gi...@git.apache.org>.
Github user fjh100456 commented on the issue:

    https://github.com/apache/spark/pull/22707
  
    Is there any more suggestions? @wangyum @viirya 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22707#discussion_r239997805
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala ---
    @@ -774,4 +774,23 @@ class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter
           }
         }
       }
    +
    +  test("SPARK-25717: Insert overwrite a recreated external and partitioned table "
    +    + "should remove the historical partition first") {
    +    withTempDir { tmpDir =>
    +      withTable("test_table") {
    +        (0 until 3).foreach { _ =>
    +          sql("DROP TABLE IF EXISTS test_table")
    +          sql(
    +            s"""
    +               |CREATE EXTERNAL TABLE test_table (key int)
    +               |PARTITIONED BY (p int)
    +               |LOCATION '${tmpDir.toURI.toString.stripSuffix("/")}/test_table'
    +           """.stripMargin)
    +          sql("INSERT OVERWRITE TABLE test_table PARTITION(p=1) SELECT 1")
    +        }
    +        checkAnswer(sql("SELECT COUNT(*) FROM test_table"), Row(1))
    +      }
    +    }
    --- End diff --
    
    How about wrigint tests for this?;
    ```
        withTempDir { tmpDir =>
          withTable("test_table") {
            // Prepare table data
            sql(
              s"""
                 |CREATE EXTERNAL TABLE test_table(key int)
                 |PARTITIONED BY (p int)
                 |LOCATION '${tmpDir.toURI}'
             """.stripMargin)
            sql("INSERT INTO test_table PARTITION(p=1) SELECT 1")
            checkAnswer(sql("SELECT COUNT(*) FROM test_table"), Row(1))
    
            // Run the test...
            sql("DROP TABLE test_table")
            sql(
              s"""
                 |CREATE EXTERNAL TABLE test_table(key int)
                 |PARTITIONED BY (p int)
                 |LOCATION '${tmpDir.toURI}'
             """.stripMargin)
            sql("INSERT OVERWRITE TABLE test_table PARTITION(p=1) SELECT 1")
            checkAnswer(sql("SELECT COUNT(*) FROM test_table"), Row(1))
          }
        }
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...

Posted by fjh100456 <gi...@git.apache.org>.
Github user fjh100456 commented on the issue:

    https://github.com/apache/spark/pull/22707
  
    @wangyum I have added the test. Thank you.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22707#discussion_r239996822
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala ---
    @@ -774,4 +774,23 @@ class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter
           }
         }
       }
    +
    +  test("SPARK-25717: Insert overwrite a recreated external and partitioned table "
    +    + "should remove the historical partition first") {
    +    withTempDir { tmpDir =>
    +      withTable("test_table") {
    +        (0 until 3).foreach { _ =>
    +          sql("DROP TABLE IF EXISTS test_table")
    +          sql(
    +            s"""
    +               |CREATE EXTERNAL TABLE test_table (key int)
    +               |PARTITIONED BY (p int)
    +               |LOCATION '${tmpDir.toURI.toString.stripSuffix("/")}/test_table'
    --- End diff --
    
    nit: `|LOCATION '${tmpDir.toURI}'`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22707
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22707#discussion_r225589876
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -227,18 +227,22 @@ case class InsertIntoHiveTable(
               // Newer Hive largely improves insert overwrite performance. As Spark uses older Hive
               // version and we may not want to catch up new Hive version every time. We delete the
               // Hive partition first and then load data file into the Hive partition.
    -          if (oldPart.nonEmpty && overwrite) {
    -            oldPart.get.storage.locationUri.foreach { uri =>
    -              val partitionPath = new Path(uri)
    -              val fs = partitionPath.getFileSystem(hadoopConf)
    -              if (fs.exists(partitionPath)) {
    -                if (!fs.delete(partitionPath, true)) {
    -                  throw new RuntimeException(
    -                    "Cannot remove partition directory '" + partitionPath.toString)
    -                }
    -                // Don't let Hive do overwrite operation since it is slower.
    -                doHiveOverwrite = false
    +          if (overwrite) {
    +            val oldPartitionPath = oldPart.flatMap(_.storage.locationUri.map(new Path(_)))
    +              .getOrElse {
    +                ExternalCatalogUtils.generatePartitionPath(
    +                  partitionSpec,
    +                  partitionColumnNames,
    +                  HiveClientImpl.toHiveTable(table).getDataLocation)
    --- End diff --
    
    `HiveClientImpl.toHiveTable(table).getDataLocation` -> `new Path(table.location)`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22707
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22707#discussion_r240073151
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala ---
    @@ -774,4 +774,23 @@ class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter
           }
         }
       }
    +
    +  test("SPARK-25717: Insert overwrite a recreated external and partitioned table "
    --- End diff --
    
    How about `SPARK-25717: Insert overwrites remove old partition dirs correctly`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22707
  
    **[Test build #99905 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99905/testReport)** for PR 22707 at commit [`327ecd9`](https://github.com/apache/spark/commit/327ecd9f58e8bcd42cdfa2674fd5108907a27b15).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...

Posted by heary-cao <gi...@git.apache.org>.
Github user heary-cao commented on the issue:

    https://github.com/apache/spark/pull/22707
  
    cc @wangyum


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22707
  
    **[Test build #99860 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99860/testReport)** for PR 22707 at commit [`89f7223`](https://github.com/apache/spark/commit/89f7223593b171133bb716b0004aa13a592dee6a).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/22707
  
    ok to test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22707#discussion_r225590134
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -227,18 +227,22 @@ case class InsertIntoHiveTable(
               // Newer Hive largely improves insert overwrite performance. As Spark uses older Hive
               // version and we may not want to catch up new Hive version every time. We delete the
               // Hive partition first and then load data file into the Hive partition.
    -          if (oldPart.nonEmpty && overwrite) {
    -            oldPart.get.storage.locationUri.foreach { uri =>
    -              val partitionPath = new Path(uri)
    -              val fs = partitionPath.getFileSystem(hadoopConf)
    -              if (fs.exists(partitionPath)) {
    -                if (!fs.delete(partitionPath, true)) {
    -                  throw new RuntimeException(
    -                    "Cannot remove partition directory '" + partitionPath.toString)
    -                }
    -                // Don't let Hive do overwrite operation since it is slower.
    -                doHiveOverwrite = false
    +          if (overwrite) {
    +            val oldPartitionPath = oldPart.flatMap(_.storage.locationUri.map(new Path(_)))
    +              .getOrElse {
    +                ExternalCatalogUtils.generatePartitionPath(
    +                  partitionSpec,
    +                  partitionColumnNames,
    +                  HiveClientImpl.toHiveTable(table).getDataLocation)
    --- End diff --
    
    cc @viirya What do you think of this change?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22707
  
    **[Test build #99905 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99905/testReport)** for PR 22707 at commit [`327ecd9`](https://github.com/apache/spark/commit/327ecd9f58e8bcd42cdfa2674fd5108907a27b15).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22707
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5918/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22707
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22707
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99905/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22707
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5878/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22707#discussion_r239996528
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala ---
    @@ -774,4 +774,23 @@ class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter
           }
         }
       }
    +
    +  test("SPARK-25717: Insert overwrite a recreated external and partitioned table "
    --- End diff --
    
    Can you make the title shorter in a single line?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22707
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99860/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22707
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22707
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22707
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...

Posted by fjh100456 <gi...@git.apache.org>.
Github user fjh100456 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22707#discussion_r240070378
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -227,18 +227,22 @@ case class InsertIntoHiveTable(
               // Newer Hive largely improves insert overwrite performance. As Spark uses older Hive
               // version and we may not want to catch up new Hive version every time. We delete the
               // Hive partition first and then load data file into the Hive partition.
    -          if (oldPart.nonEmpty && overwrite) {
    -            oldPart.get.storage.locationUri.foreach { uri =>
    -              val partitionPath = new Path(uri)
    -              val fs = partitionPath.getFileSystem(hadoopConf)
    -              if (fs.exists(partitionPath)) {
    -                if (!fs.delete(partitionPath, true)) {
    -                  throw new RuntimeException(
    -                    "Cannot remove partition directory '" + partitionPath.toString)
    -                }
    -                // Don't let Hive do overwrite operation since it is slower.
    -                doHiveOverwrite = false
    +          if (overwrite) {
    +            val oldPartitionPath = oldPart.flatMap(_.storage.locationUri.map(new Path(_)))
    +              .getOrElse {
    +                ExternalCatalogUtils.generatePartitionPath(
    +                  partitionSpec,
    +                  partitionColumnNames,
    +                  new Path(table.location))
    --- End diff --
    
    Oops, seems it's a mistake. The `oldPart` is empty.
    
    Thank you very much,  I'll change the code.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...

Posted by fjh100456 <gi...@git.apache.org>.
Github user fjh100456 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22707#discussion_r240067762
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala ---
    @@ -774,4 +774,23 @@ class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter
           }
         }
       }
    +
    +  test("SPARK-25717: Insert overwrite a recreated external and partitioned table "
    --- End diff --
    
    ```SPARK-25717: Insert overwrite a recreated external table should remove old-part first```
    
    Would it be ok?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22707#discussion_r240002357
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -227,18 +227,22 @@ case class InsertIntoHiveTable(
               // Newer Hive largely improves insert overwrite performance. As Spark uses older Hive
               // version and we may not want to catch up new Hive version every time. We delete the
               // Hive partition first and then load data file into the Hive partition.
    -          if (oldPart.nonEmpty && overwrite) {
    -            oldPart.get.storage.locationUri.foreach { uri =>
    -              val partitionPath = new Path(uri)
    -              val fs = partitionPath.getFileSystem(hadoopConf)
    -              if (fs.exists(partitionPath)) {
    -                if (!fs.delete(partitionPath, true)) {
    -                  throw new RuntimeException(
    -                    "Cannot remove partition directory '" + partitionPath.toString)
    -                }
    -                // Don't let Hive do overwrite operation since it is slower.
    -                doHiveOverwrite = false
    +          if (overwrite) {
    +            val oldPartitionPath = oldPart.flatMap(_.storage.locationUri.map(new Path(_)))
    +              .getOrElse {
    +                ExternalCatalogUtils.generatePartitionPath(
    +                  partitionSpec,
    +                  partitionColumnNames,
    +                  new Path(table.location))
    --- End diff --
    
    We still need to consider the old path, `oldPart`? Can't we write this?;
    ```
                val oldPartitionPath = ExternalCatalogUtils.generatePartitionPath(
                  partitionSpec,
                  partitionColumnNames,
                  new Path(table.location))
    ```
    Also, can you write a comment about how to solve this issue here and in the pr description?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/22707
  
    Could you add test case for this change?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22707
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...

Posted by fjh100456 <gi...@git.apache.org>.
Github user fjh100456 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22707#discussion_r240068636
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala ---
    @@ -774,4 +774,23 @@ class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter
           }
         }
       }
    +
    +  test("SPARK-25717: Insert overwrite a recreated external and partitioned table "
    +    + "should remove the historical partition first") {
    +    withTempDir { tmpDir =>
    +      withTable("test_table") {
    +        (0 until 3).foreach { _ =>
    +          sql("DROP TABLE IF EXISTS test_table")
    +          sql(
    +            s"""
    +               |CREATE EXTERNAL TABLE test_table (key int)
    +               |PARTITIONED BY (p int)
    +               |LOCATION '${tmpDir.toURI.toString.stripSuffix("/")}/test_table'
    +           """.stripMargin)
    +          sql("INSERT OVERWRITE TABLE test_table PARTITION(p=1) SELECT 1")
    +        }
    +        checkAnswer(sql("SELECT COUNT(*) FROM test_table"), Row(1))
    +      }
    +    }
    --- End diff --
    
    It will make the test case a little longer, but I'm ok.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22707#discussion_r225759293
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -227,18 +227,22 @@ case class InsertIntoHiveTable(
               // Newer Hive largely improves insert overwrite performance. As Spark uses older Hive
               // version and we may not want to catch up new Hive version every time. We delete the
               // Hive partition first and then load data file into the Hive partition.
    -          if (oldPart.nonEmpty && overwrite) {
    -            oldPart.get.storage.locationUri.foreach { uri =>
    -              val partitionPath = new Path(uri)
    -              val fs = partitionPath.getFileSystem(hadoopConf)
    -              if (fs.exists(partitionPath)) {
    -                if (!fs.delete(partitionPath, true)) {
    -                  throw new RuntimeException(
    -                    "Cannot remove partition directory '" + partitionPath.toString)
    -                }
    -                // Don't let Hive do overwrite operation since it is slower.
    -                doHiveOverwrite = false
    +          if (overwrite) {
    +            val oldPartitionPath = oldPart.flatMap(_.storage.locationUri.map(new Path(_)))
    +              .getOrElse {
    +                ExternalCatalogUtils.generatePartitionPath(
    +                  partitionSpec,
    +                  partitionColumnNames,
    +                  HiveClientImpl.toHiveTable(table).getDataLocation)
    --- End diff --
    
    Looks correct as I saw we assign `CatalogTable.storage.locationUr` to HiveTable's data location.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...

Posted by fjh100456 <gi...@git.apache.org>.
Github user fjh100456 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22707#discussion_r225756219
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
    @@ -227,18 +227,22 @@ case class InsertIntoHiveTable(
               // Newer Hive largely improves insert overwrite performance. As Spark uses older Hive
               // version and we may not want to catch up new Hive version every time. We delete the
               // Hive partition first and then load data file into the Hive partition.
    -          if (oldPart.nonEmpty && overwrite) {
    -            oldPart.get.storage.locationUri.foreach { uri =>
    -              val partitionPath = new Path(uri)
    -              val fs = partitionPath.getFileSystem(hadoopConf)
    -              if (fs.exists(partitionPath)) {
    -                if (!fs.delete(partitionPath, true)) {
    -                  throw new RuntimeException(
    -                    "Cannot remove partition directory '" + partitionPath.toString)
    -                }
    -                // Don't let Hive do overwrite operation since it is slower.
    -                doHiveOverwrite = false
    +          if (overwrite) {
    +            val oldPartitionPath = oldPart.flatMap(_.storage.locationUri.map(new Path(_)))
    +              .getOrElse {
    +                ExternalCatalogUtils.generatePartitionPath(
    +                  partitionSpec,
    +                  partitionColumnNames,
    +                  HiveClientImpl.toHiveTable(table).getDataLocation)
    --- End diff --
    
    > 
    > 
    > `HiveClientImpl.toHiveTable(table).getDataLocation` -> `new Path(table.location)`?
    
    Yes, they get the same value. I'll change it , thank you very much.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org