You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by fjh100456 <gi...@git.apache.org> on 2018/10/12 09:50:33 UTC
[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...
GitHub user fjh100456 opened a pull request:
https://github.com/apache/spark/pull/22707
[SPARK-25717][SQL] Insert overwrite a recreated external and partitioned table may result in incorrect query results
## What changes were proposed in this pull request?
Consider the following scenario:
```
spark.range(100).createTempView("temp")
(0 until 3).foreach { _ =>
spark.sql("drop table if exists tableA")
spark.sql("create table if not exists tableA(a int) partitioned by (p int) location 'file:/e:/study/warehouse/tableA'")
spark.sql("insert overwrite table tableA partition(p=1) select * from temp")
spark.sql("select count(1) from tableA where p=1").show
}
```
We expect the count always be 100, but the actual results are as follows:
```
+--------+
|count(1)|
+--------+
| 100|
+--------+
+--------+
|count(1)|
+--------+
| 200|
+--------+
+--------+
|count(1)|
+--------+
| 300|
+--------+
```
when spark executes an `insert overwrite` command, it gets the historical partition first, and then delete it from fileSystem.
But for recreated external and partitioned table, the partitions were all deleted by the `drop table` command with data unremoved. So the historical data is preserved which lead to the query results incorrect.
## How was this patch tested?
Manual test.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/fjh100456/spark InsertOverwriteCommit
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22707.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22707
----
commit 8e1b1be1ac7f7cc4fe31ccf99fb92dd7d8fc8918
Author: fjh100456 <fu...@...>
Date: 2018-10-12T09:45:39Z
[SPARK-25717][SQL] Insert overwrite a recreated external and partitioned table may result in incorrect query results.
## What changes were proposed in this pull request?
when spark executes an `insert overwrite` command, it gets the historical partition first, and then delete it from fileSystem.
But for recreated external and partitioned table, the partitions were all deleted by the `drop table` command with data unremoved. So the historical data is preserved which lead to the query results incorrect.
## How was this patch tested?
Manual test.
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...
Posted by fjh100456 <gi...@git.apache.org>.
Github user fjh100456 commented on a diff in the pull request:
https://github.com/apache/spark/pull/22707#discussion_r240077883
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala ---
@@ -774,4 +774,23 @@ class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter
}
}
}
+
+ test("SPARK-25717: Insert overwrite a recreated external and partitioned table "
--- End diff --
Ok, thank you.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22707
**[Test build #99860 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99860/testReport)** for PR 22707 at commit [`89f7223`](https://github.com/apache/spark/commit/89f7223593b171133bb716b0004aa13a592dee6a).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...
Posted by fjh100456 <gi...@git.apache.org>.
Github user fjh100456 commented on a diff in the pull request:
https://github.com/apache/spark/pull/22707#discussion_r240139057
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -227,18 +227,22 @@ case class InsertIntoHiveTable(
// Newer Hive largely improves insert overwrite performance. As Spark uses older Hive
// version and we may not want to catch up new Hive version every time. We delete the
// Hive partition first and then load data file into the Hive partition.
- if (oldPart.nonEmpty && overwrite) {
- oldPart.get.storage.locationUri.foreach { uri =>
- val partitionPath = new Path(uri)
- val fs = partitionPath.getFileSystem(hadoopConf)
- if (fs.exists(partitionPath)) {
- if (!fs.delete(partitionPath, true)) {
- throw new RuntimeException(
- "Cannot remove partition directory '" + partitionPath.toString)
- }
- // Don't let Hive do overwrite operation since it is slower.
- doHiveOverwrite = false
+ if (overwrite) {
+ val oldPartitionPath = oldPart.flatMap(_.storage.locationUri.map(new Path(_)))
+ .getOrElse {
+ ExternalCatalogUtils.generatePartitionPath(
+ partitionSpec,
+ partitionColumnNames,
+ new Path(table.location))
--- End diff --
For this advince, I had test it manually. If we do not consider `oldPart`, the `oldPartitionPath ` will be a default partition path under the table directory, as a contrast, the original way will be the real partition path. So it will be different.
Can we keep it like this?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...
Posted by fjh100456 <gi...@git.apache.org>.
Github user fjh100456 commented on the issue:
https://github.com/apache/spark/pull/22707
Is there any more suggestions? @wangyum @viirya
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/22707#discussion_r239997805
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala ---
@@ -774,4 +774,23 @@ class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter
}
}
}
+
+ test("SPARK-25717: Insert overwrite a recreated external and partitioned table "
+ + "should remove the historical partition first") {
+ withTempDir { tmpDir =>
+ withTable("test_table") {
+ (0 until 3).foreach { _ =>
+ sql("DROP TABLE IF EXISTS test_table")
+ sql(
+ s"""
+ |CREATE EXTERNAL TABLE test_table (key int)
+ |PARTITIONED BY (p int)
+ |LOCATION '${tmpDir.toURI.toString.stripSuffix("/")}/test_table'
+ """.stripMargin)
+ sql("INSERT OVERWRITE TABLE test_table PARTITION(p=1) SELECT 1")
+ }
+ checkAnswer(sql("SELECT COUNT(*) FROM test_table"), Row(1))
+ }
+ }
--- End diff --
How about wrigint tests for this?;
```
withTempDir { tmpDir =>
withTable("test_table") {
// Prepare table data
sql(
s"""
|CREATE EXTERNAL TABLE test_table(key int)
|PARTITIONED BY (p int)
|LOCATION '${tmpDir.toURI}'
""".stripMargin)
sql("INSERT INTO test_table PARTITION(p=1) SELECT 1")
checkAnswer(sql("SELECT COUNT(*) FROM test_table"), Row(1))
// Run the test...
sql("DROP TABLE test_table")
sql(
s"""
|CREATE EXTERNAL TABLE test_table(key int)
|PARTITIONED BY (p int)
|LOCATION '${tmpDir.toURI}'
""".stripMargin)
sql("INSERT OVERWRITE TABLE test_table PARTITION(p=1) SELECT 1")
checkAnswer(sql("SELECT COUNT(*) FROM test_table"), Row(1))
}
}
```
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...
Posted by fjh100456 <gi...@git.apache.org>.
Github user fjh100456 commented on the issue:
https://github.com/apache/spark/pull/22707
@wangyum I have added the test. Thank you.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/22707#discussion_r239996822
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala ---
@@ -774,4 +774,23 @@ class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter
}
}
}
+
+ test("SPARK-25717: Insert overwrite a recreated external and partitioned table "
+ + "should remove the historical partition first") {
+ withTempDir { tmpDir =>
+ withTable("test_table") {
+ (0 until 3).foreach { _ =>
+ sql("DROP TABLE IF EXISTS test_table")
+ sql(
+ s"""
+ |CREATE EXTERNAL TABLE test_table (key int)
+ |PARTITIONED BY (p int)
+ |LOCATION '${tmpDir.toURI.toString.stripSuffix("/")}/test_table'
--- End diff --
nit: `|LOCATION '${tmpDir.toURI}'`?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22707
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...
Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:
https://github.com/apache/spark/pull/22707#discussion_r225589876
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -227,18 +227,22 @@ case class InsertIntoHiveTable(
// Newer Hive largely improves insert overwrite performance. As Spark uses older Hive
// version and we may not want to catch up new Hive version every time. We delete the
// Hive partition first and then load data file into the Hive partition.
- if (oldPart.nonEmpty && overwrite) {
- oldPart.get.storage.locationUri.foreach { uri =>
- val partitionPath = new Path(uri)
- val fs = partitionPath.getFileSystem(hadoopConf)
- if (fs.exists(partitionPath)) {
- if (!fs.delete(partitionPath, true)) {
- throw new RuntimeException(
- "Cannot remove partition directory '" + partitionPath.toString)
- }
- // Don't let Hive do overwrite operation since it is slower.
- doHiveOverwrite = false
+ if (overwrite) {
+ val oldPartitionPath = oldPart.flatMap(_.storage.locationUri.map(new Path(_)))
+ .getOrElse {
+ ExternalCatalogUtils.generatePartitionPath(
+ partitionSpec,
+ partitionColumnNames,
+ HiveClientImpl.toHiveTable(table).getDataLocation)
--- End diff --
`HiveClientImpl.toHiveTable(table).getDataLocation` -> `new Path(table.location)`?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22707
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/22707#discussion_r240073151
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala ---
@@ -774,4 +774,23 @@ class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter
}
}
}
+
+ test("SPARK-25717: Insert overwrite a recreated external and partitioned table "
--- End diff --
How about `SPARK-25717: Insert overwrites remove old partition dirs correctly`?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22707
**[Test build #99905 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99905/testReport)** for PR 22707 at commit [`327ecd9`](https://github.com/apache/spark/commit/327ecd9f58e8bcd42cdfa2674fd5108907a27b15).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...
Posted by heary-cao <gi...@git.apache.org>.
Github user heary-cao commented on the issue:
https://github.com/apache/spark/pull/22707
cc @wangyum
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22707
**[Test build #99860 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99860/testReport)** for PR 22707 at commit [`89f7223`](https://github.com/apache/spark/commit/89f7223593b171133bb716b0004aa13a592dee6a).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/22707
ok to test
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...
Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:
https://github.com/apache/spark/pull/22707#discussion_r225590134
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -227,18 +227,22 @@ case class InsertIntoHiveTable(
// Newer Hive largely improves insert overwrite performance. As Spark uses older Hive
// version and we may not want to catch up new Hive version every time. We delete the
// Hive partition first and then load data file into the Hive partition.
- if (oldPart.nonEmpty && overwrite) {
- oldPart.get.storage.locationUri.foreach { uri =>
- val partitionPath = new Path(uri)
- val fs = partitionPath.getFileSystem(hadoopConf)
- if (fs.exists(partitionPath)) {
- if (!fs.delete(partitionPath, true)) {
- throw new RuntimeException(
- "Cannot remove partition directory '" + partitionPath.toString)
- }
- // Don't let Hive do overwrite operation since it is slower.
- doHiveOverwrite = false
+ if (overwrite) {
+ val oldPartitionPath = oldPart.flatMap(_.storage.locationUri.map(new Path(_)))
+ .getOrElse {
+ ExternalCatalogUtils.generatePartitionPath(
+ partitionSpec,
+ partitionColumnNames,
+ HiveClientImpl.toHiveTable(table).getDataLocation)
--- End diff --
cc @viirya What do you think of this change?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22707
**[Test build #99905 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99905/testReport)** for PR 22707 at commit [`327ecd9`](https://github.com/apache/spark/commit/327ecd9f58e8bcd42cdfa2674fd5108907a27b15).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22707
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5918/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22707
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22707
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99905/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22707
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5878/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/22707#discussion_r239996528
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala ---
@@ -774,4 +774,23 @@ class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter
}
}
}
+
+ test("SPARK-25717: Insert overwrite a recreated external and partitioned table "
--- End diff --
Can you make the title shorter in a single line?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22707
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99860/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22707
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22707
Can one of the admins verify this patch?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22707
Can one of the admins verify this patch?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...
Posted by fjh100456 <gi...@git.apache.org>.
Github user fjh100456 commented on a diff in the pull request:
https://github.com/apache/spark/pull/22707#discussion_r240070378
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -227,18 +227,22 @@ case class InsertIntoHiveTable(
// Newer Hive largely improves insert overwrite performance. As Spark uses older Hive
// version and we may not want to catch up new Hive version every time. We delete the
// Hive partition first and then load data file into the Hive partition.
- if (oldPart.nonEmpty && overwrite) {
- oldPart.get.storage.locationUri.foreach { uri =>
- val partitionPath = new Path(uri)
- val fs = partitionPath.getFileSystem(hadoopConf)
- if (fs.exists(partitionPath)) {
- if (!fs.delete(partitionPath, true)) {
- throw new RuntimeException(
- "Cannot remove partition directory '" + partitionPath.toString)
- }
- // Don't let Hive do overwrite operation since it is slower.
- doHiveOverwrite = false
+ if (overwrite) {
+ val oldPartitionPath = oldPart.flatMap(_.storage.locationUri.map(new Path(_)))
+ .getOrElse {
+ ExternalCatalogUtils.generatePartitionPath(
+ partitionSpec,
+ partitionColumnNames,
+ new Path(table.location))
--- End diff --
Oops, seems it's a mistake. The `oldPart` is empty.
Thank you very much, I'll change the code.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...
Posted by fjh100456 <gi...@git.apache.org>.
Github user fjh100456 commented on a diff in the pull request:
https://github.com/apache/spark/pull/22707#discussion_r240067762
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala ---
@@ -774,4 +774,23 @@ class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter
}
}
}
+
+ test("SPARK-25717: Insert overwrite a recreated external and partitioned table "
--- End diff --
```SPARK-25717: Insert overwrite a recreated external table should remove old-part first```
Would it be ok?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/22707#discussion_r240002357
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -227,18 +227,22 @@ case class InsertIntoHiveTable(
// Newer Hive largely improves insert overwrite performance. As Spark uses older Hive
// version and we may not want to catch up new Hive version every time. We delete the
// Hive partition first and then load data file into the Hive partition.
- if (oldPart.nonEmpty && overwrite) {
- oldPart.get.storage.locationUri.foreach { uri =>
- val partitionPath = new Path(uri)
- val fs = partitionPath.getFileSystem(hadoopConf)
- if (fs.exists(partitionPath)) {
- if (!fs.delete(partitionPath, true)) {
- throw new RuntimeException(
- "Cannot remove partition directory '" + partitionPath.toString)
- }
- // Don't let Hive do overwrite operation since it is slower.
- doHiveOverwrite = false
+ if (overwrite) {
+ val oldPartitionPath = oldPart.flatMap(_.storage.locationUri.map(new Path(_)))
+ .getOrElse {
+ ExternalCatalogUtils.generatePartitionPath(
+ partitionSpec,
+ partitionColumnNames,
+ new Path(table.location))
--- End diff --
We still need to consider the old path, `oldPart`? Can't we write this?;
```
val oldPartitionPath = ExternalCatalogUtils.generatePartitionPath(
partitionSpec,
partitionColumnNames,
new Path(table.location))
```
Also, can you write a comment about how to solve this issue here and in the pr description?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...
Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:
https://github.com/apache/spark/pull/22707
Could you add test case for this change?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22707
Can one of the admins verify this patch?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...
Posted by fjh100456 <gi...@git.apache.org>.
Github user fjh100456 commented on a diff in the pull request:
https://github.com/apache/spark/pull/22707#discussion_r240068636
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala ---
@@ -774,4 +774,23 @@ class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter
}
}
}
+
+ test("SPARK-25717: Insert overwrite a recreated external and partitioned table "
+ + "should remove the historical partition first") {
+ withTempDir { tmpDir =>
+ withTable("test_table") {
+ (0 until 3).foreach { _ =>
+ sql("DROP TABLE IF EXISTS test_table")
+ sql(
+ s"""
+ |CREATE EXTERNAL TABLE test_table (key int)
+ |PARTITIONED BY (p int)
+ |LOCATION '${tmpDir.toURI.toString.stripSuffix("/")}/test_table'
+ """.stripMargin)
+ sql("INSERT OVERWRITE TABLE test_table PARTITION(p=1) SELECT 1")
+ }
+ checkAnswer(sql("SELECT COUNT(*) FROM test_table"), Row(1))
+ }
+ }
--- End diff --
It will make the test case a little longer, but I'm ok.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/22707#discussion_r225759293
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -227,18 +227,22 @@ case class InsertIntoHiveTable(
// Newer Hive largely improves insert overwrite performance. As Spark uses older Hive
// version and we may not want to catch up new Hive version every time. We delete the
// Hive partition first and then load data file into the Hive partition.
- if (oldPart.nonEmpty && overwrite) {
- oldPart.get.storage.locationUri.foreach { uri =>
- val partitionPath = new Path(uri)
- val fs = partitionPath.getFileSystem(hadoopConf)
- if (fs.exists(partitionPath)) {
- if (!fs.delete(partitionPath, true)) {
- throw new RuntimeException(
- "Cannot remove partition directory '" + partitionPath.toString)
- }
- // Don't let Hive do overwrite operation since it is slower.
- doHiveOverwrite = false
+ if (overwrite) {
+ val oldPartitionPath = oldPart.flatMap(_.storage.locationUri.map(new Path(_)))
+ .getOrElse {
+ ExternalCatalogUtils.generatePartitionPath(
+ partitionSpec,
+ partitionColumnNames,
+ HiveClientImpl.toHiveTable(table).getDataLocation)
--- End diff --
Looks correct as I saw we assign `CatalogTable.storage.locationUr` to HiveTable's data location.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...
Posted by fjh100456 <gi...@git.apache.org>.
Github user fjh100456 commented on a diff in the pull request:
https://github.com/apache/spark/pull/22707#discussion_r225756219
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -227,18 +227,22 @@ case class InsertIntoHiveTable(
// Newer Hive largely improves insert overwrite performance. As Spark uses older Hive
// version and we may not want to catch up new Hive version every time. We delete the
// Hive partition first and then load data file into the Hive partition.
- if (oldPart.nonEmpty && overwrite) {
- oldPart.get.storage.locationUri.foreach { uri =>
- val partitionPath = new Path(uri)
- val fs = partitionPath.getFileSystem(hadoopConf)
- if (fs.exists(partitionPath)) {
- if (!fs.delete(partitionPath, true)) {
- throw new RuntimeException(
- "Cannot remove partition directory '" + partitionPath.toString)
- }
- // Don't let Hive do overwrite operation since it is slower.
- doHiveOverwrite = false
+ if (overwrite) {
+ val oldPartitionPath = oldPart.flatMap(_.storage.locationUri.map(new Path(_)))
+ .getOrElse {
+ ExternalCatalogUtils.generatePartitionPath(
+ partitionSpec,
+ partitionColumnNames,
+ HiveClientImpl.toHiveTable(table).getDataLocation)
--- End diff --
>
>
> `HiveClientImpl.toHiveTable(table).getDataLocation` -> `new Path(table.location)`?
Yes, they get the same value. I'll change it , thank you very much.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org