You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/02/06 20:34:59 UTC

[GitHub] [spark] MaxGekk opened a new pull request #31499: [WIP][SPARK-31891][SQL] Support MSCK REPAIR TABLE .. (ADD|DROP|SYNC) PARTITIONS

MaxGekk opened a new pull request #31499:
URL: https://github.com/apache/spark/pull/31499


   ### What changes were proposed in this pull request?
   TODO
   
   Closes #31097
   
   ### Why are the changes needed?
   The changes allow to recover tables with removed partitions. The example below portraits the problem:
   ```sql
   spark-sql> create table tbl2 (col int, part int) partitioned by (part);
   spark-sql> insert into tbl2 partition (part=1) select 1;
   spark-sql> insert into tbl2 partition (part=0) select 0;
   spark-sql> show table extended like 'tbl2' partition (part = 0);
   default	tbl2	false	Partition Values: [part=0]
   Location: file:/Users/maximgekk/proj/apache-spark/spark-warehouse/tbl2/part=0
   ...
   ```
   Remove the partition (part = 0) from the filesystem:
   ```
   $ rm -rf /Users/maximgekk/proj/apache-spark/spark-warehouse/tbl2/part=0
   ```
   Even after recovering, we cannot query the table:
   ```sql
   spark-sql> msck repair table tbl2;
   spark-sql> select * from tbl2;
   21/01/08 22:49:13 ERROR SparkSQLDriver: Failed in [select * from tbl2]
   org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/maximgekk/proj/apache-spark/spark-warehouse/tbl2/part=0
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. After the changes, we can query recovered table:
   ```sql
   spark-sql> msck repair table tbl2 sync partitions;
   spark-sql> select * from tbl2;
   1	1
   spark-sql> show partitions tbl2;
   part=1
   ```
   
   ### How was this patch tested?
   By running the modified test suite:
   ```
   $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *MsckRepairTableParserSuite"
   $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *PlanResolutionSuite"
   $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRecoverPartitionsSuite"
   $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRecoverPartitionsParallelSuite"
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-779915997


   **[Test build #135175 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135175/testReport)** for PR 31499 at commit [`7b9ee52`](https://github.com/apache/spark/commit/7b9ee52da7b09f6b073bd388a77940725f36cd1a).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774707998


   @cloud-fan @dongjoon-hyun @viirya @HyukjinKwon May I ask you to review this PR.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r576330642



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/MsckRepairTableSuite.scala
##########
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command.v2
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.execution.command
+
+/**
+ * The class contains tests for the `MSCK REPAIR TABLE` command
+ * to check V2 table catalogs.
+ */
+class MsckRepairTableSuite
+  extends command.MsckRepairTableSuiteBase
+  with CommandSuiteBase {
+
+  // TODO(SPARK-34397): Support v2 `MSCK REPAIR TABLE`

Review comment:
       Thank you for adding this IDed TODO.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-784534813


   Merged to master for Apache Spark 3.2.0.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r576695492



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
##########
@@ -3659,11 +3659,24 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with SQLConfHelper with Logg
    *
    * For example:
    * {{{
-   *   MSCK REPAIR TABLE multi_part_name
+   *   MSCK REPAIR TABLE multi_part_name [[ADD|DROP|SYNC] PARTITIONS]

Review comment:
       Changed to `{ADD|DROP|SYNC}`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #31499:
URL: https://github.com/apache/spark/pull/31499


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r576721877



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/command/MsckRepairTableParserSuite.scala
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import org.apache.spark.sql.catalyst.analysis.{AnalysisTest, UnresolvedTable}
+import org.apache.spark.sql.catalyst.parser.CatalystSqlParser.parsePlan
+import org.apache.spark.sql.catalyst.plans.logical.RepairTable
+import org.apache.spark.sql.test.SharedSparkSession
+
+class MsckRepairTableParserSuite extends AnalysisTest with SharedSparkSession {

Review comment:
       > Is there a reason not to keep in DDLParserSuite of catalyst module?
   
   Actually, there are two `DDLParserSuite`:
   1. In `sql/core`: org.apache.spark.sql.execution.command.DDLParserSuite which checks parsing of DDL commands
   2. In `sql/catalyst`: org.apache.spark.sql.catalyst.parser.DDLParserSuite which checks parsing of everything
   
   Since the parser tests check parsing of particular DDL commands, I moved them closely to unified tests.
   
   > SharedSparkSession is required?
   
   No, it is not. Since it was copy-pasted from other *ParserSuite, I opened the JIRA SPARK-34447 to refactor other tests later (maybe improve other stuff), and to do not touch unrelated code in this PR.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r581786708



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala
##########
@@ -597,11 +597,13 @@ case class PartitionStatistics(numFiles: Int, totalSize: Long)
  * The syntax of this command is:
  * {{{
  *   ALTER TABLE table RECOVER PARTITIONS;
- *   MSCK REPAIR TABLE table;
+ *   MSCK REPAIR TABLE table [{ADD|DROP|SYNC} PARTITIONS];
  * }}}
  */
 case class AlterTableRecoverPartitionsCommand(

Review comment:
       Here is the PR https://github.com/apache/spark/pull/31635 for renaming.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-779383014


   1. For the first your example, `[ NOSCAN | FOR COLUMNS col [ , ... ] | FOR ALL COLUMNS ]` means we can omit all. So, we cannot write like that here.
   2. For the second example, you mean `{}` is correct for this case because `(` and `)` is used as literal. Did I understand correctly?
   
   +1 for your suggestion (2).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774777715


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134991/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774777715


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134991/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31499: [WIP][SPARK-31891][SQL] Support MSCK REPAIR TABLE .. (ADD|DROP|SYNC) PARTITIONS

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774557483


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39544/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774724478


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39574/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r581671180



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/MsckRepairTableSuite.scala
##########
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command.v1
+
+import java.io.File
+
+import org.apache.commons.io.FileUtils
+
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.execution.command
+
+/**
+ * This base suite contains unified tests for the `MSCK REPAIR TABLE` command that
+ * check V1 table catalogs. The tests that cannot run for all V1 catalogs are located in more
+ * specific test suites:
+ *
+ *   - V1 In-Memory catalog:
+ *     `org.apache.spark.sql.execution.command.v1.MsckRepairTableSuite`
+ *   - V1 Hive External catalog:
+ *     `org.apache.spark.sql.hive.execution.command.MsckRepairTableSuite`
+ */
+trait MsckRepairTableSuiteBase extends command.MsckRepairTableSuiteBase {
+  def deletePartitionDir(tableName: String, part: String): Unit = {
+    val partLoc = getPartitionLocation(tableName, part)
+    FileUtils.deleteDirectory(new File(partLoc))
+  }
+
+  test("drop partitions") {
+    withNamespaceAndTable("ns", "tbl") { t =>
+      sql(s"CREATE TABLE $t (col INT, part INT) $defaultUsing PARTITIONED BY (part)")
+      sql(s"INSERT INTO $t PARTITION (part=0) SELECT 0")
+      sql(s"INSERT INTO $t PARTITION (part=1) SELECT 1")
+
+      checkAnswer(spark.table(t), Seq(Row(0, 0), Row(1, 1)))
+      deletePartitionDir(t, "part=1")
+      sql(s"MSCK REPAIR TABLE $t DROP PARTITIONS")
+      checkPartitions(t, Map("part" -> "0"))
+      checkAnswer(spark.table(t), Seq(Row(0, 0)))
+    }
+  }
+
+  test("sync partitions") {
+    withNamespaceAndTable("ns", "tbl") { t =>
+      sql(s"CREATE TABLE $t (col INT, part INT) $defaultUsing PARTITIONED BY (part)")
+      sql(s"INSERT INTO $t PARTITION (part=0) SELECT 0")
+      sql(s"INSERT INTO $t PARTITION (part=1) SELECT 1")
+
+      checkAnswer(sql(s"SELECT col, part FROM $t"), Seq(Row(0, 0), Row(1, 1)))
+      copyPartition(t, "part=0", "part=2")
+      deletePartitionDir(t, "part=0")
+      sql(s"MSCK REPAIR TABLE $t SYNC PARTITIONS")
+      checkPartitions(t, Map("part" -> "1"), Map("part" -> "2"))
+      checkAnswer(sql(s"SELECT col, part FROM $t"), Seq(Row(1, 1), Row(0, 2)))
+    }
+  }

Review comment:
       @cloud-fan If you would like to `MSCK REPAIR TABLE .. ADD PARTITIONS`, we could mix the `v1.AlterTableRecoverPartitionsSuiteBase` to `MsckRepairTableSuiteBase` to run the existing tests automatically.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-784095187


   **[Test build #135374 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135374/testReport)** for PR 31499 at commit [`fefea57`](https://github.com/apache/spark/commit/fefea57d097e64b5b506dea45aac24c537f27abd).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r576683233



##########
File path: docs/sql-ref-syntax-ddl-repair-table.md
##########
@@ -37,6 +37,13 @@ MSCK REPAIR TABLE table_identifier
 
     **Syntax:** `[ database_name. ] table_name`
 
+* **`[[ADD|DROP|SYNC] PARTITIONS]`**

Review comment:
       the syntax is from to the SQL standard but functionality is from Hive.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r576328812



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/command/MsckRepairTableParserSuite.scala
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import org.apache.spark.sql.catalyst.analysis.{AnalysisTest, UnresolvedTable}
+import org.apache.spark.sql.catalyst.parser.CatalystSqlParser.parsePlan
+import org.apache.spark.sql.catalyst.plans.logical.RepairTable
+import org.apache.spark.sql.test.SharedSparkSession
+
+class MsckRepairTableParserSuite extends AnalysisTest with SharedSparkSession {

Review comment:
       - Is there a reason not to keep in `DDLParserSuite` of `catalyst` module?
   - `SharedSparkSession` is required?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r580904273



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala
##########
@@ -791,8 +799,26 @@ case class AlterTableRecoverPartitionsCommand(
       logDebug(s"Recovered ${parts.length} partitions ($done/$total so far)")
     }
   }
-}
 
+  // Drops the partitions that do not exist in the file system
+  private def dropPartitions(catalog: SessionCatalog, fs: FileSystem): Int = {
+    val dropPartSpecs = ThreadUtils.parmap(
+      catalog.listPartitions(tableName),
+      "AlterTableRecoverPartitionsCommand: non-existing partitions",
+      maxThreads = 8) { partition =>
+      partition.storage.locationUri.flatMap { uri =>
+        if (fs.exists(new Path(uri))) None else Some(partition.spec)
+      }
+    }.flatten
+    catalog.dropPartitions(
+      tableName,
+      dropPartSpecs,
+      ignoreIfNotExists = true,
+      purge = false,
+      retainData = true)

Review comment:
       > ... we don't want addition file system calls. Did I understand correctly?
   
   Yep, if we set `retainData` to `true`, the `deleteData` flag will `false` at https://github.com/apache/hive/blob/release-3.1.3-rc0/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java#L4360-L4378 . So, Hive MeteStore will not try to delete the partition folders.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r576682339



##########
File path: docs/sql-ref-syntax-ddl-repair-table.md
##########
@@ -26,7 +26,7 @@ license: |
 ### Syntax
 
 ```sql
-MSCK REPAIR TABLE table_identifier
+MSCK REPAIR TABLE table_identifier [[ADD|DROP|SYNC] PARTITIONS]

Review comment:
       Will change it to `[{ADD|DROP|SYNC} PARTITIONS]`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r576299130



##########
File path: docs/sql-ref-syntax-ddl-repair-table.md
##########
@@ -37,6 +37,13 @@ MSCK REPAIR TABLE table_identifier
 
     **Syntax:** `[ database_name. ] table_name`
 
+* **`[[ADD|DROP|SYNC] PARTITIONS]`**
+
+    * If the option is not specified, `MSCK REPAIR TABLE` adds partitions to the Hive external catalog only.
+    * **ADD**, the command adds new partitions in the catalog for all sub-folder in the base table folder that don't belong to any table partitions.

Review comment:
       Like line 42 (before this line) and line 44 (after this line), `the catalog` -> `the Hive external catalog` consistently?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31499: [WIP][SPARK-31891][SQL] Support MSCK REPAIR TABLE .. (ADD|DROP|SYNC) PARTITIONS

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774549602


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39544/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r581718451



##########
File path: docs/sql-ref-syntax-ddl-repair-table.md
##########
@@ -39,6 +39,13 @@ MSCK REPAIR TABLE table_identifier
 
     **Syntax:** `[ database_name. ] table_name`
 
+* **`{ADD|DROP|SYNC} PARTITIONS`**
+
+    * If specified, `MSCK REPAIR TABLE` only adds partitions to the session catalog.

Review comment:
       Here is the PR https://github.com/apache/spark/pull/31633




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31499: [WIP][SPARK-31891][SQL] Support MSCK REPAIR TABLE .. (ADD|DROP|SYNC) PARTITIONS

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774552426


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39544/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31499: [WIP][SPARK-31891][SQL] Support MSCK REPAIR TABLE .. (ADD|DROP|SYNC) PARTITIONS

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774661841


   **[Test build #134981 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134981/testReport)** for PR 31499 at commit [`1168390`](https://github.com/apache/spark/commit/11683902aa39616adbb8db4ab98b246f286bf833).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r578792008



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala
##########
@@ -597,11 +597,13 @@ case class PartitionStatistics(numFiles: Int, totalSize: Long)
  * The syntax of this command is:
  * {{{
  *   ALTER TABLE table RECOVER PARTITIONS;
- *   MSCK REPAIR TABLE table;
+ *   MSCK REPAIR TABLE table [{ADD|DROP|SYNC} PARTITIONS];
  * }}}
  */
 case class AlterTableRecoverPartitionsCommand(

Review comment:
       Well, it means many additional changes including the doc changes at line 594, `Recover Partitions in ALTER TABLE`. Why don't we do that renaming in another PR separately if you want?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r578772784



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala
##########
@@ -597,11 +597,13 @@ case class PartitionStatistics(numFiles: Int, totalSize: Long)
  * The syntax of this command is:
  * {{{
  *   ALTER TABLE table RECOVER PARTITIONS;
- *   MSCK REPAIR TABLE table;
+ *   MSCK REPAIR TABLE table [{ADD|DROP|SYNC} PARTITIONS];
  * }}}
  */
 case class AlterTableRecoverPartitionsCommand(

Review comment:
       @cloud-fan @dongjoon-hyun Should we rename this to `RepairTableCommand` (or `MsckRepairTableCommad`) since `ALTER TABLE .. RECOVER PARTITIONS` is just a specific case of `MSCK REPAIR TABLE table [{ADD|DROP|SYNC} PARTITIONS]`, and it doesn't support `{DROP|SYNC} PARTITIONS`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31499: [WIP][SPARK-31891][SQL] Support MSCK REPAIR TABLE .. (ADD|DROP|SYNC) PARTITIONS

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774546233


   **[Test build #134961 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134961/testReport)** for PR 31499 at commit [`c313bdb`](https://github.com/apache/spark/commit/c313bdb4309b085701029d8c0cc3a00d4b9c0654).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-784805608


   late LGTM


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-784161384


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39954/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-779408914


   @dongjoon-hyun I have looked at the SQL standard, it uses both notions:
   ```sql
   { UTF8 | UTF16 | UTF32}
   GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY
   ```
   and
   ```sql
   [ INSTANCE | STATIC | CONSTRUCTOR ] METHOD
   ```
   
   It seems `{alternative1 | alternative2 }` means one of the alternatives must be specified, `[alternative1 | alternative2]` - both are optional. 
   
   I do believe we should use the syntax `MSCK REPAIR TABLE .. [ { ADD | DROP | SYNC } PARTITIONS ]` in docs/comments.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774752836


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39574/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31499: [WIP][SPARK-31891][SQL] Support MSCK REPAIR TABLE .. (ADD|DROP|SYNC) PARTITIONS

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774546233


   **[Test build #134961 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134961/testReport)** for PR 31499 at commit [`c313bdb`](https://github.com/apache/spark/commit/c313bdb4309b085701029d8c0cc3a00d4b9c0654).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun edited a comment on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-778525619


   ~I added me as a reviewer in order not to forget this.~ It doesn't work as expected. I assigned it to myself.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r576299652



##########
File path: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
##########
@@ -229,7 +229,8 @@ statement
     | LOAD DATA LOCAL? INPATH path=STRING OVERWRITE? INTO TABLE
         multipartIdentifier partitionSpec?                             #loadData
     | TRUNCATE TABLE multipartIdentifier partitionSpec?                #truncateTable
-    | MSCK REPAIR TABLE multipartIdentifier                            #repairTable
+    | MSCK REPAIR TABLE multipartIdentifier
+      (option=(ADD|DROP|SYNC) PARTITIONS)?                             #repairTable

Review comment:
       indentation?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r576305248



##########
File path: docs/sql-ref-syntax-ddl-repair-table.md
##########
@@ -26,7 +26,7 @@ license: |
 ### Syntax
 
 ```sql
-MSCK REPAIR TABLE table_identifier
+MSCK REPAIR TABLE table_identifier [[ADD|DROP|SYNC] PARTITIONS]

Review comment:
       Well, could you check the PR title together?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774776053


   **[Test build #134991 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134991/testReport)** for PR 31499 at commit [`79ec7a1`](https://github.com/apache/spark/commit/79ec7a162ccef01b0b0b45d69b6a284c90c80c31).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r581644015



##########
File path: docs/sql-ref-syntax-ddl-repair-table.md
##########
@@ -39,6 +39,13 @@ MSCK REPAIR TABLE table_identifier
 
     **Syntax:** `[ database_name. ] table_name`
 
+* **`{ADD|DROP|SYNC} PARTITIONS`**
+
+    * If specified, `MSCK REPAIR TABLE` only adds partitions to the session catalog.

Review comment:
       typo: should be `If not specified`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-779849862


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39756/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r576693104



##########
File path: docs/sql-ref-syntax-ddl-repair-table.md
##########
@@ -37,6 +37,13 @@ MSCK REPAIR TABLE table_identifier
 
     **Syntax:** `[ database_name. ] table_name`
 
+* **`[[ADD|DROP|SYNC] PARTITIONS]`**
+
+    * If the option is not specified, `MSCK REPAIR TABLE` adds partitions to the Hive external catalog only.
+    * **ADD**, the command adds new partitions in the catalog for all sub-folder in the base table folder that don't belong to any table partitions.

Review comment:
       Actually, the command calls the session catalog which is an internal catalog servers as a proxy to an external catalog (In-Memory or Hive). Strictly speaking, the command adds/drops partitions in the session catalog. Let me update the doc to be more precise.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r576708616



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala
##########
@@ -597,11 +597,13 @@ case class PartitionStatistics(numFiles: Int, totalSize: Long)
  * The syntax of this command is:
  * {{{
  *   ALTER TABLE table RECOVER PARTITIONS;
- *   MSCK REPAIR TABLE table;
+ *   MSCK REPAIR TABLE table [[ADD|DROP|SYNC] PARTITIONS];

Review comment:
       Replaced by `{ADD|DROP|SYNC}`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk edited a comment on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
MaxGekk edited a comment on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-779364958


   @dongjoon-hyun Thank you for reviewing this PR. Regarding to syntax for alternatives, I wasn't sure what should I use in docs. I just looked at existing examples:
   
   1. `ANALYZE TABLE` (similar in [EXPLAIN](http://spark.apache.org/docs/latest/sql-ref-syntax-qry-explain.html))
   
   In docs ([link](http://spark.apache.org/docs/latest/sql-ref-syntax-aux-analyze-table.html)):
   ```
   [ NOSCAN | FOR COLUMNS col [ , ... ] | FOR ALL COLUMNS ]
   ```
   In SqlBase.g4:
   https://github.com/apache/spark/blob/ba974ea8e4cc8075056682c2badab5ca64b90047/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L136
   
   2. `SHOW TABLES`
   
   In docs ([link](http://spark.apache.org/docs/latest/sql-ref-syntax-aux-show-tables.html))
   ```
   { FROM | IN }
   ```
   In SqlBase.g4:
   https://github.com/apache/spark/blob/ba974ea8e4cc8075056682c2badab5ca64b90047/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L198
   
   You proposed `(ADD|DROP|SYNC)` but I haven't found any examples with such syntax. Do you know what the syntax the Spark SQL guide uses?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r580906686



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala
##########
@@ -791,8 +799,26 @@ case class AlterTableRecoverPartitionsCommand(
       logDebug(s"Recovered ${parts.length} partitions ($done/$total so far)")
     }
   }
-}
 
+  // Drops the partitions that do not exist in the file system
+  private def dropPartitions(catalog: SessionCatalog, fs: FileSystem): Int = {
+    val dropPartSpecs = ThreadUtils.parmap(
+      catalog.listPartitions(tableName),
+      "AlterTableRecoverPartitionsCommand: non-existing partitions",
+      maxThreads = 8) { partition =>
+      partition.storage.locationUri.flatMap { uri =>
+        if (fs.exists(new Path(uri))) None else Some(partition.spec)
+      }
+    }.flatten
+    catalog.dropPartitions(
+      tableName,
+      dropPartSpecs,
+      ignoreIfNotExists = true,
+      purge = false,
+      retainData = true)

Review comment:
       The same for the `In-Memory` catalog: https://github.com/apache/spark/blob/bfc023501379d28ae2db8708928f4e658ccaa07f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala#L464-L473
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r576328812



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/command/MsckRepairTableParserSuite.scala
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import org.apache.spark.sql.catalyst.analysis.{AnalysisTest, UnresolvedTable}
+import org.apache.spark.sql.catalyst.parser.CatalystSqlParser.parsePlan
+import org.apache.spark.sql.catalyst.plans.logical.RepairTable
+import org.apache.spark.sql.test.SharedSparkSession
+
+class MsckRepairTableParserSuite extends AnalysisTest with SharedSparkSession {

Review comment:
       - Is there a reason not to keep in `DDLParserSuite`?
   - `SharedSparkSession` is required?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-779765039


   **[Test build #135175 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135175/testReport)** for PR 31499 at commit [`7b9ee52`](https://github.com/apache/spark/commit/7b9ee52da7b09f6b073bd388a77940725f36cd1a).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r576335456



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/command/MsckRepairTableSuiteBase.scala
##########
@@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import org.apache.spark.sql.QueryTest
+
+/**
+ * This base suite contains unified tests for the `MSCK REPAIR TABLE` command that

Review comment:
       This doc is very helpful. Thanks.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31499: [WIP][SPARK-31891][SQL] Support MSCK REPAIR TABLE .. (ADD|DROP|SYNC) PARTITIONS

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774639920


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39564/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r576695008



##########
File path: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
##########
@@ -229,7 +229,8 @@ statement
     | LOAD DATA LOCAL? INPATH path=STRING OVERWRITE? INTO TABLE
         multipartIdentifier partitionSpec?                             #loadData
     | TRUNCATE TABLE multipartIdentifier partitionSpec?                #truncateTable
-    | MSCK REPAIR TABLE multipartIdentifier                            #repairTable
+    | MSCK REPAIR TABLE multipartIdentifier
+      (option=(ADD|DROP|SYNC) PARTITIONS)?                             #repairTable

Review comment:
       fixed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31499: [WIP][SPARK-31891][SQL] Support MSCK REPAIR TABLE .. (ADD|DROP|SYNC) PARTITIONS

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774635734


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39564/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774714041


   **[Test build #134991 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134991/testReport)** for PR 31499 at commit [`79ec7a1`](https://github.com/apache/spark/commit/79ec7a162ccef01b0b0b45d69b6a284c90c80c31).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-784533817


   Thank you for updating, @MaxGekk !


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r576321026



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala
##########
@@ -1919,12 +1919,6 @@ class DDLParserSuite extends AnalysisTest {
       "missing 'COLUMNS' at '<EOF>'")
   }
 
-  test("MSCK REPAIR TABLE") {

Review comment:
       ~Do we need to remove this instead of revising?~




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-779788596


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39756/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-784095187


   **[Test build #135374 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135374/testReport)** for PR 31499 at commit [`fefea57`](https://github.com/apache/spark/commit/fefea57d097e64b5b506dea45aac24c537f27abd).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774752836


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39574/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774745498


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39574/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r576687827



##########
File path: docs/sql-ref-syntax-ddl-repair-table.md
##########
@@ -37,6 +37,13 @@ MSCK REPAIR TABLE table_identifier
 
     **Syntax:** `[ database_name. ] table_name`
 
+* **`[[ADD|DROP|SYNC] PARTITIONS]`**
+
+    * If the option is not specified, `MSCK REPAIR TABLE` adds partitions to the Hive external catalog only.

Review comment:
       I will remove `the option`, I found it in another command: http://spark.apache.org/docs/latest/sql-ref-syntax-aux-analyze-table.html
   ```
   If no analyze option is specified, ANALYZE TABLE collects ...
   ```
   but docs for other commands say `If specified, ...`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-776473007


   @dongjoon-hyun @viirya Are you ok with the changes?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r581647130



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/MsckRepairTableSuite.scala
##########
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command.v1
+
+import java.io.File
+
+import org.apache.commons.io.FileUtils
+
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.execution.command
+
+/**
+ * This base suite contains unified tests for the `MSCK REPAIR TABLE` command that
+ * check V1 table catalogs. The tests that cannot run for all V1 catalogs are located in more
+ * specific test suites:
+ *
+ *   - V1 In-Memory catalog:
+ *     `org.apache.spark.sql.execution.command.v1.MsckRepairTableSuite`
+ *   - V1 Hive External catalog:
+ *     `org.apache.spark.sql.hive.execution.command.MsckRepairTableSuite`
+ */
+trait MsckRepairTableSuiteBase extends command.MsckRepairTableSuiteBase {
+  def deletePartitionDir(tableName: String, part: String): Unit = {
+    val partLoc = getPartitionLocation(tableName, part)
+    FileUtils.deleteDirectory(new File(partLoc))
+  }
+
+  test("drop partitions") {
+    withNamespaceAndTable("ns", "tbl") { t =>
+      sql(s"CREATE TABLE $t (col INT, part INT) $defaultUsing PARTITIONED BY (part)")
+      sql(s"INSERT INTO $t PARTITION (part=0) SELECT 0")
+      sql(s"INSERT INTO $t PARTITION (part=1) SELECT 1")
+
+      checkAnswer(spark.table(t), Seq(Row(0, 0), Row(1, 1)))
+      deletePartitionDir(t, "part=1")
+      sql(s"MSCK REPAIR TABLE $t DROP PARTITIONS")
+      checkPartitions(t, Map("part" -> "0"))
+      checkAnswer(spark.table(t), Seq(Row(0, 0)))
+    }
+  }
+
+  test("sync partitions") {
+    withNamespaceAndTable("ns", "tbl") { t =>
+      sql(s"CREATE TABLE $t (col INT, part INT) $defaultUsing PARTITIONED BY (part)")
+      sql(s"INSERT INTO $t PARTITION (part=0) SELECT 0")
+      sql(s"INSERT INTO $t PARTITION (part=1) SELECT 1")
+
+      checkAnswer(sql(s"SELECT col, part FROM $t"), Seq(Row(0, 0), Row(1, 1)))
+      copyPartition(t, "part=0", "part=2")
+      deletePartitionDir(t, "part=0")
+      sql(s"MSCK REPAIR TABLE $t SYNC PARTITIONS")
+      checkPartitions(t, Map("part" -> "1"), Map("part" -> "2"))
+      checkAnswer(sql(s"SELECT col, part FROM $t"), Seq(Row(1, 1), Row(0, 2)))
+    }
+  }

Review comment:
       shall we test ADD?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r576322871



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala
##########
@@ -597,11 +597,13 @@ case class PartitionStatistics(numFiles: Int, totalSize: Long)
  * The syntax of this command is:
  * {{{
  *   ALTER TABLE table RECOVER PARTITIONS;
- *   MSCK REPAIR TABLE table;
+ *   MSCK REPAIR TABLE table [[ADD|DROP|SYNC] PARTITIONS];

Review comment:
       ditto. `[ADD|DROP|SYNC]` -> `(ADD|DROP|SYNC)`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-779849862


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39756/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-779838044


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39756/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r576321026



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala
##########
@@ -1919,12 +1919,6 @@ class DDLParserSuite extends AnalysisTest {
       "missing 'COLUMNS' at '<EOF>'")
   }
 
-  test("MSCK REPAIR TABLE") {

Review comment:
       Do we need to remove this instead of revising?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r581644177



##########
File path: docs/sql-ref-syntax-ddl-repair-table.md
##########
@@ -39,6 +39,13 @@ MSCK REPAIR TABLE table_identifier
 
     **Syntax:** `[ database_name. ] table_name`
 
+* **`{ADD|DROP|SYNC} PARTITIONS`**
+
+    * If specified, `MSCK REPAIR TABLE` only adds partitions to the session catalog.

Review comment:
       I think it's better to put it in the end, and say `If not specified, ADD is the default.`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31499: [WIP][SPARK-31891][SQL] Support MSCK REPAIR TABLE .. (ADD|DROP|SYNC) PARTITIONS

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774664451


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134981/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r576708009



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala
##########
@@ -366,8 +366,12 @@ class ResolveSessionCatalog(val catalogManager: CatalogManager)
     case AnalyzeColumn(ResolvedV1TableOrViewIdentifier(ident), columnNames, allColumns) =>
       AnalyzeColumnCommand(ident.asTableIdentifier, columnNames, allColumns)
 
-    case RepairTable(ResolvedV1TableIdentifier(ident)) =>
-      AlterTableRecoverPartitionsCommand(ident.asTableIdentifier, "MSCK REPAIR TABLE")
+    case RepairTable(ResolvedV1TableIdentifier(ident), addPartitions, dropPartitions) =>
+      AlterTableRecoverPartitionsCommand(
+        ident.asTableIdentifier,
+        addPartitions,
+        dropPartitions,
+        "MSCK REPAIR TABLE")

Review comment:
       Unfortunately, we lost the info in `UnresolvedTable` -> `ResolvedTable` but we can reconstruct the command name from the flags `addPartitions` and `dropPartitions`. Though not the original one because `MSCK REPAIR TABLE table` has `addPartitions = true, dropPartitions  = false` can be re-constructed as `MSCK REPAIR TABLE table ADD PARTITIONS`. Are you ok with that?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-779364958


   @dongjoon-hyun Thank you for reviewing this PR. Regarding to syntax for alternatives, I wasn't sure what should I use in docs. I just looked at existing examples:
   
   1. `ANALYZE TABLE` (similar in [EXPLAIN](http://spark.apache.org/docs/latest/sql-ref-syntax-qry-explain.html))
   
   In docs ([link](http://spark.apache.org/docs/latest/sql-ref-syntax-aux-analyze-table.html)):
   ```
   [ NOSCAN | FOR COLUMNS col [ , ... ] | FOR ALL COLUMNS ]
   ```
   In SqlBase.g4:
   https://github.com/apache/spark/blob/ba974ea8e4cc8075056682c2badab5ca64b90047/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L136
   
   2. `SHOW TABLES`
   
   In docs ([link](http://spark.apache.org/docs/latest/sql-ref-syntax-aux-show-tables.html))
   ```
   { FROM | IN }
   ```
   In SqlBase.g4:
   https://github.com/apache/spark/blob/ba974ea8e4cc8075056682c2badab5ca64b90047/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L198
   
   You proposed `(ADD|DROP|SYNC)` but I haven't found any examples with such syntax. Do you know what's the syntax do public doc use?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-783098295


   @dongjoon-hyun Could you take a look at this PR one more time, please.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r581666689



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/MsckRepairTableSuite.scala
##########
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command.v1
+
+import java.io.File
+
+import org.apache.commons.io.FileUtils
+
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.execution.command
+
+/**
+ * This base suite contains unified tests for the `MSCK REPAIR TABLE` command that
+ * check V1 table catalogs. The tests that cannot run for all V1 catalogs are located in more
+ * specific test suites:
+ *
+ *   - V1 In-Memory catalog:
+ *     `org.apache.spark.sql.execution.command.v1.MsckRepairTableSuite`
+ *   - V1 Hive External catalog:
+ *     `org.apache.spark.sql.hive.execution.command.MsckRepairTableSuite`
+ */
+trait MsckRepairTableSuiteBase extends command.MsckRepairTableSuiteBase {
+  def deletePartitionDir(tableName: String, part: String): Unit = {
+    val partLoc = getPartitionLocation(tableName, part)
+    FileUtils.deleteDirectory(new File(partLoc))
+  }
+
+  test("drop partitions") {
+    withNamespaceAndTable("ns", "tbl") { t =>
+      sql(s"CREATE TABLE $t (col INT, part INT) $defaultUsing PARTITIONED BY (part)")
+      sql(s"INSERT INTO $t PARTITION (part=0) SELECT 0")
+      sql(s"INSERT INTO $t PARTITION (part=1) SELECT 1")
+
+      checkAnswer(spark.table(t), Seq(Row(0, 0), Row(1, 1)))
+      deletePartitionDir(t, "part=1")
+      sql(s"MSCK REPAIR TABLE $t DROP PARTITIONS")
+      checkPartitions(t, Map("part" -> "0"))
+      checkAnswer(spark.table(t), Seq(Row(0, 0)))
+    }
+  }
+
+  test("sync partitions") {
+    withNamespaceAndTable("ns", "tbl") { t =>
+      sql(s"CREATE TABLE $t (col INT, part INT) $defaultUsing PARTITIONED BY (part)")
+      sql(s"INSERT INTO $t PARTITION (part=0) SELECT 0")
+      sql(s"INSERT INTO $t PARTITION (part=1) SELECT 1")
+
+      checkAnswer(sql(s"SELECT col, part FROM $t"), Seq(Row(0, 0), Row(1, 1)))
+      copyPartition(t, "part=0", "part=2")
+      deletePartitionDir(t, "part=0")
+      sql(s"MSCK REPAIR TABLE $t SYNC PARTITIONS")
+      checkPartitions(t, Map("part" -> "1"), Map("part" -> "2"))
+      checkAnswer(sql(s"SELECT col, part FROM $t"), Seq(Row(1, 1), Row(0, 2)))
+    }
+  }

Review comment:
       We already have many tests for ADD in `AlterTableRecoverPartitionsSuite` since `ALTER TABLE .. RECOVER PARTITIONS` is equal to `MSCK REPAIR TABLE .. ADD PARTITIONS` semantically.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r576294124



##########
File path: docs/sql-ref-syntax-ddl-repair-table.md
##########
@@ -37,6 +37,13 @@ MSCK REPAIR TABLE table_identifier
 
     **Syntax:** `[ database_name. ] table_name`
 
+* **`[[ADD|DROP|SYNC] PARTITIONS]`**

Review comment:
       Is the syntax and function aiming to be identical with `HIVE-17824`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31499: [WIP][SPARK-31891][SQL] Support MSCK REPAIR TABLE .. (ADD|DROP|SYNC) PARTITIONS

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774570682


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134961/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-784277158


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135374/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-778525224


   Sorry for being late, @MaxGekk . I'll take a look at this during weekend.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-784257684


   **[Test build #135374 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135374/testReport)** for PR 31499 at commit [`fefea57`](https://github.com/apache/spark/commit/fefea57d097e64b5b506dea45aac24c537f27abd).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31499: [WIP][SPARK-31891][SQL] Support MSCK REPAIR TABLE .. (ADD|DROP|SYNC) PARTITIONS

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774664451


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134981/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r581670571



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/MsckRepairTableSuite.scala
##########
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command.v1
+
+import java.io.File
+
+import org.apache.commons.io.FileUtils
+
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.execution.command
+
+/**
+ * This base suite contains unified tests for the `MSCK REPAIR TABLE` command that
+ * check V1 table catalogs. The tests that cannot run for all V1 catalogs are located in more
+ * specific test suites:
+ *
+ *   - V1 In-Memory catalog:
+ *     `org.apache.spark.sql.execution.command.v1.MsckRepairTableSuite`
+ *   - V1 Hive External catalog:
+ *     `org.apache.spark.sql.hive.execution.command.MsckRepairTableSuite`
+ */
+trait MsckRepairTableSuiteBase extends command.MsckRepairTableSuiteBase {
+  def deletePartitionDir(tableName: String, part: String): Unit = {
+    val partLoc = getPartitionLocation(tableName, part)
+    FileUtils.deleteDirectory(new File(partLoc))
+  }
+
+  test("drop partitions") {
+    withNamespaceAndTable("ns", "tbl") { t =>
+      sql(s"CREATE TABLE $t (col INT, part INT) $defaultUsing PARTITIONED BY (part)")
+      sql(s"INSERT INTO $t PARTITION (part=0) SELECT 0")
+      sql(s"INSERT INTO $t PARTITION (part=1) SELECT 1")
+
+      checkAnswer(spark.table(t), Seq(Row(0, 0), Row(1, 1)))
+      deletePartitionDir(t, "part=1")
+      sql(s"MSCK REPAIR TABLE $t DROP PARTITIONS")
+      checkPartitions(t, Map("part" -> "0"))
+      checkAnswer(spark.table(t), Seq(Row(0, 0)))
+    }
+  }
+
+  test("sync partitions") {
+    withNamespaceAndTable("ns", "tbl") { t =>
+      sql(s"CREATE TABLE $t (col INT, part INT) $defaultUsing PARTITIONED BY (part)")
+      sql(s"INSERT INTO $t PARTITION (part=0) SELECT 0")
+      sql(s"INSERT INTO $t PARTITION (part=1) SELECT 1")
+
+      checkAnswer(sql(s"SELECT col, part FROM $t"), Seq(Row(0, 0), Row(1, 1)))
+      copyPartition(t, "part=0", "part=2")
+      deletePartitionDir(t, "part=0")
+      sql(s"MSCK REPAIR TABLE $t SYNC PARTITIONS")
+      checkPartitions(t, Map("part" -> "1"), Map("part" -> "2"))
+      checkAnswer(sql(s"SELECT col, part FROM $t"), Seq(Row(1, 1), Row(0, 2)))
+    }
+  }

Review comment:
       I see




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r580765104



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala
##########
@@ -791,8 +799,26 @@ case class AlterTableRecoverPartitionsCommand(
       logDebug(s"Recovered ${parts.length} partitions ($done/$total so far)")
     }
   }
-}
 
+  // Drops the partitions that do not exist in the file system
+  private def dropPartitions(catalog: SessionCatalog, fs: FileSystem): Int = {
+    val dropPartSpecs = ThreadUtils.parmap(
+      catalog.listPartitions(tableName),
+      "AlterTableRecoverPartitionsCommand: non-existing partitions",
+      maxThreads = 8) { partition =>
+      partition.storage.locationUri.flatMap { uri =>
+        if (fs.exists(new Path(uri))) None else Some(partition.spec)
+      }
+    }.flatten
+    catalog.dropPartitions(
+      tableName,
+      dropPartSpecs,
+      ignoreIfNotExists = true,
+      purge = false,
+      retainData = true)

Review comment:
       Could you add a comment about the reason why we use `retainData=true`? I guess the reason is that `fs.exists(..)` is already `false` and we don't want addition file system calls. Did I understand correctly?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31499: [WIP][SPARK-31891][SQL] Support MSCK REPAIR TABLE .. (ADD|DROP|SYNC) PARTITIONS

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774626893


   **[Test build #134981 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134981/testReport)** for PR 31499 at commit [`1168390`](https://github.com/apache/spark/commit/11683902aa39616adbb8db4ab98b246f286bf833).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-776459632


   @cloud-fan @HyukjinKwon Any objections to the changes?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31499: [WIP][SPARK-31891][SQL] Support MSCK REPAIR TABLE .. (ADD|DROP|SYNC) PARTITIONS

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774567901


   **[Test build #134961 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134961/testReport)** for PR 31499 at commit [`c313bdb`](https://github.com/apache/spark/commit/c313bdb4309b085701029d8c0cc3a00d4b9c0654).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `case class RepairTable(`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r576303589



##########
File path: docs/sql-ref-syntax-ddl-repair-table.md
##########
@@ -26,7 +26,7 @@ license: |
 ### Syntax
 
 ```sql
-MSCK REPAIR TABLE table_identifier
+MSCK REPAIR TABLE table_identifier [[ADD|DROP|SYNC] PARTITIONS]

Review comment:
       This looks inconsistent from the implemented one. Do you want `[[ADD|DROP|SYNC] PARTITIONS]` -> `[(ADD|DROP|SYNC) PARTITIONS]` because we always need one of `ADD/DROP/SYNC`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-784145106


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39954/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r576298299



##########
File path: docs/sql-ref-syntax-ddl-repair-table.md
##########
@@ -37,6 +37,13 @@ MSCK REPAIR TABLE table_identifier
 
     **Syntax:** `[ database_name. ] table_name`
 
+* **`[[ADD|DROP|SYNC] PARTITIONS]`**
+
+    * If the option is not specified, `MSCK REPAIR TABLE` adds partitions to the Hive external catalog only.

Review comment:
       `the option` sounds a little mismatched because this is a SQL syntax. Are you mentioned the optional syntaxes like this in our SQL docs?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r576322208



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala
##########
@@ -366,8 +366,12 @@ class ResolveSessionCatalog(val catalogManager: CatalogManager)
     case AnalyzeColumn(ResolvedV1TableOrViewIdentifier(ident), columnNames, allColumns) =>
       AnalyzeColumnCommand(ident.asTableIdentifier, columnNames, allColumns)
 
-    case RepairTable(ResolvedV1TableIdentifier(ident)) =>
-      AlterTableRecoverPartitionsCommand(ident.asTableIdentifier, "MSCK REPAIR TABLE")
+    case RepairTable(ResolvedV1TableIdentifier(ident), addPartitions, dropPartitions) =>
+      AlterTableRecoverPartitionsCommand(
+        ident.asTableIdentifier,
+        addPartitions,
+        dropPartitions,
+        "MSCK REPAIR TABLE")

Review comment:
       Just a question: can we propagate the original `commandName` instead of having `"MSCK REPAIR TABLE"`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-779765039


   **[Test build #135175 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135175/testReport)** for PR 31499 at commit [`7b9ee52`](https://github.com/apache/spark/commit/7b9ee52da7b09f6b073bd388a77940725f36cd1a).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r581672432



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/MsckRepairTableSuite.scala
##########
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command.v1
+
+import java.io.File
+
+import org.apache.commons.io.FileUtils
+
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.execution.command
+
+/**
+ * This base suite contains unified tests for the `MSCK REPAIR TABLE` command that
+ * check V1 table catalogs. The tests that cannot run for all V1 catalogs are located in more
+ * specific test suites:
+ *
+ *   - V1 In-Memory catalog:
+ *     `org.apache.spark.sql.execution.command.v1.MsckRepairTableSuite`
+ *   - V1 Hive External catalog:
+ *     `org.apache.spark.sql.hive.execution.command.MsckRepairTableSuite`
+ */
+trait MsckRepairTableSuiteBase extends command.MsckRepairTableSuiteBase {
+  def deletePartitionDir(tableName: String, part: String): Unit = {
+    val partLoc = getPartitionLocation(tableName, part)
+    FileUtils.deleteDirectory(new File(partLoc))
+  }
+
+  test("drop partitions") {
+    withNamespaceAndTable("ns", "tbl") { t =>
+      sql(s"CREATE TABLE $t (col INT, part INT) $defaultUsing PARTITIONED BY (part)")
+      sql(s"INSERT INTO $t PARTITION (part=0) SELECT 0")
+      sql(s"INSERT INTO $t PARTITION (part=1) SELECT 1")
+
+      checkAnswer(spark.table(t), Seq(Row(0, 0), Row(1, 1)))
+      deletePartitionDir(t, "part=1")
+      sql(s"MSCK REPAIR TABLE $t DROP PARTITIONS")
+      checkPartitions(t, Map("part" -> "0"))
+      checkAnswer(spark.table(t), Seq(Row(0, 0)))
+    }
+  }
+
+  test("sync partitions") {
+    withNamespaceAndTable("ns", "tbl") { t =>
+      sql(s"CREATE TABLE $t (col INT, part INT) $defaultUsing PARTITIONED BY (part)")
+      sql(s"INSERT INTO $t PARTITION (part=0) SELECT 0")
+      sql(s"INSERT INTO $t PARTITION (part=1) SELECT 1")
+
+      checkAnswer(sql(s"SELECT col, part FROM $t"), Seq(Row(0, 0), Row(1, 1)))
+      copyPartition(t, "part=0", "part=2")
+      deletePartitionDir(t, "part=0")
+      sql(s"MSCK REPAIR TABLE $t SYNC PARTITIONS")
+      checkPartitions(t, Map("part" -> "1"), Map("part" -> "2"))
+      checkAnswer(sql(s"SELECT col, part FROM $t"), Seq(Row(1, 1), Row(0, 2)))
+    }
+  }

Review comment:
       no need to




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r576304654



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
##########
@@ -3659,11 +3659,24 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with SQLConfHelper with Logg
    *
    * For example:
    * {{{
-   *   MSCK REPAIR TABLE multi_part_name
+   *   MSCK REPAIR TABLE multi_part_name [[ADD|DROP|SYNC] PARTITIONS]

Review comment:
       `[ADD|DROP|SYNC]` -> `(ADD|DROP|SYNC)`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774714041


   **[Test build #134991 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134991/testReport)** for PR 31499 at commit [`79ec7a1`](https://github.com/apache/spark/commit/79ec7a162ccef01b0b0b45d69b6a284c90c80c31).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r581671180



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/MsckRepairTableSuite.scala
##########
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command.v1
+
+import java.io.File
+
+import org.apache.commons.io.FileUtils
+
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.execution.command
+
+/**
+ * This base suite contains unified tests for the `MSCK REPAIR TABLE` command that
+ * check V1 table catalogs. The tests that cannot run for all V1 catalogs are located in more
+ * specific test suites:
+ *
+ *   - V1 In-Memory catalog:
+ *     `org.apache.spark.sql.execution.command.v1.MsckRepairTableSuite`
+ *   - V1 Hive External catalog:
+ *     `org.apache.spark.sql.hive.execution.command.MsckRepairTableSuite`
+ */
+trait MsckRepairTableSuiteBase extends command.MsckRepairTableSuiteBase {
+  def deletePartitionDir(tableName: String, part: String): Unit = {
+    val partLoc = getPartitionLocation(tableName, part)
+    FileUtils.deleteDirectory(new File(partLoc))
+  }
+
+  test("drop partitions") {
+    withNamespaceAndTable("ns", "tbl") { t =>
+      sql(s"CREATE TABLE $t (col INT, part INT) $defaultUsing PARTITIONED BY (part)")
+      sql(s"INSERT INTO $t PARTITION (part=0) SELECT 0")
+      sql(s"INSERT INTO $t PARTITION (part=1) SELECT 1")
+
+      checkAnswer(spark.table(t), Seq(Row(0, 0), Row(1, 1)))
+      deletePartitionDir(t, "part=1")
+      sql(s"MSCK REPAIR TABLE $t DROP PARTITIONS")
+      checkPartitions(t, Map("part" -> "0"))
+      checkAnswer(spark.table(t), Seq(Row(0, 0)))
+    }
+  }
+
+  test("sync partitions") {
+    withNamespaceAndTable("ns", "tbl") { t =>
+      sql(s"CREATE TABLE $t (col INT, part INT) $defaultUsing PARTITIONED BY (part)")
+      sql(s"INSERT INTO $t PARTITION (part=0) SELECT 0")
+      sql(s"INSERT INTO $t PARTITION (part=1) SELECT 1")
+
+      checkAnswer(sql(s"SELECT col, part FROM $t"), Seq(Row(0, 0), Row(1, 1)))
+      copyPartition(t, "part=0", "part=2")
+      deletePartitionDir(t, "part=0")
+      sql(s"MSCK REPAIR TABLE $t SYNC PARTITIONS")
+      checkPartitions(t, Map("part" -> "1"), Map("part" -> "2"))
+      checkAnswer(sql(s"SELECT col, part FROM $t"), Seq(Row(1, 1), Row(0, 2)))
+    }
+  }

Review comment:
       @cloud-fan If you would like to test `MSCK REPAIR TABLE .. ADD PARTITIONS` explicitly, we could mix the `v1.AlterTableRecoverPartitionsSuiteBase` to `MsckRepairTableSuiteBase` to run the existing tests automatically.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-779927501


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135175/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-784131407


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39954/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31499: [WIP][SPARK-31891][SQL] Support MSCK REPAIR TABLE .. (ADD|DROP|SYNC) PARTITIONS

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774626893


   **[Test build #134981 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134981/testReport)** for PR 31499 at commit [`1168390`](https://github.com/apache/spark/commit/11683902aa39616adbb8db4ab98b246f286bf833).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31499: [WIP][SPARK-31891][SQL] Support MSCK REPAIR TABLE .. (ADD|DROP|SYNC) PARTITIONS

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774557483


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39544/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r578792008



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala
##########
@@ -597,11 +597,13 @@ case class PartitionStatistics(numFiles: Int, totalSize: Long)
  * The syntax of this command is:
  * {{{
  *   ALTER TABLE table RECOVER PARTITIONS;
- *   MSCK REPAIR TABLE table;
+ *   MSCK REPAIR TABLE table [{ADD|DROP|SYNC} PARTITIONS];
  * }}}
  */
 case class AlterTableRecoverPartitionsCommand(

Review comment:
       Well, it means many additional changes including the doc changes at line 594, `Recover Partitions in ALTER TABLE`. Why don't we do that renaming in another PR separately if you want? For example, like #31584 ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31499: [WIP][SPARK-31891][SQL] Support MSCK REPAIR TABLE .. (ADD|DROP|SYNC) PARTITIONS

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774570682


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134961/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-779927501


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135175/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [ADD|DROP|SYNC] PARTITIONS`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-778525619


   I added me as a reviewer in order not to forget this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31499: [WIP][SPARK-31891][SQL] Support MSCK REPAIR TABLE .. (ADD|DROP|SYNC) PARTITIONS

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774631219


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39564/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31499:
URL: https://github.com/apache/spark/pull/31499#discussion_r581666689



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/MsckRepairTableSuite.scala
##########
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command.v1
+
+import java.io.File
+
+import org.apache.commons.io.FileUtils
+
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.execution.command
+
+/**
+ * This base suite contains unified tests for the `MSCK REPAIR TABLE` command that
+ * check V1 table catalogs. The tests that cannot run for all V1 catalogs are located in more
+ * specific test suites:
+ *
+ *   - V1 In-Memory catalog:
+ *     `org.apache.spark.sql.execution.command.v1.MsckRepairTableSuite`
+ *   - V1 Hive External catalog:
+ *     `org.apache.spark.sql.hive.execution.command.MsckRepairTableSuite`
+ */
+trait MsckRepairTableSuiteBase extends command.MsckRepairTableSuiteBase {
+  def deletePartitionDir(tableName: String, part: String): Unit = {
+    val partLoc = getPartitionLocation(tableName, part)
+    FileUtils.deleteDirectory(new File(partLoc))
+  }
+
+  test("drop partitions") {
+    withNamespaceAndTable("ns", "tbl") { t =>
+      sql(s"CREATE TABLE $t (col INT, part INT) $defaultUsing PARTITIONED BY (part)")
+      sql(s"INSERT INTO $t PARTITION (part=0) SELECT 0")
+      sql(s"INSERT INTO $t PARTITION (part=1) SELECT 1")
+
+      checkAnswer(spark.table(t), Seq(Row(0, 0), Row(1, 1)))
+      deletePartitionDir(t, "part=1")
+      sql(s"MSCK REPAIR TABLE $t DROP PARTITIONS")
+      checkPartitions(t, Map("part" -> "0"))
+      checkAnswer(spark.table(t), Seq(Row(0, 0)))
+    }
+  }
+
+  test("sync partitions") {
+    withNamespaceAndTable("ns", "tbl") { t =>
+      sql(s"CREATE TABLE $t (col INT, part INT) $defaultUsing PARTITIONED BY (part)")
+      sql(s"INSERT INTO $t PARTITION (part=0) SELECT 0")
+      sql(s"INSERT INTO $t PARTITION (part=1) SELECT 1")
+
+      checkAnswer(sql(s"SELECT col, part FROM $t"), Seq(Row(0, 0), Row(1, 1)))
+      copyPartition(t, "part=0", "part=2")
+      deletePartitionDir(t, "part=0")
+      sql(s"MSCK REPAIR TABLE $t SYNC PARTITIONS")
+      checkPartitions(t, Map("part" -> "1"), Map("part" -> "2"))
+      checkAnswer(sql(s"SELECT col, part FROM $t"), Seq(Row(1, 1), Row(0, 2)))
+    }
+  }

Review comment:
       We already have many tests for ADD in `AlterTableRecoverPartitionsSuite` since `ALTER TABLE .. RECOVER PARTITIONS` maps to `MSCK REPAIR TABLE .. ADD PARTITIONS`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-784277158


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135374/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-783870399


   Oops. Sorry, @MaxGekk . Could you rebase to the master once more please?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31499: [WIP][SPARK-31891][SQL] Support MSCK REPAIR TABLE .. (ADD|DROP|SYNC) PARTITIONS

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-774639920


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39564/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31499: [SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]`

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31499:
URL: https://github.com/apache/spark/pull/31499#issuecomment-784161384


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39954/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org