You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/11/21 15:11:58 UTC

[GitHub] [spark] MaxGekk opened a new pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

MaxGekk opened a new pull request #30452:
URL: https://github.com/apache/spark/pull/30452


   ### What changes were proposed in this pull request?
   1. Add new method `listPartitionByNames` to the `SupportsPartitionManagement` interface. It allows to list partitions by partition names and their values.
   2. Implement new method in `InMemoryPartitionTable` which is used in DSv2 tests. 
   
   ### Why are the changes needed?
   Currently, the `SupportsPartitionManagement` interface exposes only `listPartitionIdentifiers` which allows to list partitions by partition values, and requires to specify all values for partition schema fields. This restriction does not allow to list partitions by some of partition names (not all of them).
   
   For example, the table `tableA` is partitioned by two column `year` and `month`
   ```
   CREATE TABLE tableA (price int, year int, month int)
   USING _
   partitioned by (year, month)
   ```
   and has the following partitions:
   ```
   PARTITION(year = 2015, month = 1)
   PARTITION(year = 2015, month = 2)
   PARTITION(year = 2016, month = 2)
   PARTITION(year = 2016, month = 3)
   ```
   If we want to list all partitions with `month = 2`, we have to specify `year` for **listPartitionIdentifiers()** which not always possible as we don't know all `year` values in advance. New method **listPartitionByNames()** allows to specify partition values only for `month`, and get two partitions:
   ```
   PARTITION(year = 2015, month = 2)
   PARTITION(year = 2016, month = 2)
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   By running the affected test suite `SupportsPartitionManagementSuite`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30452:
URL: https://github.com/apache/spark/pull/30452#issuecomment-731627252


   **[Test build #131472 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131472/testReport)** for PR 30452 at commit [`3f20ee8`](https://github.com/apache/spark/commit/3f20ee8289107f10a65165e4a6dbb69250039efa).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] stczwd commented on pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
stczwd commented on pull request #30452:
URL: https://github.com/apache/spark/pull/30452#issuecomment-734020787


   > Since `SupportsPartitionManagement` already have the API `partitionSchema`, which means that the implementations will pick a name for partition transforms, I think it's OK to use `String[]` in the `listPartitionIdentifiers` API parameter.
   
   Yeah, I prefer extend the `listPartitionIdentifiers` instead of add new API `listPartitionsByNames`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #30452:
URL: https://github.com/apache/spark/pull/30452#discussion_r528608060



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsPartitionManagement.java
##########
@@ -106,10 +106,19 @@ void replacePartitionMetadata(
         throws UnsupportedOperationException;
 
     /**
-     * List the identifiers of all partitions that contains the ident in a table.
+     * List the identifiers of all partitions that have the ident prefix in a table.
      *
      * @param ident a prefix of partition identifier
      * @return an array of Identifiers for the partitions
      */
     InternalRow[] listPartitionIdentifiers(InternalRow ident);
+
+    /**
+     * List the identifiers of all partitions that match to the ident by names.
+     *
+     * @param names the names of partition values in the identifier.
+     * @param ident a partition identifier values.
+     * @return an array of Identifiers for the partitions
+     */
+    InternalRow[] listPartitionByNames(String[] names, InternalRow ident);

Review comment:
       > Yea we should remove the old API as it's not released yet.
   
   Frankly speaking, I would remove `listPartitionIdentifiers()` separately as this requires unrelated changes to list partition by names.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30452:
URL: https://github.com/apache/spark/pull/30452#issuecomment-731601531






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30452:
URL: https://github.com/apache/spark/pull/30452#issuecomment-731601531






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan closed pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #30452:
URL: https://github.com/apache/spark/pull/30452


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30452:
URL: https://github.com/apache/spark/pull/30452#issuecomment-731592151


   **[Test build #131472 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131472/testReport)** for PR 30452 at commit [`3f20ee8`](https://github.com/apache/spark/commit/3f20ee8289107f10a65165e4a6dbb69250039efa).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #30452:
URL: https://github.com/apache/spark/pull/30452#discussion_r528606230



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsPartitionManagement.java
##########
@@ -106,10 +106,19 @@ void replacePartitionMetadata(
         throws UnsupportedOperationException;
 
     /**
-     * List the identifiers of all partitions that contains the ident in a table.
+     * List the identifiers of all partitions that have the ident prefix in a table.
      *
      * @param ident a prefix of partition identifier
      * @return an array of Identifiers for the partitions
      */
     InternalRow[] listPartitionIdentifiers(InternalRow ident);
+
+    /**
+     * List the identifiers of all partitions that match to the ident by names.
+     *
+     * @param names the names of partition values in the identifier.
+     * @param ident a partition identifier values.
+     * @return an array of Identifiers for the partitions
+     */
+    InternalRow[] listPartitionByNames(String[] names, InternalRow ident);

Review comment:
       > Looks a bit odd that listPartitionByNames interface is added but not used in the Spark internal side.
   
   @HyukjinKwon `listPartitionByNames()` will be used by V2 commands like `SHOW PARTITIONS` and `SHOW TABLE EXTENDED`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #30452:
URL: https://github.com/apache/spark/pull/30452#issuecomment-733683326


   merging to master, thanks!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #30452:
URL: https://github.com/apache/spark/pull/30452#issuecomment-733683029


   I'm merging it to unblock the following work. @stczwd  @rdblue please leave comments if you have any concerns, so that we can address them.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on pull request #30452:
URL: https://github.com/apache/spark/pull/30452#issuecomment-731964256


   > ... shall we just partition index in the API instead?
   
   @cloud-fan Sure. I see at least two variants of the implementation:
   - Pass an array of indexes, or
   - a [BitSet](https://docs.oracle.com/javase/8/docs/api/java/util/BitSet.html)
   
   The second one could save some memory, I guess.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30452:
URL: https://github.com/apache/spark/pull/30452#issuecomment-731592151


   **[Test build #131472 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131472/testReport)** for PR 30452 at commit [`3f20ee8`](https://github.com/apache/spark/commit/3f20ee8289107f10a65165e4a6dbb69250039efa).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30452:
URL: https://github.com/apache/spark/pull/30452#discussion_r528654216



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsPartitionManagement.java
##########
@@ -106,10 +106,19 @@ void replacePartitionMetadata(
         throws UnsupportedOperationException;
 
     /**
-     * List the identifiers of all partitions that contains the ident in a table.
+     * List the identifiers of all partitions that have the ident prefix in a table.
      *
      * @param ident a prefix of partition identifier
      * @return an array of Identifiers for the partitions
      */
     InternalRow[] listPartitionIdentifiers(InternalRow ident);
+
+    /**
+     * List the identifiers of all partitions that match to the ident by names.
+     *
+     * @param names the names of partition values in the identifier.
+     * @param ident a partition identifier values.
+     * @return an array of Identifiers for the partitions
+     */
+    InternalRow[] listPartitionByNames(String[] names, InternalRow ident);

Review comment:
       We can remove the old one in your next PR.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #30452:
URL: https://github.com/apache/spark/pull/30452#discussion_r528606230



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsPartitionManagement.java
##########
@@ -106,10 +106,19 @@ void replacePartitionMetadata(
         throws UnsupportedOperationException;
 
     /**
-     * List the identifiers of all partitions that contains the ident in a table.
+     * List the identifiers of all partitions that have the ident prefix in a table.
      *
      * @param ident a prefix of partition identifier
      * @return an array of Identifiers for the partitions
      */
     InternalRow[] listPartitionIdentifiers(InternalRow ident);
+
+    /**
+     * List the identifiers of all partitions that match to the ident by names.
+     *
+     * @param names the names of partition values in the identifier.
+     * @param ident a partition identifier values.
+     * @return an array of Identifiers for the partitions
+     */
+    InternalRow[] listPartitionByNames(String[] names, InternalRow ident);

Review comment:
       > Looks a bit odd that listPartitionByNames interface is added but not used in the Spark internal side.
   
   @HyukjinKwon `listPartitionByNames()` will be used by V2 commands like `SHOW PARTITIONS` (in https://github.com/apache/spark/pull/30398) and `SHOW TABLE EXTENDED`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on pull request #30452:
URL: https://github.com/apache/spark/pull/30452#issuecomment-733657403


   @cloud-fan @HyukjinKwon Can we continue with this PR or do you have some objections?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on pull request #30452:
URL: https://github.com/apache/spark/pull/30452#issuecomment-731592567


   @cloud-fan Could you take a look at this PR, please.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on pull request #30452:
URL: https://github.com/apache/spark/pull/30452#issuecomment-734163062


   I removed `listPartitionIdentifiers()` and renamed `listPartitionsByNames()` to `listPartitionIdentifiers()`: https://github.com/apache/spark/pull/30514


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30452:
URL: https://github.com/apache/spark/pull/30452#issuecomment-731601527


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36079/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30452:
URL: https://github.com/apache/spark/pull/30452#issuecomment-731627543






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #30452:
URL: https://github.com/apache/spark/pull/30452#issuecomment-731936589


   The partition can be a transform like `year(ts_col)`, shall we just partition index in the API instead?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #30452:
URL: https://github.com/apache/spark/pull/30452#discussion_r528598891



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/SupportsPartitionManagementSuite.scala
##########
@@ -140,4 +140,45 @@ class SupportsPartitionManagementSuite extends SparkFunSuite {
     partTable.dropPartition(partIdent1)
     assert(partTable.listPartitionIdentifiers(InternalRow.empty).isEmpty)
   }
+
+  test("listPartitionByNames") {
+    val partCatalog = new InMemoryPartitionTableCatalog
+    partCatalog.initialize("test", CaseInsensitiveStringMap.empty())
+    val table = partCatalog.createTable(
+      ident,
+      new StructType()
+        .add("col0", IntegerType)
+        .add("part0", IntegerType)
+        .add("part1", StringType),
+      Array(LogicalExpressions.identity(ref("part0")), LogicalExpressions.identity(ref("part1"))),
+      util.Collections.emptyMap[String, String])
+    val partTable = table.asInstanceOf[InMemoryPartitionTable]
+
+    Seq(
+      InternalRow(0, "abc"),
+      InternalRow(0, "def"),
+      InternalRow(1, "abc")).foreach { partIdent =>
+      partTable.createPartition(partIdent, new util.HashMap[String, String]())
+    }
+
+    Seq(
+      (Array("part0", "part1"), InternalRow(0, "abc")) -> Set(InternalRow(0, "abc")),
+      (Array("part0"), InternalRow(0)) -> Set(InternalRow(0, "abc"), InternalRow(0, "def")),
+      (Array("part1"), InternalRow("abc")) -> Set(InternalRow(0, "abc"), InternalRow(1, "abc")),
+      (Array.empty[String], InternalRow.empty) ->

Review comment:
       This is a special case which allows to list all partitions.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #30452:
URL: https://github.com/apache/spark/pull/30452#discussion_r528438394



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsPartitionManagement.java
##########
@@ -106,10 +106,19 @@ void replacePartitionMetadata(
         throws UnsupportedOperationException;
 
     /**
-     * List the identifiers of all partitions that contains the ident in a table.
+     * List the identifiers of all partitions that have the ident prefix in a table.
      *
      * @param ident a prefix of partition identifier
      * @return an array of Identifiers for the partitions
      */
     InternalRow[] listPartitionIdentifiers(InternalRow ident);
+
+    /**
+     * List the identifiers of all partitions that match to the ident by names.
+     *
+     * @param names the names of partition values in the identifier.
+     * @param ident a partition identifier values.
+     * @return an array of Identifiers for the partitions
+     */
+    InternalRow[] listPartitionByNames(String[] names, InternalRow ident);

Review comment:
       Is a user supposed to access to `Table` directly? Looks a bit odd that `listPartitionByNames` interface is added but not used in the Spark internal side.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30452:
URL: https://github.com/apache/spark/pull/30452#discussion_r528478390



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsPartitionManagement.java
##########
@@ -106,10 +106,19 @@ void replacePartitionMetadata(
         throws UnsupportedOperationException;
 
     /**
-     * List the identifiers of all partitions that contains the ident in a table.
+     * List the identifiers of all partitions that have the ident prefix in a table.
      *
      * @param ident a prefix of partition identifier
      * @return an array of Identifiers for the partitions
      */
     InternalRow[] listPartitionIdentifiers(InternalRow ident);
+
+    /**
+     * List the identifiers of all partitions that match to the ident by names.
+     *
+     * @param names the names of partition values in the identifier.
+     * @param ident a partition identifier values.
+     * @return an array of Identifiers for the partitions
+     */
+    InternalRow[] listPartitionByNames(String[] names, InternalRow ident);

Review comment:
       Yea we should remove the old API as it's not released yet.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #30452:
URL: https://github.com/apache/spark/pull/30452#discussion_r528489451



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/connector/InMemoryPartitionTable.scala
##########
@@ -96,4 +97,25 @@ class InMemoryPartitionTable(
   override protected def addPartitionKey(key: Seq[Any]): Unit = {
     memoryTablePartitions.put(InternalRow.fromSeq(key), Map.empty[String, String].asJava)
   }
+
+  override def listPartitionByNames(
+      names: Array[String],
+      ident: InternalRow): Array[InternalRow] = {
+    assert(names.length == ident.numFields,
+      s"Number of partition names (${names.length}) must be equal to " +
+      s"the number of partition values (${ident.numFields}).")
+    val schema = partitionSchema
+    assert(names.forall(fieldName => schema.fieldNames.contains(fieldName)),
+      s"Some partition names ${names.mkString("[", ", ", "]")} don't belong to " +
+      s"the partition schema '${schema.sql}'.")
+    val indexes = names.map(schema.fieldIndex)
+    val dataTypes = names.map(schema(_).dataType)

Review comment:
       The names should be normalized after https://github.com/apache/spark/pull/30454, so, we shouldn't care of case sensitivity here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #30452:
URL: https://github.com/apache/spark/pull/30452#issuecomment-734021147


   > The partition can be a transform like year(ts_col), shall we just partition index in the API instead?
   
   If I remember correctly, there should be a schema exposed by the table that describes these. We should get the name from that schema.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] stczwd commented on a change in pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
stczwd commented on a change in pull request #30452:
URL: https://github.com/apache/spark/pull/30452#discussion_r530720264



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsPartitionManagement.java
##########
@@ -106,10 +106,19 @@ void replacePartitionMetadata(
         throws UnsupportedOperationException;
 
     /**
-     * List the identifiers of all partitions that contains the ident in a table.
+     * List the identifiers of all partitions that have the ident prefix in a table.
      *
      * @param ident a prefix of partition identifier
      * @return an array of Identifiers for the partitions
      */
     InternalRow[] listPartitionIdentifiers(InternalRow ident);
+
+    /**
+     * List the identifiers of all partitions that match to the ident by names.
+     *
+     * @param names the names of partition values in the identifier.
+     * @param ident a partition identifier values.
+     * @return an array of Identifiers for the partitions
+     */
+    InternalRow[] listPartitionByNames(String[] names, InternalRow ident);

Review comment:
       > I think the final API name should still be `listPartitionIdentifiers`
   
   Agree with this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on pull request #30452:
URL: https://github.com/apache/spark/pull/30452#issuecomment-731946693


   As the changes are related to `listPartitionIdentifiers()` added by https://github.com/apache/spark/pull/28617. @stczwd @rdblue @RussellSpitzer @emkornfield @dongjoon-hyun May I ask you to review this PR.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30452:
URL: https://github.com/apache/spark/pull/30452#issuecomment-731627543






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30452:
URL: https://github.com/apache/spark/pull/30452#discussion_r528653936



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsPartitionManagement.java
##########
@@ -106,10 +106,19 @@ void replacePartitionMetadata(
         throws UnsupportedOperationException;
 
     /**
-     * List the identifiers of all partitions that contains the ident in a table.
+     * List the identifiers of all partitions that have the ident prefix in a table.
      *
      * @param ident a prefix of partition identifier
      * @return an array of Identifiers for the partitions
      */
     InternalRow[] listPartitionIdentifiers(InternalRow ident);
+
+    /**
+     * List the identifiers of all partitions that match to the ident by names.
+     *
+     * @param names the names of partition values in the identifier.
+     * @param ident a partition identifier values.
+     * @return an array of Identifiers for the partitions
+     */
+    InternalRow[] listPartitionByNames(String[] names, InternalRow ident);

Review comment:
       I think the final API name should still be `listPartitionIdentifiers`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #30452:
URL: https://github.com/apache/spark/pull/30452#discussion_r528568106



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsPartitionManagement.java
##########
@@ -106,10 +106,19 @@ void replacePartitionMetadata(
         throws UnsupportedOperationException;
 
     /**
-     * List the identifiers of all partitions that contains the ident in a table.
+     * List the identifiers of all partitions that have the ident prefix in a table.
      *
      * @param ident a prefix of partition identifier
      * @return an array of Identifiers for the partitions
      */
     InternalRow[] listPartitionIdentifiers(InternalRow ident);
+
+    /**
+     * List the identifiers of all partitions that match to the ident by names.
+     *
+     * @param names the names of partition values in the identifier.
+     * @param ident a partition identifier values.
+     * @return an array of Identifiers for the partitions
+     */
+    InternalRow[] listPartitionByNames(String[] names, InternalRow ident);

Review comment:
       The method `listPartitionIdentifiers()` is used only in `partitionExists()` and in tests. Let me remove it and replace its usage by `listPartitionByNames()`.
   
   How about renaming `listPartitionByNames()` to just `listPartitions()`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #30452:
URL: https://github.com/apache/spark/pull/30452#issuecomment-732118721


   Since `SupportsPartitionManagement` already have the API `partitionSchema`, which means that the implementations will pick a name for partition transforms, I think it's OK to use `String[]` in the `listPartitionIdentifiers` API parameter.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30452: [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30452:
URL: https://github.com/apache/spark/pull/30452#issuecomment-731597113


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36079/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org