You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/11/09 18:43:09 UTC

[GitHub] [iceberg] aokolnychyi opened a new pull request, #6163: Core: Method for building common partition type

aokolnychyi opened a new pull request, #6163:
URL: https://github.com/apache/iceberg/pull/6163

   This PR adds a method to `Partitioning` to build an intersection of all partition types. This intersection can be used as a clustering key for storage-partitioned joins.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1018492377


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,68 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds an intersection of all partition types in a table.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>Whenever a table has multiple specs, the common partition type is a struct containing only
+   * fields that are present in every spec of the table. In other words, the struct fields represent
+   * an intersection of all partition types.

Review Comment:
   Looking below the behavior is 
   
   Common(2 specs with something in common) Struct of common fields
   Common( two specs with nothing in common ) Exception
   Common( 1 spec) " Return passed in Spec
   Common(0 specs) return empty spec
   
   I think we probably should just return the empty spec for the validation failure mode



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1018475572


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,68 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds an intersection of all partition types in a table.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>Whenever a table has multiple specs, the common partition type is a struct containing only
+   * fields that are present in every spec of the table. In other words, the struct fields represent
+   * an intersection of all partition types.
    *
-   * @param table a table with one or many specs
+   * @param specs one or many specs
    * @return the constructed common partition type
    */
+  public static StructType commonPartitionType(Collection<PartitionSpec> specs) {
+    return buildPartitionType(specs, true /* only common fields */);
+  }
+
+  /**
+   * Builds a unified partition type considering all specs in a table.
+   *
+   * <p>Whenever a table has multiple specs, the partition type is a struct containing all fields
+   * that have ever been a part of any spec in the table. In other words, the struct fields
+   * represent a union of all known partition fields.
+   *
+   * @param table a table with one or many specs
+   * @return the constructed unified partition type
+   */
   public static StructType partitionType(Table table) {
+    return buildPartitionType(table.specs().values(), false /* only common fields */);

Review Comment:
   Simply indicating what that parameter means. It is hard to make sense of true/false without such comments.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1019738831


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,75 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds a grouping key type considering all provided specs.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>A grouping key defines how data is split between files and consists of partition fields with
+   * non-void transforms that are present in each provided spec. Iceberg guarantees that records
+   * with different values for the grouping key are disjoint and are stored in separate files.
+   *
+   * <p>If there is only one spec, the grouping key will include all partition fields with non-void
+   * transforms from that spec. Whenever there are multiple specs, the grouping key will represent
+   * an intersection of all partition fields with non-void transforms. If a partition field is
+   * present only in a subset of specs, Iceberg cannot guarantee data distribution on that field.
+   * That's why it will not be part of the grouping key. Unpartitioned tables or tables with
+   * non-overlapping specs have empty grouping keys.
+   *
+   * <p>When partition fields are dropped in v1 tables, they are replaced with new partition fields
+   * that have the same field ID but use a void transform under the hood. Such fields cannot be part
+   * of the grouping key as void transforms always return null.
+   *
+   * @param specs one or many specs
+   * @return the constructed grouping key type
+   */
+  public static StructType groupingKeyType(Collection<PartitionSpec> specs) {

Review Comment:
   @sunchao, we should be able to check if this type is empty to decide if can report a distribution to Spark.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1018466921


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,68 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds an intersection of all partition types in a table.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>Whenever a table has multiple specs, the common partition type is a struct containing only
+   * fields that are present in every spec of the table. In other words, the struct fields represent
+   * an intersection of all partition types.

Review Comment:
   I think it may help to also note that if the table contains a single spec it will just be returned.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1019734099


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,68 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds an intersection of all partition types in a table.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>Whenever a table has multiple specs, the common partition type is a struct containing only
+   * fields that are present in every spec of the table. In other words, the struct fields represent
+   * an intersection of all partition types.
    *
-   * @param table a table with one or many specs
+   * @param specs one or many specs
    * @return the constructed common partition type

Review Comment:
   I've changed the naming and logic, documented the behavior.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1018477109


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,68 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds an intersection of all partition types in a table.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>Whenever a table has multiple specs, the common partition type is a struct containing only
+   * fields that are present in every spec of the table. In other words, the struct fields represent
+   * an intersection of all partition types.

Review Comment:
   I think either are valid ... but I think we tend in utility methods to rely on the caller to decided whether or not output of the method is valid



##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,68 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds an intersection of all partition types in a table.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>Whenever a table has multiple specs, the common partition type is a struct containing only
+   * fields that are present in every spec of the table. In other words, the struct fields represent
+   * an intersection of all partition types.

Review Comment:
   I think either are ok ... but I think we tend in utility methods to rely on the caller to decided whether or not output of the method is valid



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1018476321


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,68 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds an intersection of all partition types in a table.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>Whenever a table has multiple specs, the common partition type is a struct containing only
+   * fields that are present in every spec of the table. In other words, the struct fields represent
+   * an intersection of all partition types.

Review Comment:
   I throw an exception if there is no common partition type for now. Let me think on returning an empty struct.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1019584432


##########
core/src/test/java/org/apache/iceberg/TestPartitioning.java:
##########
@@ -43,6 +43,12 @@ public class TestPartitioning {
           required(1, "id", Types.IntegerType.get()),
           required(2, "data", Types.StringType.get()),
           required(3, "category", Types.StringType.get()));
+  private static final PartitionSpec BY_DATA_SPEC =
+      PartitionSpec.builderFor(SCHEMA).identity("data").build();
+  private static final PartitionSpec BY_CATEGORY_DATA_SPEC =
+      PartitionSpec.builderFor(SCHEMA).identity("category").identity("data").build();
+  private static final PartitionSpec BY_DATA_CATEGORY_BUCKET_SPEC =
+      PartitionSpec.builderFor(SCHEMA).identity("data").bucket("category", 8).build();
 

Review Comment:
   Tests all look good to me, I think you covered all the cases we were discussing yesterday. One nit is that we could probably parameterize this test suite on format version, it looks like we expect the same output and run the same code so we could shorten this file a bit with the parameterization.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] sunchao commented on a diff in pull request #6163: Core: Method for building grouping key type

Posted by GitBox <gi...@apache.org>.

sunchao commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1023031018


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,75 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds a grouping key type considering all provided specs.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>A grouping key defines how data is split between files and consists of partition fields with
+   * non-void transforms that are present in each provided spec. Iceberg guarantees that records
+   * with different values for the grouping key are disjoint and are stored in separate files.
+   *
+   * <p>If there is only one spec, the grouping key will include all partition fields with non-void
+   * transforms from that spec. Whenever there are multiple specs, the grouping key will represent
+   * an intersection of all partition fields with non-void transforms. If a partition field is
+   * present only in a subset of specs, Iceberg cannot guarantee data distribution on that field.
+   * That's why it will not be part of the grouping key. Unpartitioned tables or tables with
+   * non-overlapping specs have empty grouping keys.
+   *
+   * <p>When partition fields are dropped in v1 tables, they are replaced with new partition fields
+   * that have the same field ID but use a void transform under the hood. Such fields cannot be part
+   * of the grouping key as void transforms always return null.
+   *
+   * @param specs one or many specs
+   * @return the constructed grouping key type
+   */
+  public static StructType groupingKeyType(Collection<PartitionSpec> specs) {

Review Comment:
   👍 thanks, make sense
   
   why we don't use `Table` as parameter but instead `Collection<PartitionSpec>` here?



##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -298,4 +331,37 @@ private static boolean compatibleTransforms(Transform<?, ?> t1, Transform<?, ?>
         || t1.equals(Transforms.alwaysNull())
         || t2.equals(Transforms.alwaysNull());
   }
+
+  // collects IDs of all partition field used across specs
+  private static Set<Integer> allFieldIds(Collection<PartitionSpec> specs) {
+    return FluentIterable.from(specs)
+        .transformAndConcat(PartitionSpec::fields)
+        .transform(PartitionField::fieldId)
+        .toSet();
+  }
+
+  // collects IDs of partition fields with non-void transforms that are present in each spec
+  private static Set<Integer> commonActiveFieldIds(Collection<PartitionSpec> specs) {
+    Set<Integer> commonActiveFieldIds = Sets.newHashSet();
+
+    int specIndex = 0;
+    for (PartitionSpec spec : specs) {
+      if (specIndex == 0) {
+        commonActiveFieldIds.addAll(activeFieldIds(spec));
+      } else {
+        commonActiveFieldIds.retainAll(activeFieldIds(spec));

Review Comment:
   curious: will the same field have different field IDs across different specs? I'm reading this sentence from Iceberg specification:
   
   > A partition field id that is used to identify a partition field and is unique within a partition spec. In v2 table metadata, it is unique across all partition specs



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1018295880


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,68 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds an intersection of all partition types in a table.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>Whenever a table has multiple specs, the common partition type is a struct containing only
+   * fields that are present in every spec of the table. In other words, the struct fields represent
+   * an intersection of all partition types.
    *
-   * @param table a table with one or many specs
+   * @param specs one or many specs
    * @return the constructed common partition type
    */
+  public static StructType commonPartitionType(Collection<PartitionSpec> specs) {

Review Comment:
   I need to think of a better name here. Ideas are welcome.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1018485333


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,68 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds an intersection of all partition types in a table.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>Whenever a table has multiple specs, the common partition type is a struct containing only
+   * fields that are present in every spec of the table. In other words, the struct fields represent
+   * an intersection of all partition types.
    *
-   * @param table a table with one or many specs
+   * @param specs one or many specs
    * @return the constructed common partition type

Review Comment:
   If I remember correctly, void transforms are only used in v1 tables to replace existing fields since we can't drop them. We want to use this as a clustering key in storage-partitioned joins, meaning void transforms can't be returned.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1018486198


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,68 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds an intersection of all partition types in a table.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>Whenever a table has multiple specs, the common partition type is a struct containing only
+   * fields that are present in every spec of the table. In other words, the struct fields represent
+   * an intersection of all partition types.

Review Comment:
   I feel we either need one more method to check if a common type exists or return an empty struct from this one.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6163: Core: Method for building grouping key type

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1023081485


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,75 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds a grouping key type considering all provided specs.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>A grouping key defines how data is split between files and consists of partition fields with
+   * non-void transforms that are present in each provided spec. Iceberg guarantees that records
+   * with different values for the grouping key are disjoint and are stored in separate files.
+   *
+   * <p>If there is only one spec, the grouping key will include all partition fields with non-void
+   * transforms from that spec. Whenever there are multiple specs, the grouping key will represent
+   * an intersection of all partition fields with non-void transforms. If a partition field is
+   * present only in a subset of specs, Iceberg cannot guarantee data distribution on that field.
+   * That's why it will not be part of the grouping key. Unpartitioned tables or tables with
+   * non-overlapping specs have empty grouping keys.
+   *
+   * <p>When partition fields are dropped in v1 tables, they are replaced with new partition fields
+   * that have the same field ID but use a void transform under the hood. Such fields cannot be part
+   * of the grouping key as void transforms always return null.
+   *
+   * @param specs one or many specs
+   * @return the constructed grouping key type
+   */
+  public static StructType groupingKeyType(Collection<PartitionSpec> specs) {

Review Comment:
   @sunchao, the idea is that a scan cover a subset of files, which may mean we will only query particular specs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6163: Core: Method for building grouping key type

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1023102332


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -298,4 +331,37 @@ private static boolean compatibleTransforms(Transform<?, ?> t1, Transform<?, ?>
         || t1.equals(Transforms.alwaysNull())
         || t2.equals(Transforms.alwaysNull());
   }
+
+  // collects IDs of all partition field used across specs
+  private static Set<Integer> allFieldIds(Collection<PartitionSpec> specs) {
+    return FluentIterable.from(specs)
+        .transformAndConcat(PartitionSpec::fields)
+        .transform(PartitionField::fieldId)
+        .toSet();
+  }
+
+  // collects IDs of partition fields with non-void transforms that are present in each spec
+  private static Set<Integer> commonActiveFieldIds(Collection<PartitionSpec> specs) {
+    Set<Integer> commonActiveFieldIds = Sets.newHashSet();
+
+    int specIndex = 0;
+    for (PartitionSpec spec : specs) {
+      if (specIndex == 0) {
+        commonActiveFieldIds.addAll(activeFieldIds(spec));
+      } else {
+        commonActiveFieldIds.retainAll(activeFieldIds(spec));

Review Comment:
   There is a very very limited chance the ID is not unique in v1 tables but that should be covered by a check above. I have a test that checks we throw an exception in that case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1018464405


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,68 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds an intersection of all partition types in a table.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>Whenever a table has multiple specs, the common partition type is a struct containing only
+   * fields that are present in every spec of the table. In other words, the struct fields represent
+   * an intersection of all partition types.
    *
-   * @param table a table with one or many specs
+   * @param specs one or many specs
    * @return the constructed common partition type
    */
+  public static StructType commonPartitionType(Collection<PartitionSpec> specs) {
+    return buildPartitionType(specs, true /* only common fields */);
+  }
+
+  /**
+   * Builds a unified partition type considering all specs in a table.
+   *
+   * <p>Whenever a table has multiple specs, the partition type is a struct containing all fields
+   * that have ever been a part of any spec in the table. In other words, the struct fields
+   * represent a union of all known partition fields.
+   *
+   * @param table a table with one or many specs
+   * @return the constructed unified partition type
+   */
   public static StructType partitionType(Table table) {
+    return buildPartitionType(table.specs().values(), false /* only common fields */);

Review Comment:
   The comment here is a little confusing, don't we mean `false /* keeps all fields */`, or are you commenting on the parameter ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1018488935


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -298,4 +324,33 @@ private static boolean compatibleTransforms(Transform<?, ?> t1, Transform<?, ?>
         || t1.equals(Transforms.alwaysNull())
         || t2.equals(Transforms.alwaysNull());
   }
+
+  private static Set<Integer> commonPartitionFieldIds(Collection<PartitionSpec> specs) {
+    Set<Integer> commonFieldIds = Sets.newHashSet();
+
+    int specIndex = 0;
+    for (PartitionSpec spec : specs) {
+      if (specIndex == 0) {
+        commonFieldIds.addAll(activeFieldIds(spec));
+      } else {
+        commonFieldIds.retainAll(activeFieldIds(spec));
+      }
+
+      specIndex++;
+    }
+
+    ValidationException.check(
+        commonFieldIds.size() > 0 || specs.size() == 1,
+        "Specs do not have any common fields: %s",

Review Comment:
   Unpartitioned tables don't have a spec right? Behavior there is to return the empty struct?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1019736124


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,75 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds a grouping key type considering all provided specs.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>A grouping key defines how data is split between files and consists of partition fields with
+   * non-void transforms that are present in each provided spec. Iceberg guarantees that records
+   * with different values for the grouping key are disjoint and are stored in separate files.
+   *
+   * <p>If there is only one spec, the grouping key will include all partition fields with non-void
+   * transforms from that spec. Whenever there are multiple specs, the grouping key will represent
+   * an intersection of all partition fields with non-void transforms. If a partition field is
+   * present only in a subset of specs, Iceberg cannot guarantee data distribution on that field.
+   * That's why it will not be part of the grouping key. Unpartitioned tables or tables with
+   * non-overlapping specs have empty grouping keys.
+   *
+   * <p>When partition fields are dropped in v1 tables, they are replaced with new partition fields
+   * that have the same field ID but use a void transform under the hood. Such fields cannot be part
+   * of the grouping key as void transforms always return null.
+   *
+   * @param specs one or many specs
+   * @return the constructed grouping key type
+   */
+  public static StructType groupingKeyType(Collection<PartitionSpec> specs) {

Review Comment:
   I decided to use a name that actually represents what we will use this for. Spark has `KeyGroupedPartitioning` so I used a similar name. It seems to better represent what this is instead of `commonPartitionType`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1019734173


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,68 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds an intersection of all partition types in a table.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>Whenever a table has multiple specs, the common partition type is a struct containing only
+   * fields that are present in every spec of the table. In other words, the struct fields represent
+   * an intersection of all partition types.
    *
-   * @param table a table with one or many specs
+   * @param specs one or many specs
    * @return the constructed common partition type
    */
+  public static StructType commonPartitionType(Collection<PartitionSpec> specs) {
+    return buildPartitionType(specs, true /* only common fields */);
+  }
+
+  /**
+   * Builds a unified partition type considering all specs in a table.
+   *
+   * <p>Whenever a table has multiple specs, the partition type is a struct containing all fields
+   * that have ever been a part of any spec in the table. In other words, the struct fields
+   * represent a union of all known partition fields.
+   *
+   * @param table a table with one or many specs
+   * @return the constructed unified partition type
+   */
   public static StructType partitionType(Table table) {
+    return buildPartitionType(table.specs().values(), false /* only common fields */);
+  }
+
+  private static StructType buildPartitionType(
+      Collection<PartitionSpec> specs, boolean onlyCommonFields) {
+
     // we currently don't know the output type of unknown transforms
-    List<Transform<?, ?>> unknownTransforms = collectUnknownTransforms(table);
+    List<Transform<?, ?>> unknownTransforms = collectUnknownTransforms(specs);
     ValidationException.check(
         unknownTransforms.isEmpty(),
-        "Cannot build table partition type, unknown transforms: %s",
+        "Cannot build partition type, unknown transforms: %s",
         unknownTransforms);
 
-    if (table.specs().size() == 1) {
-      return table.spec().partitionType();
+    if (specs.size() == 1) {
+      PartitionSpec spec = Iterables.getOnlyElement(specs);
+      return spec.partitionType();
     }
 
     Map<Integer, PartitionField> fieldMap = Maps.newHashMap();
     Map<Integer, Type> typeMap = Maps.newHashMap();
     Map<Integer, String> nameMap = Maps.newHashMap();
 
-    // sort the spec IDs in descending order to pick up the most recent field names
-    List<Integer> specIds =
-        table.specs().keySet().stream()
-            .sorted(Collections.reverseOrder())
+    // sort specs by ID in descending order to pick up the most recent field names
+    List<PartitionSpec> sortedSpecs =
+        specs.stream()
+            .sorted(Comparator.comparingLong(PartitionSpec::specId).reversed())
             .collect(Collectors.toList());
 
-    for (Integer specId : specIds) {
-      PartitionSpec spec = table.specs().get(specId);
+    Set<Integer> projectedFieldIds = onlyCommonFields ? commonPartitionFieldIds(specs) : null;

Review Comment:
   Changed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1019733840


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,68 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds an intersection of all partition types in a table.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>Whenever a table has multiple specs, the common partition type is a struct containing only
+   * fields that are present in every spec of the table. In other words, the struct fields represent
+   * an intersection of all partition types.

Review Comment:
   Updated.



##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,68 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds an intersection of all partition types in a table.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>Whenever a table has multiple specs, the common partition type is a struct containing only
+   * fields that are present in every spec of the table. In other words, the struct fields represent
+   * an intersection of all partition types.
    *
-   * @param table a table with one or many specs
+   * @param specs one or many specs
    * @return the constructed common partition type
    */
+  public static StructType commonPartitionType(Collection<PartitionSpec> specs) {
+    return buildPartitionType(specs, true /* only common fields */);
+  }
+
+  /**
+   * Builds a unified partition type considering all specs in a table.
+   *
+   * <p>Whenever a table has multiple specs, the partition type is a struct containing all fields
+   * that have ever been a part of any spec in the table. In other words, the struct fields
+   * represent a union of all known partition fields.
+   *
+   * @param table a table with one or many specs
+   * @return the constructed unified partition type
+   */
   public static StructType partitionType(Table table) {
+    return buildPartitionType(table.specs().values(), false /* only common fields */);

Review Comment:
   No longer applies.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1018479165


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,68 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds an intersection of all partition types in a table.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>Whenever a table has multiple specs, the common partition type is a struct containing only
+   * fields that are present in every spec of the table. In other words, the struct fields represent
+   * an intersection of all partition types.
    *
-   * @param table a table with one or many specs
+   * @param specs one or many specs
    * @return the constructed common partition type

Review Comment:
   Another thought here, do we need to return common alwaysNull transforms? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1018487624


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,68 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds an intersection of all partition types in a table.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>Whenever a table has multiple specs, the common partition type is a struct containing only
+   * fields that are present in every spec of the table. In other words, the struct fields represent
+   * an intersection of all partition types.
    *
-   * @param table a table with one or many specs
+   * @param specs one or many specs
    * @return the constructed common partition type
    */
+  public static StructType commonPartitionType(Collection<PartitionSpec> specs) {
+    return buildPartitionType(specs, true /* only common fields */);
+  }
+
+  /**
+   * Builds a unified partition type considering all specs in a table.
+   *
+   * <p>Whenever a table has multiple specs, the partition type is a struct containing all fields
+   * that have ever been a part of any spec in the table. In other words, the struct fields
+   * represent a union of all known partition fields.
+   *
+   * @param table a table with one or many specs
+   * @return the constructed unified partition type
+   */
   public static StructType partitionType(Table table) {
+    return buildPartitionType(table.specs().values(), false /* only common fields */);
+  }
+
+  private static StructType buildPartitionType(
+      Collection<PartitionSpec> specs, boolean onlyCommonFields) {
+
     // we currently don't know the output type of unknown transforms
-    List<Transform<?, ?>> unknownTransforms = collectUnknownTransforms(table);
+    List<Transform<?, ?>> unknownTransforms = collectUnknownTransforms(specs);
     ValidationException.check(
         unknownTransforms.isEmpty(),
-        "Cannot build table partition type, unknown transforms: %s",
+        "Cannot build partition type, unknown transforms: %s",
         unknownTransforms);
 
-    if (table.specs().size() == 1) {
-      return table.spec().partitionType();
+    if (specs.size() == 1) {
+      PartitionSpec spec = Iterables.getOnlyElement(specs);
+      return spec.partitionType();
     }
 
     Map<Integer, PartitionField> fieldMap = Maps.newHashMap();
     Map<Integer, Type> typeMap = Maps.newHashMap();
     Map<Integer, String> nameMap = Maps.newHashMap();
 
-    // sort the spec IDs in descending order to pick up the most recent field names
-    List<Integer> specIds =
-        table.specs().keySet().stream()
-            .sorted(Collections.reverseOrder())
+    // sort specs by ID in descending order to pick up the most recent field names
+    List<PartitionSpec> sortedSpecs =
+        specs.stream()
+            .sorted(Comparator.comparingLong(PartitionSpec::specId).reversed())
             .collect(Collectors.toList());
 
-    for (Integer specId : specIds) {
-      PartitionSpec spec = table.specs().get(specId);
+    Set<Integer> projectedFieldIds = onlyCommonFields ? commonPartitionFieldIds(specs) : null;

Review Comment:
   I can do that. I guess I was lazy to write code to compute all field IDs :) I agree it would be easier than a null-aware check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1018485570


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,68 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds an intersection of all partition types in a table.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>Whenever a table has multiple specs, the common partition type is a struct containing only
+   * fields that are present in every spec of the table. In other words, the struct fields represent
+   * an intersection of all partition types.
    *
-   * @param table a table with one or many specs
+   * @param specs one or many specs
    * @return the constructed common partition type
    */
+  public static StructType commonPartitionType(Collection<PartitionSpec> specs) {
+    return buildPartitionType(specs, true /* only common fields */);
+  }
+
+  /**
+   * Builds a unified partition type considering all specs in a table.
+   *
+   * <p>Whenever a table has multiple specs, the partition type is a struct containing all fields
+   * that have ever been a part of any spec in the table. In other words, the struct fields
+   * represent a union of all known partition fields.
+   *
+   * @param table a table with one or many specs
+   * @return the constructed unified partition type
+   */
   public static StructType partitionType(Table table) {
+    return buildPartitionType(table.specs().values(), false /* only common fields */);
+  }
+
+  private static StructType buildPartitionType(
+      Collection<PartitionSpec> specs, boolean onlyCommonFields) {
+
     // we currently don't know the output type of unknown transforms
-    List<Transform<?, ?>> unknownTransforms = collectUnknownTransforms(table);
+    List<Transform<?, ?>> unknownTransforms = collectUnknownTransforms(specs);
     ValidationException.check(
         unknownTransforms.isEmpty(),
-        "Cannot build table partition type, unknown transforms: %s",
+        "Cannot build partition type, unknown transforms: %s",
         unknownTransforms);
 
-    if (table.specs().size() == 1) {
-      return table.spec().partitionType();
+    if (specs.size() == 1) {
+      PartitionSpec spec = Iterables.getOnlyElement(specs);
+      return spec.partitionType();
     }
 
     Map<Integer, PartitionField> fieldMap = Maps.newHashMap();
     Map<Integer, Type> typeMap = Maps.newHashMap();
     Map<Integer, String> nameMap = Maps.newHashMap();
 
-    // sort the spec IDs in descending order to pick up the most recent field names
-    List<Integer> specIds =
-        table.specs().keySet().stream()
-            .sorted(Collections.reverseOrder())
+    // sort specs by ID in descending order to pick up the most recent field names
+    List<PartitionSpec> sortedSpecs =
+        specs.stream()
+            .sorted(Comparator.comparingLong(PartitionSpec::specId).reversed())
             .collect(Collectors.toList());
 
-    for (Integer specId : specIds) {
-      PartitionSpec spec = table.specs().get(specId);
+    Set<Integer> projectedFieldIds = onlyCommonFields ? commonPartitionFieldIds(specs) : null;

Review Comment:
   This is a set of fields to keep IFF onlyCommonFields is true? 
   
   The logic may be easier for me to understand if it was just
   
   ```
   onlyCommonFields ? commonPartitionFieldIds() : AllFieldIDs()
   ```
   
   Rather than using "null" as a special condition to skip doing the contains check . That may just be me though
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1018491119


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,68 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds an intersection of all partition types in a table.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>Whenever a table has multiple specs, the common partition type is a struct containing only
+   * fields that are present in every spec of the table. In other words, the struct fields represent
+   * an intersection of all partition types.
    *
-   * @param table a table with one or many specs
+   * @param specs one or many specs
    * @return the constructed common partition type

Review Comment:
   I think that's fine, just needs to be doc'd



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1019734281


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -298,4 +324,33 @@ private static boolean compatibleTransforms(Transform<?, ?> t1, Transform<?, ?>
         || t1.equals(Transforms.alwaysNull())
         || t2.equals(Transforms.alwaysNull());
   }
+
+  private static Set<Integer> commonPartitionFieldIds(Collection<PartitionSpec> specs) {
+    Set<Integer> commonFieldIds = Sets.newHashSet();
+
+    int specIndex = 0;
+    for (PartitionSpec spec : specs) {
+      if (specIndex == 0) {
+        commonFieldIds.addAll(activeFieldIds(spec));
+      } else {
+        commonFieldIds.retainAll(activeFieldIds(spec));
+      }
+
+      specIndex++;
+    }
+
+    ValidationException.check(
+        commonFieldIds.size() > 0 || specs.size() == 1,
+        "Specs do not have any common fields: %s",
+        specs);
+
+    return commonFieldIds;
+  }
+
+  private static List<Integer> activeFieldIds(PartitionSpec spec) {
+    return spec.fields().stream()

Review Comment:
   Yep, correct.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1019736124


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,75 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds a grouping key type considering all provided specs.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>A grouping key defines how data is split between files and consists of partition fields with
+   * non-void transforms that are present in each provided spec. Iceberg guarantees that records
+   * with different values for the grouping key are disjoint and are stored in separate files.
+   *
+   * <p>If there is only one spec, the grouping key will include all partition fields with non-void
+   * transforms from that spec. Whenever there are multiple specs, the grouping key will represent
+   * an intersection of all partition fields with non-void transforms. If a partition field is
+   * present only in a subset of specs, Iceberg cannot guarantee data distribution on that field.
+   * That's why it will not be part of the grouping key. Unpartitioned tables or tables with
+   * non-overlapping specs have empty grouping keys.
+   *
+   * <p>When partition fields are dropped in v1 tables, they are replaced with new partition fields
+   * that have the same field ID but use a void transform under the hood. Such fields cannot be part
+   * of the grouping key as void transforms always return null.
+   *
+   * @param specs one or many specs
+   * @return the constructed grouping key type
+   */
+  public static StructType groupingKeyType(Collection<PartitionSpec> specs) {

Review Comment:
   I decided to use a name that actually represents what we will use this for. Spark has `KeyGroupedPartitioning`. That makes sense, so I used a similar name. It seems to better represent what this is instead of `commonPartitionType`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1019735287


##########
core/src/test/java/org/apache/iceberg/TestPartitioning.java:
##########
@@ -43,6 +43,12 @@ public class TestPartitioning {
           required(1, "id", Types.IntegerType.get()),
           required(2, "data", Types.StringType.get()),
           required(3, "category", Types.StringType.get()));
+  private static final PartitionSpec BY_DATA_SPEC =
+      PartitionSpec.builderFor(SCHEMA).identity("data").build();
+  private static final PartitionSpec BY_CATEGORY_DATA_SPEC =
+      PartitionSpec.builderFor(SCHEMA).identity("category").identity("data").build();
+  private static final PartitionSpec BY_DATA_CATEGORY_BUCKET_SPEC =
+      PartitionSpec.builderFor(SCHEMA).identity("data").bucket("category", 8).build();
 

Review Comment:
   There are some slight differences in how other tests behave for different format versions. We can cover that with assumes and v1 and v2 checks. Since it is an existing class, though, I'd consider refactoring in a separate PR.



##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -298,4 +324,33 @@ private static boolean compatibleTransforms(Transform<?, ?> t1, Transform<?, ?>
         || t1.equals(Transforms.alwaysNull())
         || t2.equals(Transforms.alwaysNull());
   }
+
+  private static Set<Integer> commonPartitionFieldIds(Collection<PartitionSpec> specs) {
+    Set<Integer> commonFieldIds = Sets.newHashSet();
+
+    int specIndex = 0;
+    for (PartitionSpec spec : specs) {
+      if (specIndex == 0) {
+        commonFieldIds.addAll(activeFieldIds(spec));
+      } else {
+        commonFieldIds.retainAll(activeFieldIds(spec));
+      }
+
+      specIndex++;
+    }
+
+    ValidationException.check(
+        commonFieldIds.size() > 0 || specs.size() == 1,
+        "Specs do not have any common fields: %s",

Review Comment:
   Correct.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1018487624


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,68 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds an intersection of all partition types in a table.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>Whenever a table has multiple specs, the common partition type is a struct containing only
+   * fields that are present in every spec of the table. In other words, the struct fields represent
+   * an intersection of all partition types.
    *
-   * @param table a table with one or many specs
+   * @param specs one or many specs
    * @return the constructed common partition type
    */
+  public static StructType commonPartitionType(Collection<PartitionSpec> specs) {
+    return buildPartitionType(specs, true /* only common fields */);
+  }
+
+  /**
+   * Builds a unified partition type considering all specs in a table.
+   *
+   * <p>Whenever a table has multiple specs, the partition type is a struct containing all fields
+   * that have ever been a part of any spec in the table. In other words, the struct fields
+   * represent a union of all known partition fields.
+   *
+   * @param table a table with one or many specs
+   * @return the constructed unified partition type
+   */
   public static StructType partitionType(Table table) {
+    return buildPartitionType(table.specs().values(), false /* only common fields */);
+  }
+
+  private static StructType buildPartitionType(
+      Collection<PartitionSpec> specs, boolean onlyCommonFields) {
+
     // we currently don't know the output type of unknown transforms
-    List<Transform<?, ?>> unknownTransforms = collectUnknownTransforms(table);
+    List<Transform<?, ?>> unknownTransforms = collectUnknownTransforms(specs);
     ValidationException.check(
         unknownTransforms.isEmpty(),
-        "Cannot build table partition type, unknown transforms: %s",
+        "Cannot build partition type, unknown transforms: %s",
         unknownTransforms);
 
-    if (table.specs().size() == 1) {
-      return table.spec().partitionType();
+    if (specs.size() == 1) {
+      PartitionSpec spec = Iterables.getOnlyElement(specs);
+      return spec.partitionType();
     }
 
     Map<Integer, PartitionField> fieldMap = Maps.newHashMap();
     Map<Integer, Type> typeMap = Maps.newHashMap();
     Map<Integer, String> nameMap = Maps.newHashMap();
 
-    // sort the spec IDs in descending order to pick up the most recent field names
-    List<Integer> specIds =
-        table.specs().keySet().stream()
-            .sorted(Collections.reverseOrder())
+    // sort specs by ID in descending order to pick up the most recent field names
+    List<PartitionSpec> sortedSpecs =
+        specs.stream()
+            .sorted(Comparator.comparingLong(PartitionSpec::specId).reversed())
             .collect(Collectors.toList());
 
-    for (Integer specId : specIds) {
-      PartitionSpec spec = table.specs().get(specId);
+    Set<Integer> projectedFieldIds = onlyCommonFields ? commonPartitionFieldIds(specs) : null;

Review Comment:
   I can do that. I guess I was lazy to write code to compute all field IDs :) I agree it would be easier to follow than a null-aware check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1018499080


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,68 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds an intersection of all partition types in a table.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>Whenever a table has multiple specs, the common partition type is a struct containing only
+   * fields that are present in every spec of the table. In other words, the struct fields represent
+   * an intersection of all partition types.
    *
-   * @param table a table with one or many specs
+   * @param specs one or many specs
    * @return the constructed common partition type
    */
+  public static StructType commonPartitionType(Collection<PartitionSpec> specs) {
+    return buildPartitionType(specs, true /* only common fields */);
+  }
+
+  /**
+   * Builds a unified partition type considering all specs in a table.
+   *
+   * <p>Whenever a table has multiple specs, the partition type is a struct containing all fields
+   * that have ever been a part of any spec in the table. In other words, the struct fields
+   * represent a union of all known partition fields.
+   *
+   * @param table a table with one or many specs
+   * @return the constructed unified partition type
+   */
   public static StructType partitionType(Table table) {
+    return buildPartitionType(table.specs().values(), false /* only common fields */);
+  }
+
+  private static StructType buildPartitionType(
+      Collection<PartitionSpec> specs, boolean onlyCommonFields) {
+
     // we currently don't know the output type of unknown transforms
-    List<Transform<?, ?>> unknownTransforms = collectUnknownTransforms(table);
+    List<Transform<?, ?>> unknownTransforms = collectUnknownTransforms(specs);
     ValidationException.check(
         unknownTransforms.isEmpty(),
-        "Cannot build table partition type, unknown transforms: %s",
+        "Cannot build partition type, unknown transforms: %s",
         unknownTransforms);
 
-    if (table.specs().size() == 1) {
-      return table.spec().partitionType();
+    if (specs.size() == 1) {
+      PartitionSpec spec = Iterables.getOnlyElement(specs);
+      return spec.partitionType();
     }
 
     Map<Integer, PartitionField> fieldMap = Maps.newHashMap();
     Map<Integer, Type> typeMap = Maps.newHashMap();
     Map<Integer, String> nameMap = Maps.newHashMap();
 
-    // sort the spec IDs in descending order to pick up the most recent field names
-    List<Integer> specIds =
-        table.specs().keySet().stream()
-            .sorted(Collections.reverseOrder())
+    // sort specs by ID in descending order to pick up the most recent field names
+    List<PartitionSpec> sortedSpecs =
+        specs.stream()
+            .sorted(Comparator.comparingLong(PartitionSpec::specId).reversed())
             .collect(Collectors.toList());
 
-    for (Integer specId : specIds) {
-      PartitionSpec spec = table.specs().get(specId);
+    Set<Integer> projectedFieldIds = onlyCommonFields ? commonPartitionFieldIds(specs) : null;

Review Comment:
   Thinking more on this if "commonPartitionFields" didn't throw an error you could just always populate it it here, and keep the projection code below gated just on "onlyCommonFields"



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on pull request #6163: Core: Method for building grouping key type

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#issuecomment-1315675501

   Thanks for reviewing, @RussellSpitzer @sunchao!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi merged pull request #6163: Core: Method for building grouping key type

Posted by GitBox <gi...@apache.org>.

aokolnychyi merged PR #6163:
URL: https://github.com/apache/iceberg/pull/6163


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1018462633


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,68 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds an intersection of all partition types in a table.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>Whenever a table has multiple specs, the common partition type is a struct containing only
+   * fields that are present in every spec of the table. In other words, the struct fields represent
+   * an intersection of all partition types.

Review Comment:
   Empty struct if we have no common partitionType?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#issuecomment-1309208030

   @sunchao, here is how we can compute a clustering key for storage-partitioned joins.
   
   cc @RussellSpitzer @flyrain @szehon-ho @rdblue 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1018485570


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -195,41 +198,68 @@ public Void alwaysNull(int fieldId, String sourceName, int sourceId) {
   }
 
   /**
-   * Builds a common partition type for all specs in a table.
+   * Builds an intersection of all partition types in a table.
    *
-   * <p>Whenever a table has multiple specs, the partition type is a struct containing all columns
-   * that have ever been a part of any spec in the table.
+   * <p>Whenever a table has multiple specs, the common partition type is a struct containing only
+   * fields that are present in every spec of the table. In other words, the struct fields represent
+   * an intersection of all partition types.
    *
-   * @param table a table with one or many specs
+   * @param specs one or many specs
    * @return the constructed common partition type
    */
+  public static StructType commonPartitionType(Collection<PartitionSpec> specs) {
+    return buildPartitionType(specs, true /* only common fields */);
+  }
+
+  /**
+   * Builds a unified partition type considering all specs in a table.
+   *
+   * <p>Whenever a table has multiple specs, the partition type is a struct containing all fields
+   * that have ever been a part of any spec in the table. In other words, the struct fields
+   * represent a union of all known partition fields.
+   *
+   * @param table a table with one or many specs
+   * @return the constructed unified partition type
+   */
   public static StructType partitionType(Table table) {
+    return buildPartitionType(table.specs().values(), false /* only common fields */);
+  }
+
+  private static StructType buildPartitionType(
+      Collection<PartitionSpec> specs, boolean onlyCommonFields) {
+
     // we currently don't know the output type of unknown transforms
-    List<Transform<?, ?>> unknownTransforms = collectUnknownTransforms(table);
+    List<Transform<?, ?>> unknownTransforms = collectUnknownTransforms(specs);
     ValidationException.check(
         unknownTransforms.isEmpty(),
-        "Cannot build table partition type, unknown transforms: %s",
+        "Cannot build partition type, unknown transforms: %s",
         unknownTransforms);
 
-    if (table.specs().size() == 1) {
-      return table.spec().partitionType();
+    if (specs.size() == 1) {
+      PartitionSpec spec = Iterables.getOnlyElement(specs);
+      return spec.partitionType();
     }
 
     Map<Integer, PartitionField> fieldMap = Maps.newHashMap();
     Map<Integer, Type> typeMap = Maps.newHashMap();
     Map<Integer, String> nameMap = Maps.newHashMap();
 
-    // sort the spec IDs in descending order to pick up the most recent field names
-    List<Integer> specIds =
-        table.specs().keySet().stream()
-            .sorted(Collections.reverseOrder())
+    // sort specs by ID in descending order to pick up the most recent field names
+    List<PartitionSpec> sortedSpecs =
+        specs.stream()
+            .sorted(Comparator.comparingLong(PartitionSpec::specId).reversed())
             .collect(Collectors.toList());
 
-    for (Integer specId : specIds) {
-      PartitionSpec spec = table.specs().get(specId);
+    Set<Integer> projectedFieldIds = onlyCommonFields ? commonPartitionFieldIds(specs) : null;

Review Comment:
   This is a set of fields to keep IFF onlyCommonFields is true? 
   
   The logic may be easier for me to understand if it was just
   
   ```
   onlyCommonFields ? commonPartitionFieldIds() : AllFieldIDs()
   ```
   
   or alternatively
   
   ```
   fieldsToSkip = onlyCommonFields ? allIds -- commonPartitionFieldIds() : empty() 
   ```
   
   Rather than using "null" as a special condition to skip doing the contains check . That may just be me though
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1018496077


##########
core/src/main/java/org/apache/iceberg/Partitioning.java:
##########
@@ -298,4 +324,33 @@ private static boolean compatibleTransforms(Transform<?, ?> t1, Transform<?, ?>
         || t1.equals(Transforms.alwaysNull())
         || t2.equals(Transforms.alwaysNull());
   }
+
+  private static Set<Integer> commonPartitionFieldIds(Collection<PartitionSpec> specs) {
+    Set<Integer> commonFieldIds = Sets.newHashSet();
+
+    int specIndex = 0;
+    for (PartitionSpec spec : specs) {
+      if (specIndex == 0) {
+        commonFieldIds.addAll(activeFieldIds(spec));
+      } else {
+        commonFieldIds.retainAll(activeFieldIds(spec));
+      }
+
+      specIndex++;
+    }
+
+    ValidationException.check(
+        commonFieldIds.size() > 0 || specs.size() == 1,
+        "Specs do not have any common fields: %s",
+        specs);
+
+    return commonFieldIds;
+  }
+
+  private static List<Integer> activeFieldIds(PartitionSpec spec) {
+    return spec.fields().stream()

Review Comment:
   This weeds out fields that have the same ID but were deleted in V1 tables correct? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6163: Core: Method for building common partition type

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on code in PR #6163:
URL: https://github.com/apache/iceberg/pull/6163#discussion_r1019735287


##########
core/src/test/java/org/apache/iceberg/TestPartitioning.java:
##########
@@ -43,6 +43,12 @@ public class TestPartitioning {
           required(1, "id", Types.IntegerType.get()),
           required(2, "data", Types.StringType.get()),
           required(3, "category", Types.StringType.get()));
+  private static final PartitionSpec BY_DATA_SPEC =
+      PartitionSpec.builderFor(SCHEMA).identity("data").build();
+  private static final PartitionSpec BY_CATEGORY_DATA_SPEC =
+      PartitionSpec.builderFor(SCHEMA).identity("category").identity("data").build();
+  private static final PartitionSpec BY_DATA_CATEGORY_BUCKET_SPEC =
+      PartitionSpec.builderFor(SCHEMA).identity("data").bucket("category", 8).build();
 

Review Comment:
   There are some slight differences in how other tests behave for different format versions. We can cover that with assumes and v1/v2 checks. Since it is an existing class, though, I'd consider refactoring in a separate PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org