You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/01/10 20:20:43 UTC

[GitHub] [iceberg] jun-he opened a new pull request #2061: Improve partition spec builder

jun-he opened a new pull request #2061:
URL: https://github.com/apache/iceberg/pull/2061


   Improve partition spec builder to prevent adding duplicate identity partition fields.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2061: Improve partition spec builder

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2061:
URL: https://github.com/apache/iceberg/pull/2061#discussion_r559060311



##########
File path: api/src/test/java/org/apache/iceberg/TestPartitionSpecValidation.java
##########
@@ -123,6 +123,21 @@ public void testMultipleDatePartitionsWithDifferentSourceColumns() {
     PartitionSpec.builderFor(SCHEMA).hour("d").hour("another_d").build();
   }
 
+  @Test
+  public void testMultipleIdentityPartitions() {
+    PartitionSpec.builderFor(SCHEMA).year("d").identity("id").identity("d").identity("s").build();

Review comment:
       > We had people using day("ts").hour("ts") often, which is redundant so we want to fail cases like that because it signals the user does not understand what they are using.
   
   Yes people also asked me about this, and there is little information about this when people read https://iceberg.apache.org/partitioning. Let me add some more doc in that area, plus the info about `UpdatePartitionSepc` and its SQL extension.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on pull request #2061: Improve partition spec builder

Posted by GitBox <gi...@apache.org>.
jun-he commented on pull request #2061:
URL: https://github.com/apache/iceberg/pull/2061#issuecomment-757567574


   @rdblue  can you please take a look? Thanks.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a change in pull request #2061: Improve partition spec builder

Posted by GitBox <gi...@apache.org>.
jun-he commented on a change in pull request #2061:
URL: https://github.com/apache/iceberg/pull/2061#discussion_r559347551



##########
File path: api/src/main/java/org/apache/iceberg/PartitionSpec.java
##########
@@ -319,7 +319,7 @@ public static Builder builderFor(Schema schema) {
     private final Schema schema;
     private final List<PartitionField> fields = Lists.newArrayList();
     private final Set<String> partitionNames = Sets.newHashSet();
-    private Map<Integer, PartitionField> timeFields = Maps.newHashMap();
+    private Map<String, PartitionField> partitionFields = Maps.newHashMap();

Review comment:
       Updated. As api module cannot use Pair from core module, I used map entry for the same purpose.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #2061: Improve partition spec builder

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2061:
URL: https://github.com/apache/iceberg/pull/2061#issuecomment-818089902


   Thanks! I think we may want to improve the error message then since we don't mean redundant partition as much as "Cannot add transform for field which is already used in another partition transform" or something like that


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a change in pull request #2061: Improve partition spec builder

Posted by GitBox <gi...@apache.org>.
jun-he commented on a change in pull request #2061:
URL: https://github.com/apache/iceberg/pull/2061#discussion_r558035672



##########
File path: api/src/test/java/org/apache/iceberg/TestPartitionSpecValidation.java
##########
@@ -123,6 +123,21 @@ public void testMultipleDatePartitionsWithDifferentSourceColumns() {
     PartitionSpec.builderFor(SCHEMA).hour("d").hour("another_d").build();
   }
 
+  @Test
+  public void testMultipleIdentityPartitions() {
+    PartitionSpec.builderFor(SCHEMA).year("d").identity("id").identity("d").identity("s").build();

Review comment:
       Users might not define their spec like this test, which checks one possible code path, where dudup won't apply for date and identity and also won't apply if identity partitions are for different source fields. 
   
   Now, partition spec update can actually add day and year for the same source field, we may consider to drop this constraint. @rdblue how do you feel about this?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2061: Improve partition spec builder

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2061:
URL: https://github.com/apache/iceberg/pull/2061#discussion_r559050950



##########
File path: api/src/main/java/org/apache/iceberg/PartitionSpec.java
##########
@@ -319,7 +319,7 @@ public static Builder builderFor(Schema schema) {
     private final Schema schema;
     private final List<PartitionField> fields = Lists.newArrayList();
     private final Set<String> partitionNames = Sets.newHashSet();
-    private Map<Integer, PartitionField> timeFields = Maps.newHashMap();
+    private Map<String, PartitionField> partitionFields = Maps.newHashMap();

Review comment:
       I like this implementation because it allows easily adding new checks. I agree that the name could be better, like `dedupFields` or `dedupPartitionFields` as Jack suggested. I would not combine this with `fields`. I think that would make both too difficult to reason about.
   
   One more thing I would change is to make the key a `Pair<Integer, String>` instead of a `String` That avoids confusion about how to create the keys, and also prevents unnecessary collisions like `(1, p1)` with `(11, p)`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #2061: Improve partition spec builder

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #2061:
URL: https://github.com/apache/iceberg/pull/2061


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #2061: Improve partition spec builder

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2061:
URL: https://github.com/apache/iceberg/pull/2061#issuecomment-815273258


   @jun-he I believe this code also disables using multiple partition fields derived from a particular field even if they are not identity transforms which is allowed. For example
   
   ```SQL
   CREATE TABLE qa_catalog.db.test_table(
           part bignit COMMENT 'part',
           refreshed_at timestamp,
           date date
           )
          USING iceberg
          PARTITIONED BY (entity, years(date), month(date), day(date), part)
    ```
    
    ```
    java.lang.IllegalArgumentException: Cannot add redundant partition: 1001: date_year: year(2) conflicts with 1002: date_month: month(2)
    ```
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2061: Improve partition spec builder

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2061:
URL: https://github.com/apache/iceberg/pull/2061#discussion_r556182365



##########
File path: api/src/main/java/org/apache/iceberg/PartitionSpec.java
##########
@@ -389,7 +392,7 @@ public Builder year(String sourceName, String targetName) {
       Types.NestedField sourceColumn = findSourceColumn(sourceName);
       PartitionField field = new PartitionField(
           sourceColumn.fieldId(), nextFieldId(), targetName, Transforms.year(sourceColumn.type()));
-      checkForRedundantPartitions(field);
+      checkForRedundantPartitions(field, "dates");

Review comment:
       nit: "dates" has shown up multiple times, move to static variable

##########
File path: api/src/test/java/org/apache/iceberg/TestPartitionSpecValidation.java
##########
@@ -123,6 +123,21 @@ public void testMultipleDatePartitionsWithDifferentSourceColumns() {
     PartitionSpec.builderFor(SCHEMA).hour("d").hour("another_d").build();
   }
 
+  @Test
+  public void testMultipleIdentityPartitions() {
+    PartitionSpec.builderFor(SCHEMA).year("d").identity("id").identity("d").identity("s").build();

Review comment:
       I think I understand the goal this PR tires to achieve, but what is the use case for this? Because they are now not deduped, there are now 2 partition columns "d_year" and "d". Is this actually the right behavior for users?

##########
File path: api/src/main/java/org/apache/iceberg/PartitionSpec.java
##########
@@ -319,7 +319,7 @@ public static Builder builderFor(Schema schema) {
     private final Schema schema;
     private final List<PartitionField> fields = Lists.newArrayList();
     private final Set<String> partitionNames = Sets.newHashSet();
-    private Map<Integer, PartitionField> timeFields = Maps.newHashMap();
+    private Map<String, PartitionField> partitionFields = Maps.newHashMap();

Review comment:
       +1, maybe some name like a `dedupPartitionFields`. And if we are doing this for datetime transforms, we should probably also do it for bucket and truncate and alwaysNull.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on pull request #2061: Improve partition spec builder

Posted by GitBox <gi...@apache.org>.
jun-he commented on pull request #2061:
URL: https://github.com/apache/iceberg/pull/2061#issuecomment-816469940


   @RussellSpitzer The reason to fail `partitions years(date), month(date), day(date)` is explained in https://github.com/apache/iceberg/pull/2061#discussion_r559051327. 
   One workaround is to update the table with additional partitions after creating it.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a change in pull request #2061: Improve partition spec builder

Posted by GitBox <gi...@apache.org>.
jun-he commented on a change in pull request #2061:
URL: https://github.com/apache/iceberg/pull/2061#discussion_r558021203



##########
File path: api/src/main/java/org/apache/iceberg/PartitionSpec.java
##########
@@ -319,7 +319,7 @@ public static Builder builderFor(Schema schema) {
     private final Schema schema;
     private final List<PartitionField> fields = Lists.newArrayList();
     private final Set<String> partitionNames = Sets.newHashSet();
-    private Map<Integer, PartitionField> timeFields = Maps.newHashMap();
+    private Map<String, PartitionField> partitionFields = Maps.newHashMap();

Review comment:
       Thanks @yyanyy and @jackye1995 for the review.
   `alwaysNull` should not be deduped as the same source field might be removed multiple times during the spec evolution.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2061: Improve partition spec builder

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2061:
URL: https://github.com/apache/iceberg/pull/2061#discussion_r556125886



##########
File path: api/src/main/java/org/apache/iceberg/PartitionSpec.java
##########
@@ -319,7 +319,7 @@ public static Builder builderFor(Schema schema) {
     private final Schema schema;
     private final List<PartitionField> fields = Lists.newArrayList();
     private final Set<String> partitionNames = Sets.newHashSet();
-    private Map<Integer, PartitionField> timeFields = Maps.newHashMap();
+    private Map<String, PartitionField> partitionFields = Maps.newHashMap();

Review comment:
       Overall LGTM, one nit is that I think `partitionFields` here would be good to be renamed so that it's easy to tell it's just for collision detection. Also I wonder if we want to do this check for other transformations too (e.g. bucket, and record numBuckets in the string), so that we might be able to combine `fields` and `partitionFields` into potentially something like a LinkedHashMap?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2061: Improve partition spec builder

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2061:
URL: https://github.com/apache/iceberg/pull/2061#issuecomment-762463198


   Thanks @jun-he!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he edited a comment on pull request #2061: Improve partition spec builder

Posted by GitBox <gi...@apache.org>.
jun-he edited a comment on pull request #2061:
URL: https://github.com/apache/iceberg/pull/2061#issuecomment-816469940


   @RussellSpitzer The reason to fail `partitions years(date), month(date), day(date)` is explained in https://github.com/apache/iceberg/pull/2061#discussion_r559051327. 
   One workaround is to update the table with additional time based partitions after creating it with a single time based partition.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2061: Improve partition spec builder

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2061:
URL: https://github.com/apache/iceberg/pull/2061#discussion_r556230507



##########
File path: api/src/test/java/org/apache/iceberg/TestPartitionSpecValidation.java
##########
@@ -123,6 +123,21 @@ public void testMultipleDatePartitionsWithDifferentSourceColumns() {
     PartitionSpec.builderFor(SCHEMA).hour("d").hour("another_d").build();
   }
 
+  @Test
+  public void testMultipleIdentityPartitions() {
+    PartitionSpec.builderFor(SCHEMA).year("d").identity("id").identity("d").identity("s").build();

Review comment:
       I was having a similar question but forgot to ask, not related to this change - why do we not want to have two different time transforms on the same column? Is it just because the more granular one makes the other one not so useful (e.g. when already partitioned by `day("d")`, partition on `year("d")` won't help with query performance)?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2061: Improve partition spec builder

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2061:
URL: https://github.com/apache/iceberg/pull/2061#discussion_r559051327



##########
File path: api/src/test/java/org/apache/iceberg/TestPartitionSpecValidation.java
##########
@@ -123,6 +123,21 @@ public void testMultipleDatePartitionsWithDifferentSourceColumns() {
     PartitionSpec.builderFor(SCHEMA).hour("d").hour("another_d").build();
   }
 
+  @Test
+  public void testMultipleIdentityPartitions() {
+    PartitionSpec.builderFor(SCHEMA).year("d").identity("id").identity("d").identity("s").build();

Review comment:
       This check was introduced because users were accustomed to partitioning by both date and hour, assuming that hour was in [0-23]. In Iceberg, `hours` is the granularity, not a repeating value. We had people using `day("ts").hour("ts")` often, which is redundant so we want to fail cases like that because it signals the user does not understand what they are using.
   
   When updating partitioning, we allow duplication because there is probably existing data in the table stored by day. If the user had to drop `ts_day` to add `ts_hour`, then the metadata tables would also be affected. The result is that metadata queries to find `max(partition.ts_day)` would start to fail. Also, there is no harm in populating `ts_day` for data partitioned by `ts_hour` so we allow it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2061: Improve partition spec builder

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2061:
URL: https://github.com/apache/iceberg/pull/2061#discussion_r556125886



##########
File path: api/src/main/java/org/apache/iceberg/PartitionSpec.java
##########
@@ -319,7 +319,7 @@ public static Builder builderFor(Schema schema) {
     private final Schema schema;
     private final List<PartitionField> fields = Lists.newArrayList();
     private final Set<String> partitionNames = Sets.newHashSet();
-    private Map<Integer, PartitionField> timeFields = Maps.newHashMap();
+    private Map<String, PartitionField> partitionFields = Maps.newHashMap();

Review comment:
       Overall LGTM, one nit that I think `partitionFields` here would be good to be renamed so that it's easy to tell it's just for collision detecting. Also I wonder if we want to do this collision check for other transformations too (like bucket, and record numBuckets in the string), so that we might be able to combine `fields` and `partitionFields` into potentially something like a LinkedHashMap?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org