You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/03/01 06:00:10 UTC

[GitHub] [iceberg] jun-he opened a new pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

jun-he opened a new pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284


    Reassign the partition field IDs and reuse any existing IDs for equivalent fields when refreshing spec.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
jun-he commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r609348817



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -842,18 +844,42 @@ private static SortOrder updateSortOrderSchema(Schema schema, SortOrder sortOrde
     return builder.build();
   }
 
-  private static PartitionSpec freshSpec(int specId, Schema schema, PartitionSpec partitionSpec) {
+  private static PartitionSpec freshSpec(int specId, Schema schema, PartitionSpec partitionSpec, int formatVersion,
+                                         List<PartitionSpec> specs, AtomicInteger lastPartitionId) {
     PartitionSpec.Builder specBuilder = PartitionSpec.builderFor(schema)
         .withSpecId(specId);
 
-    for (PartitionField field : partitionSpec.fields()) {
-      // look up the name of the source field in the old schema to get the new schema's id
-      String sourceName = partitionSpec.schema().findColumnName(field.sourceId());
-      specBuilder.add(
-          schema.findField(sourceName).fieldId(),
-          field.fieldId(),
-          field.name(),
-          field.transform().toString());
+    if (formatVersion > 1) {
+      Map<Pair<Integer, String>, Integer> transformToFieldId = specs.stream()
+          .flatMap(spec -> spec.fields().stream())
+          .collect(Collectors.toMap(
+              field -> Pair.of(field.sourceId(), field.transform().toString()),
+              PartitionField::fieldId,
+              (n1, n2) -> n2));
+
+      for (PartitionField field : partitionSpec.fields()) {

Review comment:
       👌  updated it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r603676375



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -529,7 +530,8 @@ public TableMetadata updatePartitionSpec(PartitionSpec newPartitionSpec) {
         .addAll(specs);
     if (!specsById.containsKey(newDefaultSpecId)) {
       // get a fresh spec to ensure the spec ID is set to the new default
-      builder.add(freshSpec(newDefaultSpecId, schema, newPartitionSpec));
+      builder.add(freshSpec(newDefaultSpecId, schema, newPartitionSpec, formatVersion,
+          specs, new AtomicInteger(lastAssignedPartitionId)));

Review comment:
       The problem from [my comment](https://github.com/apache/iceberg/pull/2089#discussion_r567154516) that needs to be fixed does not affect this method. The logic in `BaseUpdatePartitionSpec` already covers spec evolution. The problem I was referring to is that when a table replacement is created, all of the metadata is new and can possibly conflict with the existing table's metadata.
   
   For example, if I have a table partitioned by `1000: ts_day=day(ts)` and I replace it with one partitioned by `1000: id_bucket=bucket(16, id)` then the two partition IDs collide. The `buildReplacement` method needs to handle reassigning IDs for partition specs just like it does for schemas.
   
   The specs that are passed into `updatePartitionSpec` are already based on the current spec and so the IDs are assigned correctly and are consistent with all previous table specs. There is no "bug" to fix in this method. However, there is an improvement that we can make that is similar. Currently, if you add a partition field that was previously part of the table but is not any longer, it is assigned a completely new ID.
   
   For example, if I have a table partitioned by `1000: ts_day=day(ts)` and then update it by dropping the `ts_day` partition, it would have one spec with `ts_day` and one that is unpartitioned. Then if I go back and add `day(ts)` again, it would create `1001: ts_day=day(ts)`. That is perfectly fine and consistent, but it would be better to go back and reuse ID 1000 for the new field.
   
   We can detect that fields should be un-deleted rather than added by going back to through old partition specs for the table. That will recover the old ID, but there is an additional issue to solve: when un-deleting a partition field in v1, we need to make sure that the un-deleted field replaces the `void` field that was added when the field was deleted. For v2, it would also be nice to use the original partition field order but that's not required.
   
   I think that this improvement is a good thing to do, but should not be part of this change because this PR is needed to fix the blocker bug with `buildReplacement`. Also, I think that the improvement for spec updates should be built into `BaseUpdatePartitionSpec` rather than done in table metadata.
   
   Could you separate this into another PR, @jun-he?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
jun-he commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r609348299



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -842,18 +844,42 @@ private static SortOrder updateSortOrderSchema(Schema schema, SortOrder sortOrde
     return builder.build();
   }
 
-  private static PartitionSpec freshSpec(int specId, Schema schema, PartitionSpec partitionSpec) {
+  private static PartitionSpec freshSpec(int specId, Schema schema, PartitionSpec partitionSpec, int formatVersion,
+                                         List<PartitionSpec> specs, AtomicInteger lastPartitionId) {

Review comment:
       Thanks for the comments. Updated it accordingly.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
jun-he commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r609349189



##########
File path: core/src/test/java/org/apache/iceberg/TestTableMetadata.java
##########
@@ -587,6 +587,62 @@ public void testInvalidUpdatePartitionSpecForV1Table() throws Exception {
         () -> metadata.updatePartitionSpec(spec));
   }
 
+  @Test
+  public void testUpdatePartitionSpecForV1Table() {
+    Schema schema = new Schema(
+        Types.NestedField.required(1, "x", Types.LongType.get()),
+        Types.NestedField.required(2, "y", Types.LongType.get())
+    );
+
+    PartitionSpec spec = PartitionSpec.builderFor(schema).withSpecId(0)
+        .bucket("x", 8)
+        .build();
+    String location = "file://tmp/db/table";
+
+    TableMetadata metadata = TableMetadata.newTableMetadata(
+        schema, spec, SortOrder.unsorted(), location, ImmutableMap.of(), 1);
+    TableMetadata updated = metadata.updatePartitionSpec(spec);
+    Assert.assertEquals(spec, updated.spec());
+
+    spec = PartitionSpec.builderFor(schema).withSpecId(1)

Review comment:
       Yep, updated the test accordingly.

##########
File path: core/src/test/java/org/apache/iceberg/TestTableMetadata.java
##########
@@ -587,6 +587,62 @@ public void testInvalidUpdatePartitionSpecForV1Table() throws Exception {
         () -> metadata.updatePartitionSpec(spec));
   }
 
+  @Test
+  public void testUpdatePartitionSpecForV1Table() {
+    Schema schema = new Schema(
+        Types.NestedField.required(1, "x", Types.LongType.get()),
+        Types.NestedField.required(2, "y", Types.LongType.get())
+    );
+
+    PartitionSpec spec = PartitionSpec.builderFor(schema).withSpecId(0)
+        .bucket("x", 8)
+        .build();
+    String location = "file://tmp/db/table";
+
+    TableMetadata metadata = TableMetadata.newTableMetadata(
+        schema, spec, SortOrder.unsorted(), location, ImmutableMap.of(), 1);
+    TableMetadata updated = metadata.updatePartitionSpec(spec);
+    Assert.assertEquals(spec, updated.spec());
+
+    spec = PartitionSpec.builderFor(schema).withSpecId(1)

Review comment:
       👌 , updated the test accordingly.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
jun-he commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r628823549



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -703,6 +705,35 @@ public TableMetadata removeSnapshotLogEntries(Set<Long> snapshotIds) {
         snapshots, newSnapshotLog, addPreviousFile(file, lastUpdatedMillis));
   }
 
+  private PartitionSpec reassignPartitionIds(PartitionSpec partitionSpec, AtomicInteger lastPartitionId) {
+    if (formatVersion > 1) {
+      Map<Pair<Integer, String>, Integer> transformToFieldId = specs.stream()
+          .flatMap(spec -> spec.fields().stream())
+          .collect(Collectors.toMap(
+              field -> Pair.of(field.sourceId(), field.transform().toString()),
+              PartitionField::fieldId,
+              (n1, n2) -> n2));
+
+      PartitionSpec.Builder specBuilder = PartitionSpec.builderFor(partitionSpec.schema())
+          .withSpecId(partitionSpec.specId());
+
+      for (PartitionField field : partitionSpec.fields()) {
+        // reassign the partition field ids
+        Integer fieldId = transformToFieldId.computeIfAbsent(
+            Pair.of(field.sourceId(), field.transform().toString()), k -> lastPartitionId.incrementAndGet());
+        specBuilder.add(
+            field.sourceId(),
+            fieldId,
+            field.name(),
+            field.transform());
+      }
+      return specBuilder.build();
+    } else {
+      // noop for v1 table
+      return partitionSpec;

Review comment:
       As mentioned in https://github.com/apache/iceberg/pull/2284#discussion_r609381706, the issue to replace old one using void transform is that the table schema might be changed and cannot create void transform if the old source field is removed.
   @rdblue  do you think we should use a dummy source field in this case?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
jun-he commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r609381706



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -529,7 +530,8 @@ public TableMetadata updatePartitionSpec(PartitionSpec newPartitionSpec) {
         .addAll(specs);
     if (!specsById.containsKey(newDefaultSpecId)) {
       // get a fresh spec to ensure the spec ID is set to the new default
-      builder.add(freshSpec(newDefaultSpecId, schema, newPartitionSpec));
+      builder.add(freshSpec(newDefaultSpecId, schema, newPartitionSpec, formatVersion,
+          specs, new AtomicInteger(lastAssignedPartitionId)));

Review comment:
       @rdblue  I updated the PR accordingly. 
   Seems the logic of `reassignPartitionIds` can only work for V2 table. 
   For V1 table,  if a table has a schema with a single field `ts` and is partitioned by `1000: ts_day=day(ts)` and then is replaced with a new schema with a single field `id` and is partitioned by `1000: id_bucket=bucket(16, id)`, it cannot reassign spec including `1000: ts_day=null` and  `1001: id_bucket=bucket(16, id)` because the new schema does not contain `ts`  any more.
   Therefore, I kept `reassignPartitionIds` to be no op in the PR, which is the current behavior for V1 table. 
   How do you feel about this?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
jun-he commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r628837253



##########
File path: core/src/test/java/org/apache/iceberg/TestTableMetadata.java
##########
@@ -593,6 +593,74 @@ public void testInvalidUpdatePartitionSpecForV1Table() throws Exception {
         () -> metadata.updatePartitionSpec(spec));
   }
 
+  @Test
+  public void testBuildReplacementForV1Table() {
+    Schema schema = new Schema(
+        Types.NestedField.required(1, "x", Types.LongType.get()),
+        Types.NestedField.required(2, "y", Types.LongType.get())
+    );
+    PartitionSpec spec = PartitionSpec.builderFor(schema).withSpecId(0)
+        .identity("x")
+        .identity("y")
+        .build();
+    String location = "file://tmp/db/table";
+    TableMetadata metadata = TableMetadata.newTableMetadata(
+        schema, spec, SortOrder.unsorted(), location, ImmutableMap.of(), 1);
+    Assert.assertEquals(spec, metadata.spec());
+
+    Schema updatedSchema = new Schema(
+        Types.NestedField.required(1, "x", Types.LongType.get()),
+        Types.NestedField.required(2, "z", Types.StringType.get())
+    );
+    PartitionSpec updatedSpec = PartitionSpec.builderFor(updatedSchema).withSpecId(0)
+        .bucket("z", 8)
+        .identity("x")
+        .build();
+    TableMetadata updated = metadata.buildReplacement(
+        updatedSchema, updatedSpec, SortOrder.unsorted(), location, ImmutableMap.of());
+    PartitionSpec expected = PartitionSpec.builderFor(updated.schema()).withSpecId(1)
+        .add(3, 1000, "z_bucket", "bucket[8]")
+        .add(1, 1001, "x", "identity")
+        .build();

Review comment:
       This test is for v2 table, which does not need void transform partition field for removed column.
   Also, for v1 table, this `.add(2, 1001, "y", "void")` seems not working as no field `y` in the updated schema.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
jun-he commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r628837361



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -703,6 +705,35 @@ public TableMetadata removeSnapshotLogEntries(Set<Long> snapshotIds) {
         snapshots, newSnapshotLog, addPreviousFile(file, lastUpdatedMillis));
   }
 
+  private PartitionSpec reassignPartitionIds(PartitionSpec partitionSpec, AtomicInteger lastPartitionId) {

Review comment:
       Updated.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r625440214



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -703,6 +705,35 @@ public TableMetadata removeSnapshotLogEntries(Set<Long> snapshotIds) {
         snapshots, newSnapshotLog, addPreviousFile(file, lastUpdatedMillis));
   }
 
+  private PartitionSpec reassignPartitionIds(PartitionSpec partitionSpec, AtomicInteger lastPartitionId) {

Review comment:
       Minor: this is private, but I think it would be a bit better to use `TypeUtil.NextID` just because that removes confusion about whether `incrementAndGet` should be used or `getAndIncrement`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
jun-he commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r609071906



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -529,7 +530,8 @@ public TableMetadata updatePartitionSpec(PartitionSpec newPartitionSpec) {
         .addAll(specs);
     if (!specsById.containsKey(newDefaultSpecId)) {
       // get a fresh spec to ensure the spec ID is set to the new default
-      builder.add(freshSpec(newDefaultSpecId, schema, newPartitionSpec));
+      builder.add(freshSpec(newDefaultSpecId, schema, newPartitionSpec, formatVersion,
+          specs, new AtomicInteger(lastAssignedPartitionId)));

Review comment:
       Thanks for the details. It makes sense and I will update the PR accordingly.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r642710416



##########
File path: core/src/test/java/org/apache/iceberg/TestTableMetadata.java
##########
@@ -593,6 +593,74 @@ public void testInvalidUpdatePartitionSpecForV1Table() throws Exception {
         () -> metadata.updatePartitionSpec(spec));
   }
 
+  @Test
+  public void testBuildReplacementForV1Table() {
+    Schema schema = new Schema(
+        Types.NestedField.required(1, "x", Types.LongType.get()),
+        Types.NestedField.required(2, "y", Types.LongType.get())
+    );
+    PartitionSpec spec = PartitionSpec.builderFor(schema).withSpecId(0)
+        .identity("x")
+        .identity("y")
+        .build();
+    String location = "file://tmp/db/table";
+    TableMetadata metadata = TableMetadata.newTableMetadata(
+        schema, spec, SortOrder.unsorted(), location, ImmutableMap.of(), 1);
+    Assert.assertEquals(spec, metadata.spec());
+
+    Schema updatedSchema = new Schema(
+        Types.NestedField.required(1, "x", Types.LongType.get()),
+        Types.NestedField.required(2, "z", Types.StringType.get())
+    );
+    PartitionSpec updatedSpec = PartitionSpec.builderFor(updatedSchema).withSpecId(0)
+        .bucket("z", 8)
+        .identity("x")
+        .build();
+    TableMetadata updated = metadata.buildReplacement(
+        updatedSchema, updatedSpec, SortOrder.unsorted(), location, ImmutableMap.of());
+    PartitionSpec expected = PartitionSpec.builderFor(updated.schema()).withSpecId(1)
+        .add(3, 1000, "z_bucket", "bucket[8]")
+        .add(1, 1001, "x", "identity")
+        .build();

Review comment:
       The test name is `testBuildReplacementForV1Table`, which is why I thought that the changes should be only v1 changes.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue closed pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
rdblue closed pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r603678133



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -842,18 +844,42 @@ private static SortOrder updateSortOrderSchema(Schema schema, SortOrder sortOrde
     return builder.build();
   }
 
-  private static PartitionSpec freshSpec(int specId, Schema schema, PartitionSpec partitionSpec) {
+  private static PartitionSpec freshSpec(int specId, Schema schema, PartitionSpec partitionSpec, int formatVersion,
+                                         List<PartitionSpec> specs, AtomicInteger lastPartitionId) {
     PartitionSpec.Builder specBuilder = PartitionSpec.builderFor(schema)
         .withSpecId(specId);
 
-    for (PartitionField field : partitionSpec.fields()) {
-      // look up the name of the source field in the old schema to get the new schema's id
-      String sourceName = partitionSpec.schema().findColumnName(field.sourceId());
-      specBuilder.add(
-          schema.findField(sourceName).fieldId(),
-          field.fieldId(),
-          field.name(),
-          field.transform().toString());
+    if (formatVersion > 1) {
+      Map<Pair<Integer, String>, Integer> transformToFieldId = specs.stream()
+          .flatMap(spec -> spec.fields().stream())
+          .collect(Collectors.toMap(
+              field -> Pair.of(field.sourceId(), field.transform().toString()),
+              PartitionField::fieldId,
+              (n1, n2) -> n2));
+
+      for (PartitionField field : partitionSpec.fields()) {

Review comment:
       This logic looks mostly correct for a `reassignPartitionIds` method, but I would separate the concern of updating the source field ID references (what `freshSpec` originally did) from reassigning partition IDs to be consistent with other specs in the table. We should use `freshSpec` to reassign the source IDs to the ones used in the current table schema, and then we should call `reassignPartitionIds` to make them match existing partition specs.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
jun-he commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r628837283



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -703,6 +705,35 @@ public TableMetadata removeSnapshotLogEntries(Set<Long> snapshotIds) {
         snapshots, newSnapshotLog, addPreviousFile(file, lastUpdatedMillis));
   }
 
+  private PartitionSpec reassignPartitionIds(PartitionSpec partitionSpec, AtomicInteger lastPartitionId) {
+    if (formatVersion > 1) {
+      Map<Pair<Integer, String>, Integer> transformToFieldId = specs.stream()
+          .flatMap(spec -> spec.fields().stream())
+          .collect(Collectors.toMap(
+              field -> Pair.of(field.sourceId(), field.transform().toString()),
+              PartitionField::fieldId,
+              (n1, n2) -> n2));
+
+      PartitionSpec.Builder specBuilder = PartitionSpec.builderFor(partitionSpec.schema())
+          .withSpecId(partitionSpec.specId());
+
+      for (PartitionField field : partitionSpec.fields()) {
+        // reassign the partition field ids
+        Integer fieldId = transformToFieldId.computeIfAbsent(

Review comment:
       Updated.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
jun-he commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r628837253



##########
File path: core/src/test/java/org/apache/iceberg/TestTableMetadata.java
##########
@@ -593,6 +593,74 @@ public void testInvalidUpdatePartitionSpecForV1Table() throws Exception {
         () -> metadata.updatePartitionSpec(spec));
   }
 
+  @Test
+  public void testBuildReplacementForV1Table() {
+    Schema schema = new Schema(
+        Types.NestedField.required(1, "x", Types.LongType.get()),
+        Types.NestedField.required(2, "y", Types.LongType.get())
+    );
+    PartitionSpec spec = PartitionSpec.builderFor(schema).withSpecId(0)
+        .identity("x")
+        .identity("y")
+        .build();
+    String location = "file://tmp/db/table";
+    TableMetadata metadata = TableMetadata.newTableMetadata(
+        schema, spec, SortOrder.unsorted(), location, ImmutableMap.of(), 1);
+    Assert.assertEquals(spec, metadata.spec());
+
+    Schema updatedSchema = new Schema(
+        Types.NestedField.required(1, "x", Types.LongType.get()),
+        Types.NestedField.required(2, "z", Types.StringType.get())
+    );
+    PartitionSpec updatedSpec = PartitionSpec.builderFor(updatedSchema).withSpecId(0)
+        .bucket("z", 8)
+        .identity("x")
+        .build();
+    TableMetadata updated = metadata.buildReplacement(
+        updatedSchema, updatedSpec, SortOrder.unsorted(), location, ImmutableMap.of());
+    PartitionSpec expected = PartitionSpec.builderFor(updated.schema()).withSpecId(1)
+        .add(3, 1000, "z_bucket", "bucket[8]")
+        .add(1, 1001, "x", "identity")
+        .build();

Review comment:
       This test is for v2 table, which does not need void transform partition field for removed column.
   This `.add(2, 1001, "y", "void")` seems not working with the error fo `Cannot find source column: 2` as no field `y` in the updated schema.
   This somehow means v1 table cannot use `void` in this case.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
jun-he commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r643685642



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -529,7 +530,8 @@ public TableMetadata updatePartitionSpec(PartitionSpec newPartitionSpec) {
         .addAll(specs);
     if (!specsById.containsKey(newDefaultSpecId)) {
       // get a fresh spec to ensure the spec ID is set to the new default
-      builder.add(freshSpec(newDefaultSpecId, schema, newPartitionSpec));
+      builder.add(freshSpec(newDefaultSpecId, schema, newPartitionSpec, formatVersion,
+          specs, new AtomicInteger(lastAssignedPartitionId)));

Review comment:
       @rdblue ah, I see. Thanks for the comment. If it cannot happen when dropping a column, it won't be a problem. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r625440979



##########
File path: core/src/test/java/org/apache/iceberg/TestTableMetadata.java
##########
@@ -593,6 +593,74 @@ public void testInvalidUpdatePartitionSpecForV1Table() throws Exception {
         () -> metadata.updatePartitionSpec(spec));
   }
 
+  @Test
+  public void testBuildReplacementForV1Table() {
+    Schema schema = new Schema(
+        Types.NestedField.required(1, "x", Types.LongType.get()),
+        Types.NestedField.required(2, "y", Types.LongType.get())
+    );
+    PartitionSpec spec = PartitionSpec.builderFor(schema).withSpecId(0)
+        .identity("x")
+        .identity("y")
+        .build();
+    String location = "file://tmp/db/table";
+    TableMetadata metadata = TableMetadata.newTableMetadata(
+        schema, spec, SortOrder.unsorted(), location, ImmutableMap.of(), 1);
+    Assert.assertEquals(spec, metadata.spec());
+
+    Schema updatedSchema = new Schema(
+        Types.NestedField.required(1, "x", Types.LongType.get()),
+        Types.NestedField.required(2, "z", Types.StringType.get())
+    );
+    PartitionSpec updatedSpec = PartitionSpec.builderFor(updatedSchema).withSpecId(0)
+        .bucket("z", 8)
+        .identity("x")
+        .build();
+    TableMetadata updated = metadata.buildReplacement(
+        updatedSchema, updatedSpec, SortOrder.unsorted(), location, ImmutableMap.of());
+    PartitionSpec expected = PartitionSpec.builderFor(updated.schema()).withSpecId(1)
+        .add(3, 1000, "z_bucket", "bucket[8]")
+        .add(1, 1001, "x", "identity")
+        .build();
+    Assert.assertEquals(
+        "Should reassign the partition field IDs and reuse any existing IDs for equivalent fields",
+        expected, updated.spec());
+  }
+
+  @Test
+  public void testBuildReplacementForV2Table() {

Review comment:
       Looks good.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
jun-he commented on pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#issuecomment-882111078


   I am going to continue this task and should be shipped together with the release 0.12.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r625438418



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -703,6 +705,35 @@ public TableMetadata removeSnapshotLogEntries(Set<Long> snapshotIds) {
         snapshots, newSnapshotLog, addPreviousFile(file, lastUpdatedMillis));
   }
 
+  private PartitionSpec reassignPartitionIds(PartitionSpec partitionSpec, AtomicInteger lastPartitionId) {
+    if (formatVersion > 1) {
+      Map<Pair<Integer, String>, Integer> transformToFieldId = specs.stream()
+          .flatMap(spec -> spec.fields().stream())
+          .collect(Collectors.toMap(
+              field -> Pair.of(field.sourceId(), field.transform().toString()),
+              PartitionField::fieldId,
+              (n1, n2) -> n2));
+
+      PartitionSpec.Builder specBuilder = PartitionSpec.builderFor(partitionSpec.schema())
+          .withSpecId(partitionSpec.specId());
+
+      for (PartitionField field : partitionSpec.fields()) {
+        // reassign the partition field ids
+        Integer fieldId = transformToFieldId.computeIfAbsent(
+            Pair.of(field.sourceId(), field.transform().toString()), k -> lastPartitionId.incrementAndGet());
+        specBuilder.add(
+            field.sourceId(),
+            fieldId,
+            field.name(),
+            field.transform());
+      }
+      return specBuilder.build();
+    } else {
+      // noop for v1 table
+      return partitionSpec;

Review comment:
       I think the algorithm for v1 tables should be to find out what matches previous specs and re-use those. Then if there is a new field it is moved to the end, and if there is an old field that is unmatched it gets replaced with a `void` transform. Otherwise, this will create partition field ID conflicts in v1 tables.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#issuecomment-890605206


   I'm going to close this because I made the updates in #2906.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r625439021



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -703,6 +705,35 @@ public TableMetadata removeSnapshotLogEntries(Set<Long> snapshotIds) {
         snapshots, newSnapshotLog, addPreviousFile(file, lastUpdatedMillis));
   }
 
+  private PartitionSpec reassignPartitionIds(PartitionSpec partitionSpec, AtomicInteger lastPartitionId) {
+    if (formatVersion > 1) {
+      Map<Pair<Integer, String>, Integer> transformToFieldId = specs.stream()
+          .flatMap(spec -> spec.fields().stream())
+          .collect(Collectors.toMap(
+              field -> Pair.of(field.sourceId(), field.transform().toString()),
+              PartitionField::fieldId,
+              (n1, n2) -> n2));
+
+      PartitionSpec.Builder specBuilder = PartitionSpec.builderFor(partitionSpec.schema())
+          .withSpecId(partitionSpec.specId());
+
+      for (PartitionField field : partitionSpec.fields()) {
+        // reassign the partition field ids
+        Integer fieldId = transformToFieldId.computeIfAbsent(

Review comment:
       Why is this `Integer` instead of `int`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
jun-he commented on pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#issuecomment-792044682


   @rdblue this is another follow up PR of #2089. Can you help to review it? Thanks.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r603679451



##########
File path: core/src/test/java/org/apache/iceberg/TestTableMetadata.java
##########
@@ -587,6 +587,62 @@ public void testInvalidUpdatePartitionSpecForV1Table() throws Exception {
         () -> metadata.updatePartitionSpec(spec));
   }
 
+  @Test
+  public void testUpdatePartitionSpecForV1Table() {
+    Schema schema = new Schema(
+        Types.NestedField.required(1, "x", Types.LongType.get()),
+        Types.NestedField.required(2, "y", Types.LongType.get())
+    );
+
+    PartitionSpec spec = PartitionSpec.builderFor(schema).withSpecId(0)
+        .bucket("x", 8)
+        .build();
+    String location = "file://tmp/db/table";
+
+    TableMetadata metadata = TableMetadata.newTableMetadata(
+        schema, spec, SortOrder.unsorted(), location, ImmutableMap.of(), 1);
+    TableMetadata updated = metadata.updatePartitionSpec(spec);
+    Assert.assertEquals(spec, updated.spec());
+
+    spec = PartitionSpec.builderFor(schema).withSpecId(1)

Review comment:
       This violates v1 requirements because the partition field ID for `bucket(8, x)` is reassigned. It is 1000 in the first spec and 1001 in the second spec.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
jun-he commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r628837253



##########
File path: core/src/test/java/org/apache/iceberg/TestTableMetadata.java
##########
@@ -593,6 +593,74 @@ public void testInvalidUpdatePartitionSpecForV1Table() throws Exception {
         () -> metadata.updatePartitionSpec(spec));
   }
 
+  @Test
+  public void testBuildReplacementForV1Table() {
+    Schema schema = new Schema(
+        Types.NestedField.required(1, "x", Types.LongType.get()),
+        Types.NestedField.required(2, "y", Types.LongType.get())
+    );
+    PartitionSpec spec = PartitionSpec.builderFor(schema).withSpecId(0)
+        .identity("x")
+        .identity("y")
+        .build();
+    String location = "file://tmp/db/table";
+    TableMetadata metadata = TableMetadata.newTableMetadata(
+        schema, spec, SortOrder.unsorted(), location, ImmutableMap.of(), 1);
+    Assert.assertEquals(spec, metadata.spec());
+
+    Schema updatedSchema = new Schema(
+        Types.NestedField.required(1, "x", Types.LongType.get()),
+        Types.NestedField.required(2, "z", Types.StringType.get())
+    );
+    PartitionSpec updatedSpec = PartitionSpec.builderFor(updatedSchema).withSpecId(0)
+        .bucket("z", 8)
+        .identity("x")
+        .build();
+    TableMetadata updated = metadata.buildReplacement(
+        updatedSchema, updatedSpec, SortOrder.unsorted(), location, ImmutableMap.of());
+    PartitionSpec expected = PartitionSpec.builderFor(updated.schema()).withSpecId(1)
+        .add(3, 1000, "z_bucket", "bucket[8]")
+        .add(1, 1001, "x", "identity")
+        .build();

Review comment:
       This test is for v2 table, which does not need void transform partition field for removed column.
   This `.add(2, 1001, "y", "void")` seems not working with the error fo `Cannot find source column: 2` as no field `y` in the updated schema.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r603677197



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -842,18 +844,42 @@ private static SortOrder updateSortOrderSchema(Schema schema, SortOrder sortOrde
     return builder.build();
   }
 
-  private static PartitionSpec freshSpec(int specId, Schema schema, PartitionSpec partitionSpec) {
+  private static PartitionSpec freshSpec(int specId, Schema schema, PartitionSpec partitionSpec, int formatVersion,
+                                         List<PartitionSpec> specs, AtomicInteger lastPartitionId) {

Review comment:
       I noted above that I don't think that we should change the behavior of `freshSpec` for `updatePartitionSpec`. Instead, what about creating a `reassignPartitionIds` method that gets called after `freshSpec` in `buildReplacement`? That would be similar to how schemas are handled, where the existing schema is passed to seed the IDs for the new table schema.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r642709927



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -529,7 +530,8 @@ public TableMetadata updatePartitionSpec(PartitionSpec newPartitionSpec) {
         .addAll(specs);
     if (!specsById.containsKey(newDefaultSpecId)) {
       // get a fresh spec to ensure the spec ID is set to the new default
-      builder.add(freshSpec(newDefaultSpecId, schema, newPartitionSpec));
+      builder.add(freshSpec(newDefaultSpecId, schema, newPartitionSpec, formatVersion,
+          specs, new AtomicInteger(lastAssignedPartitionId)));

Review comment:
       You can't drop a column that is referenced by a partition spec, so I don't think that we would get in that situation.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
jun-he commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r628837253



##########
File path: core/src/test/java/org/apache/iceberg/TestTableMetadata.java
##########
@@ -593,6 +593,74 @@ public void testInvalidUpdatePartitionSpecForV1Table() throws Exception {
         () -> metadata.updatePartitionSpec(spec));
   }
 
+  @Test
+  public void testBuildReplacementForV1Table() {
+    Schema schema = new Schema(
+        Types.NestedField.required(1, "x", Types.LongType.get()),
+        Types.NestedField.required(2, "y", Types.LongType.get())
+    );
+    PartitionSpec spec = PartitionSpec.builderFor(schema).withSpecId(0)
+        .identity("x")
+        .identity("y")
+        .build();
+    String location = "file://tmp/db/table";
+    TableMetadata metadata = TableMetadata.newTableMetadata(
+        schema, spec, SortOrder.unsorted(), location, ImmutableMap.of(), 1);
+    Assert.assertEquals(spec, metadata.spec());
+
+    Schema updatedSchema = new Schema(
+        Types.NestedField.required(1, "x", Types.LongType.get()),
+        Types.NestedField.required(2, "z", Types.StringType.get())
+    );
+    PartitionSpec updatedSpec = PartitionSpec.builderFor(updatedSchema).withSpecId(0)
+        .bucket("z", 8)
+        .identity("x")
+        .build();
+    TableMetadata updated = metadata.buildReplacement(
+        updatedSchema, updatedSpec, SortOrder.unsorted(), location, ImmutableMap.of());
+    PartitionSpec expected = PartitionSpec.builderFor(updated.schema()).withSpecId(1)
+        .add(3, 1000, "z_bucket", "bucket[8]")
+        .add(1, 1001, "x", "identity")
+        .build();

Review comment:
       This test is for v2 table, which does not need void transform partition field for removed column.
   This `.add(2, 1001, "y", "void")` seems not working with the error fo `Cannot find source column: 2` as no field `y` in the updated schema.
   This also means v1 table cannot use `void` to fill the field id gap in this case.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
jun-he commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r628837253



##########
File path: core/src/test/java/org/apache/iceberg/TestTableMetadata.java
##########
@@ -593,6 +593,74 @@ public void testInvalidUpdatePartitionSpecForV1Table() throws Exception {
         () -> metadata.updatePartitionSpec(spec));
   }
 
+  @Test
+  public void testBuildReplacementForV1Table() {
+    Schema schema = new Schema(
+        Types.NestedField.required(1, "x", Types.LongType.get()),
+        Types.NestedField.required(2, "y", Types.LongType.get())
+    );
+    PartitionSpec spec = PartitionSpec.builderFor(schema).withSpecId(0)
+        .identity("x")
+        .identity("y")
+        .build();
+    String location = "file://tmp/db/table";
+    TableMetadata metadata = TableMetadata.newTableMetadata(
+        schema, spec, SortOrder.unsorted(), location, ImmutableMap.of(), 1);
+    Assert.assertEquals(spec, metadata.spec());
+
+    Schema updatedSchema = new Schema(
+        Types.NestedField.required(1, "x", Types.LongType.get()),
+        Types.NestedField.required(2, "z", Types.StringType.get())
+    );
+    PartitionSpec updatedSpec = PartitionSpec.builderFor(updatedSchema).withSpecId(0)
+        .bucket("z", 8)
+        .identity("x")
+        .build();
+    TableMetadata updated = metadata.buildReplacement(
+        updatedSchema, updatedSpec, SortOrder.unsorted(), location, ImmutableMap.of());
+    PartitionSpec expected = PartitionSpec.builderFor(updated.schema()).withSpecId(1)
+        .add(3, 1000, "z_bucket", "bucket[8]")
+        .add(1, 1001, "x", "identity")
+        .build();

Review comment:
       This test is for v2 table, which does not need void transform partition field for removed column.
   This `.add(2, 1001, "y", "void")` seems not working with the error fo `Cannot find source column: 2` as no field `y` in the updated schema.
   This somehow means v1 table cannot use `void` to fill the field id gap in this case.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r603678440



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -529,7 +530,8 @@ public TableMetadata updatePartitionSpec(PartitionSpec newPartitionSpec) {
         .addAll(specs);
     if (!specsById.containsKey(newDefaultSpecId)) {
       // get a fresh spec to ensure the spec ID is set to the new default
-      builder.add(freshSpec(newDefaultSpecId, schema, newPartitionSpec));
+      builder.add(freshSpec(newDefaultSpecId, schema, newPartitionSpec, formatVersion,
+          specs, new AtomicInteger(lastAssignedPartitionId)));

Review comment:
       By the way, I think the examples above provide good test cases to include. I'd also want to include one where the deleted field is not at the end of the spec.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r642710220



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -703,6 +705,35 @@ public TableMetadata removeSnapshotLogEntries(Set<Long> snapshotIds) {
         snapshots, newSnapshotLog, addPreviousFile(file, lastUpdatedMillis));
   }
 
+  private PartitionSpec reassignPartitionIds(PartitionSpec partitionSpec, AtomicInteger lastPartitionId) {
+    if (formatVersion > 1) {
+      Map<Pair<Integer, String>, Integer> transformToFieldId = specs.stream()
+          .flatMap(spec -> spec.fields().stream())
+          .collect(Collectors.toMap(
+              field -> Pair.of(field.sourceId(), field.transform().toString()),
+              PartitionField::fieldId,
+              (n1, n2) -> n2));
+
+      PartitionSpec.Builder specBuilder = PartitionSpec.builderFor(partitionSpec.schema())
+          .withSpecId(partitionSpec.specId());
+
+      for (PartitionField field : partitionSpec.fields()) {
+        // reassign the partition field ids
+        Integer fieldId = transformToFieldId.computeIfAbsent(
+            Pair.of(field.sourceId(), field.transform().toString()), k -> lastPartitionId.incrementAndGet());
+        specBuilder.add(
+            field.sourceId(),
+            fieldId,
+            field.name(),
+            field.transform());
+      }
+      return specBuilder.build();
+    } else {
+      // noop for v1 table
+      return partitionSpec;

Review comment:
       @jun-he, you can't drop a column if it is referenced by the partition spec. I don't think we could get into the case you mentioned because of that limitation. When this method runs, the column was referenced by the previous version of the field. So we know that the column exists. After this method runs, the void partition should prevent the column from being removed. Am I missing a case where the column can be dropped but still referenced by a void transform?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r625441337



##########
File path: core/src/test/java/org/apache/iceberg/TestTableMetadata.java
##########
@@ -593,6 +593,74 @@ public void testInvalidUpdatePartitionSpecForV1Table() throws Exception {
         () -> metadata.updatePartitionSpec(spec));
   }
 
+  @Test
+  public void testBuildReplacementForV1Table() {
+    Schema schema = new Schema(
+        Types.NestedField.required(1, "x", Types.LongType.get()),
+        Types.NestedField.required(2, "y", Types.LongType.get())
+    );
+    PartitionSpec spec = PartitionSpec.builderFor(schema).withSpecId(0)
+        .identity("x")
+        .identity("y")
+        .build();
+    String location = "file://tmp/db/table";
+    TableMetadata metadata = TableMetadata.newTableMetadata(
+        schema, spec, SortOrder.unsorted(), location, ImmutableMap.of(), 1);
+    Assert.assertEquals(spec, metadata.spec());
+
+    Schema updatedSchema = new Schema(
+        Types.NestedField.required(1, "x", Types.LongType.get()),
+        Types.NestedField.required(2, "z", Types.StringType.get())
+    );
+    PartitionSpec updatedSpec = PartitionSpec.builderFor(updatedSchema).withSpecId(0)
+        .bucket("z", 8)
+        .identity("x")
+        .build();
+    TableMetadata updated = metadata.buildReplacement(
+        updatedSchema, updatedSpec, SortOrder.unsorted(), location, ImmutableMap.of());
+    PartitionSpec expected = PartitionSpec.builderFor(updated.schema()).withSpecId(1)
+        .add(3, 1000, "z_bucket", "bucket[8]")
+        .add(1, 1001, "x", "identity")
+        .build();

Review comment:
       I think that this needs to be this:
   
   ```java
       PartitionSpec expected = PartitionSpec.builderFor(updated.schema()).withSpecId(1)
           .add(1, 1000, "x", "identity")
           .add(2, 1001, "y", "void")
           .add(3, 1002, "z_bucket", "bucket[8]")
           .build();
   ```
   
   We should be able to create that by updating the existing spec.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org