You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/03/29 23:24:00 UTC
[GitHub] [iceberg] rdblue commented on a change in pull request #2284: Core: reassign the partition field IDs and reuse any existing IDs

rdblue commented on a change in pull request #2284:
URL: https://github.com/apache/iceberg/pull/2284#discussion_r603676375



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -529,7 +530,8 @@ public TableMetadata updatePartitionSpec(PartitionSpec newPartitionSpec) {
         .addAll(specs);
     if (!specsById.containsKey(newDefaultSpecId)) {
       // get a fresh spec to ensure the spec ID is set to the new default
-      builder.add(freshSpec(newDefaultSpecId, schema, newPartitionSpec));
+      builder.add(freshSpec(newDefaultSpecId, schema, newPartitionSpec, formatVersion,
+          specs, new AtomicInteger(lastAssignedPartitionId)));

Review comment:
       The problem from [my comment](https://github.com/apache/iceberg/pull/2089#discussion_r567154516) that needs to be fixed does not affect this method. The logic in `BaseUpdatePartitionSpec` already covers spec evolution. The problem I was referring to is that when a table replacement is created, all of the metadata is new and can possibly conflict with the existing table's metadata.
   
   For example, if I have a table partitioned by `1000: ts_day=day(ts)` and I replace it with one partitioned by `1000: id_bucket=bucket(16, id)` then the two partition IDs collide. The `buildReplacement` method needs to handle reassigning IDs for partition specs just like it does for schemas.
   
   The specs that are passed into `updatePartitionSpec` are already based on the current spec and so the IDs are assigned correctly and are consistent with all previous table specs. There is no "bug" to fix in this method. However, there is an improvement that we can make that is similar. Currently, if you add a partition field that was previously part of the table but is not any longer, it is assigned a completely new ID.
   
   For example, if I have a table partitioned by `1000: ts_day=day(ts)` and then update it by dropping the `ts_day` partition, it would have one spec with `ts_day` and one that is unpartitioned. Then if I go back and add `day(ts)` again, it would create `1001: ts_day=day(ts)`. That is perfectly fine and consistent, but it would be better to go back and reuse ID 1000 for the new field.
   
   We can detect that fields should be un-deleted rather than added by going back to through old partition specs for the table. That will recover the old ID, but there is an additional issue to solve: when un-deleting a partition field in v1, we need to make sure that the un-deleted field replaces the `void` field that was added when the field was deleted. For v2, it would also be nice to use the original partition field order but that's not required.
   
   I think that this improvement is a good thing to do, but should not be part of this change because this PR is needed to fix the blocker bug with `buildReplacement`. Also, I think that the improvement for spec updates should be built into `BaseUpdatePartitionSpec` rather than done in table metadata.
   
   Could you separate this into another PR, @jun-he?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org