Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/08/29 12:52:30 UTC

[GitHub] [iceberg] Fokko opened a new pull request, #5665: Core: Don't show dropped fields from the partition spec

Fokko opened a new pull request, #5665:
URL: https://github.com/apache/iceberg/pull/5665

   I would love to get your input on this, @aokolnychyi. The current PR isn't passing because it changes the output. I noticed https://github.com/apache/iceberg/issues/5641 last week, and it looks to be a bug: on some code paths we read all the existing partition specs, while on others we read only the current one. This leads to incorrect results (an empty query result, which is quite concerning), as shown in the issue.
   
   I fired up the debugger and found the issue here:
   ![image](https://user-images.githubusercontent.com/1134248/187204891-3196c3af-539e-4dda-ac61-a27c9ea1a0e3.png)
   
   We can see the two values but only a single field in the struct, so the stats end up null, and when we read the data, Iceberg skips it. I also added this as a test case.
   
   The bug only shows up in V2. I think (though I'm probably missing some historical context here) that we only want to show the current partition spec, hence the change, but this would change the behavior. What do you think, since you're the author of this code?
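   
   For reference, a rough sketch of the scenario from the linked issue, written in the style of the existing extension tests (the table, column names, and `sql(...)` helper here are illustrative, not copied from the actual test):
   
   ```
   // Hypothetical reproduction sketch; names are illustrative.
   sql("CREATE TABLE %s (id BIGINT, ts TIMESTAMP) USING iceberg "
       + "PARTITIONED BY (days(ts)) TBLPROPERTIES ('format-version'='2')", tableName);
   sql("INSERT INTO %s VALUES (1, TIMESTAMP '2022-01-01 10:00:00')", tableName);
   
   // Evolve the partition spec, so the table now carries two specs.
   sql("ALTER TABLE %s REPLACE PARTITION FIELD ts_day WITH hours(ts)", tableName);
   sql("INSERT INTO %s VALUES (2, TIMESTAMP '2022-01-01 11:00:00')", tableName);
   
   // After rewriting manifests, the partition stats are written against the
   // union struct, so a filtered read can prune everything and come back empty.
   sql("CALL %s.system.rewrite_manifests('%s')", catalogName, tableIdent);
   sql("SELECT * FROM %s WHERE ts > TIMESTAMP '2022-01-01 00:00:00'", tableName);
   ```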
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #5665: Core: Correctly project the partition fields

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on code in PR #5665:
URL: https://github.com/apache/iceberg/pull/5665#discussion_r960123754


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkDataFile.java:
##########
@@ -84,10 +93,31 @@ public SparkDataFile(Types.StructType type, StructType sparkType) {
     sortOrderIdPosition = positions.get("sort_order_id");
   }
 
+  private void wrapPartitionSpec(GenericRowWithSchema specRow) {
+    // We get all the partition fields, but want to project to the current one
+    StructType wrappedPartitionStruct = specRow.schema();
+
+    if (!wrappedPartitionStruct.equals(currentWrappedPartitionStruct)) {
+      this.currentWrappedPartitionStruct = wrappedPartitionStruct;
+
+      // The original IDs are lost in translation, therefore we apply the ones that we know

Review Comment:
   Or we could handle the projection before even wrapping `Row` into `SparkDataFile`. We will need to try out these options and see what makes more sense.



[GitHub] [iceberg] aokolnychyi commented on pull request #5665: Core: Correctly project the partition fields

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on PR #5665:
URL: https://github.com/apache/iceberg/pull/5665#issuecomment-1233567807

   I think it has been a while since we updated this action, and there are more problems than what has been reported so far.
   I'll take a look later and refactor/fix as needed.


[GitHub] [iceberg] Fokko closed pull request #5665: Core: Correctly project the partition fields

Posted by GitBox <gi...@apache.org>.
Fokko closed pull request #5665: Core: Correctly project the partition fields
URL: https://github.com/apache/iceberg/pull/5665


[GitHub] [iceberg] Fokko commented on a diff in pull request #5665: Core: Exclude old fields from the partition spec

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #5665:
URL: https://github.com/apache/iceberg/pull/5665#discussion_r958677816


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java:
##########
@@ -212,7 +211,7 @@ public Set<TableCapability> capabilities() {
 
   @Override
   public MetadataColumn[] metadataColumns() {
-    DataType sparkPartitionType = SparkSchemaUtil.convert(Partitioning.partitionType(table()));
+    DataType sparkPartitionType = SparkSchemaUtil.convert(table().spec().partitionType());

Review Comment:
   Reading it a second time, and looking at the failing test, it makes sense to me:
   
   https://github.com/apache/iceberg/blob/dbb8a404f6632a55acb36e949f0e7b84b643cede/spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkMetadataColumns.java#L135-L166
   
   We want to be able to jump back to earlier specs, and therefore we include them in the struct of the metadata column.
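   
   In other words (just a sketch to make the two options concrete; `table` is illustrative):
   
   ```
   // Union of the partition fields across every spec the table has ever had;
   // this is what the metadata column currently exposes.
   Types.StructType unionType = Partitioning.partitionType(table);
   
   // Only the fields of the current spec; the change in this diff would drop
   // the historical partition fields from the metadata column.
   Types.StructType currentType = table.spec().partitionType();
   ```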



[GitHub] [iceberg] rdblue commented on a diff in pull request #5665: Core: Exclude old fields from the partition spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5665:
URL: https://github.com/apache/iceberg/pull/5665#discussion_r957787873


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java:
##########
@@ -212,7 +211,7 @@ public Set<TableCapability> capabilities() {
 
   @Override
   public MetadataColumn[] metadataColumns() {
-    DataType sparkPartitionType = SparkSchemaUtil.convert(Partitioning.partitionType(table()));
+    DataType sparkPartitionType = SparkSchemaUtil.convert(table().spec().partitionType());

Review Comment:
   Wasn't this the unioned spec type on purpose?



[GitHub] [iceberg] rdblue commented on a diff in pull request #5665: Core: Correctly project the partition fields

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5665:
URL: https://github.com/apache/iceberg/pull/5665#discussion_r960098474


##########
api/src/main/java/org/apache/iceberg/PartitionSpec.java:
##########
@@ -125,8 +125,7 @@ public List<PartitionField> getFieldsBySourceId(int fieldId) {
   public StructType partitionType() {
     List<Types.NestedField> structFields = Lists.newArrayListWithExpectedSize(fields.length);
 
-    for (int i = 0; i < fields.length; i += 1) {
-      PartitionField field = fields[i];
+    for (PartitionField field : fields) {

Review Comment:
   Is it worth modifying an unrelated API file to fix this?



[GitHub] [iceberg] aokolnychyi commented on pull request #5665: Core: Correctly project the partition fields

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on PR #5665:
URL: https://github.com/apache/iceberg/pull/5665#issuecomment-1233448714

   I'd love to take a look later today/tomorrow.


[GitHub] [iceberg] Fokko commented on a diff in pull request #5665: Core: Correctly project the partition fields

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #5665:
URL: https://github.com/apache/iceberg/pull/5665#discussion_r960384482


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkDataFile.java:
##########
@@ -84,10 +93,31 @@ public SparkDataFile(Types.StructType type, StructType sparkType) {
     sortOrderIdPosition = positions.get("sort_order_id");
   }
 
+  private void wrapPartitionSpec(GenericRowWithSchema specRow) {
+    // We get all the partition fields, but want to project to the current one
+    StructType wrappedPartitionStruct = specRow.schema();
+
+    if (!wrappedPartitionStruct.equals(currentWrappedPartitionStruct)) {
+      this.currentWrappedPartitionStruct = wrappedPartitionStruct;
+
+      // The original IDs are lost in translation, therefore we apply the ones that we know

Review Comment:
   Thanks @aokolnychyi for chiming in here, much appreciated. I'm just digging into this code for the first time, so bear with me. Wouldn't it make more sense to either pass in all the historical specs (in the case of the metadata column), or just the latest spec when that's all that's needed? Passing in a tuple with the fields of all the specs and then projecting only the ones we need doesn't feel like a scalable approach.



[GitHub] [iceberg] aokolnychyi commented on pull request #5665: Core: Correctly project the partition fields

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on PR #5665:
URL: https://github.com/apache/iceberg/pull/5665#issuecomment-1233510867

   Okay, I think the problem is that we did not adapt this action when we started to output a union partition type in our metadata tables. I suppose we need to add a projection before writing files. Let me take a look at the proposed fix.


[GitHub] [iceberg] Fokko commented on pull request #5665: Core: Correctly project the partition fields

Posted by GitBox <gi...@apache.org>.
Fokko commented on PR #5665:
URL: https://github.com/apache/iceberg/pull/5665#issuecomment-1235113062

   Closing this in favor of https://github.com/apache/iceberg/pull/5691


[GitHub] [iceberg] rdblue commented on a diff in pull request #5665: Core: Correctly project the partition fields

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5665:
URL: https://github.com/apache/iceberg/pull/5665#discussion_r961167787


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkDataFile.java:
##########
@@ -84,10 +93,31 @@ public SparkDataFile(Types.StructType type, StructType sparkType) {
     sortOrderIdPosition = positions.get("sort_order_id");
   }
 
+  private void wrapPartitionSpec(GenericRowWithSchema specRow) {
+    // We get all the partition fields, but want to project to the current one
+    StructType wrappedPartitionStruct = specRow.schema();
+
+    if (!wrappedPartitionStruct.equals(currentWrappedPartitionStruct)) {
+      this.currentWrappedPartitionStruct = wrappedPartitionStruct;
+
+      // The original IDs are lost in translation, therefore we apply the ones that we know

Review Comment:
   Oh, I think I was wrong here. You could do it in `SparkDataFile` because that's reused for all of the data files passed to `writeManifest`.



[GitHub] [iceberg] rdblue commented on a diff in pull request #5665: Core: Correctly project the partition fields

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5665:
URL: https://github.com/apache/iceberg/pull/5665#discussion_r961177492


##########
spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewriteManifestsProcedure.java:
##########
@@ -193,4 +195,30 @@ public void testInvalidRewriteManifestsCases() {
         "Cannot handle an empty identifier",
         () -> sql("CALL %s.system.rewrite_manifests('')", catalogName));
   }
+
+  public void testReplacePartitionField() {

Review Comment:
   Missing `@Test`?



[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #5665: Core: Correctly project the partition fields

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on code in PR #5665:
URL: https://github.com/apache/iceberg/pull/5665#discussion_r960097179


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkDataFile.java:
##########
@@ -84,10 +93,31 @@ public SparkDataFile(Types.StructType type, StructType sparkType) {
     sortOrderIdPosition = positions.get("sort_order_id");
   }
 
+  private void wrapPartitionSpec(GenericRowWithSchema specRow) {
+    // We get all the partition fields, but want to project to the current one
+    StructType wrappedPartitionStruct = specRow.schema();
+
+    if (!wrappedPartitionStruct.equals(currentWrappedPartitionStruct)) {
+      this.currentWrappedPartitionStruct = wrappedPartitionStruct;
+
+      // The original IDs are lost in translation, therefore we apply the ones that we know

Review Comment:
   I think it may not work in all cases (e.g. rewriting manifests for an older spec if the union type had forced renames).
   Let me think a bit.



[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #5665: Core: Correctly project the partition fields

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on code in PR #5665:
URL: https://github.com/apache/iceberg/pull/5665#discussion_r960111505


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkDataFile.java:
##########
@@ -84,10 +93,31 @@ public SparkDataFile(Types.StructType type, StructType sparkType) {
     sortOrderIdPosition = positions.get("sort_order_id");
   }
 
+  private void wrapPartitionSpec(GenericRowWithSchema specRow) {
+    // We get all the partition fields, but want to project to the current one
+    StructType wrappedPartitionStruct = specRow.schema();
+
+    if (!wrappedPartitionStruct.equals(currentWrappedPartitionStruct)) {
+      this.currentWrappedPartitionStruct = wrappedPartitionStruct;
+
+      // The original IDs are lost in translation, therefore we apply the ones that we know

Review Comment:
   Yeah, we need something closer to what we have in `SparkPositionDeltaWrite`, where we essentially solve the same problem (the incoming partition tuple may have more columns than the spec we are trying to write to).
   
   We can pass `Broadcast<Table> table` to `writeManifest`, which would allow us to construct the union partition type via `Partitioning.partitionType` and also give us access to the spec we are trying to write to.
   
   I am not yet sure where the projection should live. We could either change the constructor of `SparkDataFile` and initialize it there:
   
   ```
   public SparkDataFile(PartitionSpec spec, Types.StructType partitionType, StructType sparkType) {
     ...
     this.partitionWrapper = new SparkStructLike(partitionType);
     this.partitionProjection = StructProjection.create(partitionType, spec.partitionType());
     ...
   }
   ```
   
   Or pass a precomputed `StructProjection partitionProjection`, which we would build in `writeManifest`.
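   
   For what it's worth, a rough sketch of that second option, building the projection once in `writeManifest` (hypothetical; the surrounding code and variable names are illustrative):
   
   ```
   // Project from the union partition type (what the metadata table rows carry)
   // down to the partition type of the spec we are writing manifests for.
   Types.StructType unionType = Partitioning.partitionType(table);
   PartitionSpec spec = table.specs().get(specId);
   StructProjection partitionProjection =
       StructProjection.create(unionType, spec.partitionType());
   
   // Reuse the same wrapper and projection for every row.
   SparkDataFile wrapper = new SparkDataFile(unionType, sparkType);
   for (Row row : rows) {
     DataFile file = wrapper.wrap(row);
     StructLike partition = partitionProjection.wrap(file.partition());
     // write the projected partition (plus the rest of the file metadata)
     // into the new manifest
   }
   ```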



[GitHub] [iceberg] rdblue commented on a diff in pull request #5665: Core: Correctly project the partition fields

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5665:
URL: https://github.com/apache/iceberg/pull/5665#discussion_r961164228


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkDataFile.java:
##########
@@ -84,10 +93,31 @@ public SparkDataFile(Types.StructType type, StructType sparkType) {
     sortOrderIdPosition = positions.get("sort_order_id");
   }
 
+  private void wrapPartitionSpec(GenericRowWithSchema specRow) {
+    // We get all the partition fields, but want to project to the current one
+    StructType wrappedPartitionStruct = specRow.schema();
+
+    if (!wrappedPartitionStruct.equals(currentWrappedPartitionStruct)) {
+      this.currentWrappedPartitionStruct = wrappedPartitionStruct;
+
+      // The original IDs are lost in translation, therefore we apply the ones that we know

Review Comment:
   @Fokko, I think what gets passed depends on what's already broadcasted and available (which I suspect is why Anton suggested `Broadcast<Table>`). The most direct route would be to send the combined struct type and the spec (since I think that we always write new manifests with just one spec).
   
   I would probably do this in `toManifest` because I think we want to reuse the projection across rows. Doing it inside `SparkDataFile` would create a new projection each time, which is something we've had to fix in other places. We don't want to do the struct type comparison for each row.



[GitHub] [iceberg] rdblue commented on a diff in pull request #5665: Core: Correctly project the partition fields

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5665:
URL: https://github.com/apache/iceberg/pull/5665#discussion_r960107948


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkDataFile.java:
##########
@@ -84,10 +93,31 @@ public SparkDataFile(Types.StructType type, StructType sparkType) {
     sortOrderIdPosition = positions.get("sort_order_id");
   }
 
+  private void wrapPartitionSpec(GenericRowWithSchema specRow) {
+    // We get all the partition fields, but want to project to the current one
+    StructType wrappedPartitionStruct = specRow.schema();
+
+    if (!wrappedPartitionStruct.equals(currentWrappedPartitionStruct)) {
+      this.currentWrappedPartitionStruct = wrappedPartitionStruct;
+
+      // The original IDs are lost in translation, therefore we apply the ones that we know

Review Comment:
   I think the problem may be that this is trying to fix `SparkDataFile`. There isn't enough information here to project, but that's okay.
   
   I think we need to update the place where `SparkDataFile.partition()` is used, in `RewriteManifestsSparkAction.writeManifest`. That method uses a `SparkDataFile`, but it also has access to the spec. We just need to pass the combined partition struct type to `SparkDataFile` and then add the projection before writing into the new manifest file.


