You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/07/27 16:35:50 UTC

[GitHub] [iceberg] RussellSpitzer opened a new pull request #2877: Fix nested struct pruning

RussellSpitzer opened a new pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on pull request #2877: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#issuecomment-887663556


   @szehon-ho + @aokolnychyi + @rdblue + @cwsteinbach - If you have some time I would be grateful if you could check this out


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#discussion_r680274473



##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/RowDataReader.java
##########
@@ -186,28 +186,29 @@ protected Schema tableSchema() {
   }
 
   private CloseableIterable<InternalRow> newDataIterable(DataTask task, Schema readSchema) {
-    StructInternalRow row = new StructInternalRow(tableSchema.asStruct());
+    Schema taskSchema = task.schema() == null ? tableSchema : task.schema();
+    StructInternalRow row = new StructInternalRow(taskSchema.asStruct());
     CloseableIterable<InternalRow> asSparkRows = CloseableIterable.transform(
         task.asDataTask().rows(), row::setStruct);
     return CloseableIterable.transform(
-        asSparkRows, APPLY_PROJECTION.bind(projection(readSchema, tableSchema))::invoke);
+        asSparkRows, APPLY_PROJECTION.bind(projection(readSchema, taskSchema))::invoke);

Review comment:
       Is this the line that's failing?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#issuecomment-888701079


   Thanks, I'll take a look at this as soon as I can


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#discussion_r680274473



##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/RowDataReader.java
##########
@@ -186,28 +186,29 @@ protected Schema tableSchema() {
   }
 
   private CloseableIterable<InternalRow> newDataIterable(DataTask task, Schema readSchema) {
-    StructInternalRow row = new StructInternalRow(tableSchema.asStruct());
+    Schema taskSchema = task.schema() == null ? tableSchema : task.schema();
+    StructInternalRow row = new StructInternalRow(taskSchema.asStruct());
     CloseableIterable<InternalRow> asSparkRows = CloseableIterable.transform(
         task.asDataTask().rows(), row::setStruct);
     return CloseableIterable.transform(
-        asSparkRows, APPLY_PROJECTION.bind(projection(readSchema, tableSchema))::invoke);
+        asSparkRows, APPLY_PROJECTION.bind(projection(readSchema, taskSchema))::invoke);

Review comment:
       Is this the line that's failing?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#issuecomment-893546408


   Awesome work, @RussellSpitzer!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#discussion_r682932438



##########
File path: api/src/main/java/org/apache/iceberg/util/StructProjection.java
##########
@@ -82,8 +85,31 @@ private StructProjection(StructType structType, StructType projection) {
                   dataField.type().asStructType(), projectedField.type().asStructType());
               break;
             case MAP:
+              MapType projectedMap = projectedField.type().asMapType();
+              MapType originalMap = dataField.type().asMapType();
+
+              boolean keyProjectable = !projectedMap.keyType().isStructType() ||
+                  projectedMap.keyType().equals(originalMap.keyType());
+              boolean valueProjectable = !projectedMap.valueType().isStructType() ||

Review comment:
       Swapped to !isNestedType




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#issuecomment-893416471


   Closes #2783 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#issuecomment-887882956


   I added another test and found an issue with my fix. We create a StructInternalRow with a "type" and it's data "struct"
   When we actually prune the projected record's "type" does not match the struct layout, this is an issue for Map, Array, Struct types.
   
   For example, if we make a BaseFile with a projection we end up with a mapping of ordinal to physical entry in the row, but the StructInternal row does *not* do the same mapping when looking up types. So although my fix works for all non parameterized lookups, it fails on parameterized types since the parameterized types are looked up based on the original layout and not the pruned one.
   
   Ie
   StructInternalRow(TableLayout) stores
   ```
   Types = ( int, string, int, list)
   data = ([1,2], "hello") with Map ( 0 -> 3, 2 -> 1)
   ```
   Since the type lookup doesn't know about the projection it is incorrect ... trying to figure out a fix without breaking everything
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#discussion_r682919095



##########
File path: api/src/main/java/org/apache/iceberg/util/StructProjection.java
##########
@@ -82,8 +85,31 @@ private StructProjection(StructType structType, StructType projection) {
                   dataField.type().asStructType(), projectedField.type().asStructType());
               break;
             case MAP:
+              MapType projectedMap = projectedField.type().asMapType();
+              MapType originalMap = dataField.type().asMapType();
+
+              boolean keyProjectable = !projectedMap.keyType().isStructType() ||
+                  projectedMap.keyType().equals(originalMap.keyType());
+              boolean valueProjectable = !projectedMap.valueType().isStructType() ||
+                  projectedMap.valueType().equals(originalMap.valueType());
+
+              Preconditions.checkArgument(keyProjectable && valueProjectable,
+                  "Cannot perform a projection of a map unless key and value types are primitive or a " +
+                      "struct which is fully projected. Trying to project %s out of %s", projectedField, dataField);
+              nestedProjections[pos] = null;
+              break;
             case LIST:
-              throw new IllegalArgumentException(String.format("Cannot project list or map field: %s", projectedField));
+              ListType projectedList = projectedField.type().asListType();
+              ListType originalList = dataField.type().asListType();
+
+              boolean elementProjectable = !projectedList.elementType().isStructType() ||
+                  projectedList.elementType().equals(originalList.elementType());
+
+              Preconditions.checkArgument(elementProjectable,
+                  "Cannot perform a projection of a list unless it's element is a primitive or a struct which is " +

Review comment:
       What about something shorter, like "Cannot project a partial list element struct: %s from %s"?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer edited a comment on pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

RussellSpitzer edited a comment on pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#issuecomment-887947654


   Ok so trying to fix this from the Source side, the issue here for Entries table is although it reports a schema of
   ```
   status, snapshot_id, sequence_number, data_file <Struct with 15 fields>
   ```
   
   The manifest reader is allowed to project within data file which means the actual GenericManifestFiles it creates have a schema of
   ```
   status, snapshot_id, sequence_number, data_file < pruned columns>
   ```
   
   This means the table schema as set in the read tasks is incorrect and does not match what is actually in the read data.
   
   Creating GenericManfiestFile with projection of data file column in the reader, creating structs with pruned columns and projections
   
   https://github.com/apache/iceberg/blob/83ebd4ed57254822ca26ef9b7a5ea6f528da8b34/core/src/main/java/org/apache/iceberg/ManifestEntriesTable.java#L141-L142
   
   Creating Spark StructInternalRow representation using incorrect schema (full table schema not projected schema used in GenericManfiestFile) no pruned columns or projections
   
   https://github.com/apache/iceberg/blob/c69da8a8c1c2f99de3a1b826514775f0f07bde72/spark/src/main/java/org/apache/iceberg/spark/source/RowDataReader.java#L189


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#discussion_r681323450



##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/RowDataReader.java
##########
@@ -194,20 +195,32 @@ protected Schema tableSchema() {
   }
 
   private static UnsafeProjection projection(Schema finalSchema, Schema readSchema) {
-    StructType struct = SparkSchemaUtil.convert(readSchema);
+    StructType readStruct = SparkSchemaUtil.convert(readSchema);

Review comment:
       Are these changes still needed?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2877: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#discussion_r677741660



##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/RowDataReader.java
##########
@@ -194,20 +194,32 @@ protected Schema tableSchema() {
   }
 
   private static UnsafeProjection projection(Schema finalSchema, Schema readSchema) {
-    StructType struct = SparkSchemaUtil.convert(readSchema);
+    StructType readStruct = SparkSchemaUtil.convert(readSchema);
 
-    List<AttributeReference> refs = JavaConverters.seqAsJavaListConverter(struct.toAttributes()).asJava();
-    List<Attribute> attrs = Lists.newArrayListWithExpectedSize(struct.fields().length);
+    List<AttributeReference> readReferences = JavaConverters.seqAsJavaListConverter(readStruct.toAttributes()).asJava();
+    List<Attribute> attrs = Lists.newArrayListWithExpectedSize(readStruct.fields().length);
     List<org.apache.spark.sql.catalyst.expressions.Expression> exprs =
-        Lists.newArrayListWithExpectedSize(struct.fields().length);
+        Lists.newArrayListWithExpectedSize(readStruct.fields().length);
 
-    for (AttributeReference ref : refs) {
+    for (AttributeReference ref : readReferences) {
       attrs.add(ref.toAttribute());
     }
 
     for (Types.NestedField field : finalSchema.columns()) {
-      int indexInReadSchema = struct.fieldIndex(field.name());
-      exprs.add(refs.get(indexInReadSchema));
+      int indexInReadSchema = readStruct.fieldIndex(field.name());
+      if (field.type().isStructType()) {
+        // We may need to prune this attribute to only refer to our expected schema

Review comment:
       Yep, this basically creates the target expression from the attribute we are pruning but uses the pruned data type instead of the original read data type




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

szehon-ho commented on pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#issuecomment-893602102


   Yea , great job @RussellSpitzer and @rdblue , thanks again !  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#discussion_r681331863



##########
File path: core/src/main/java/org/apache/iceberg/StaticDataTask.java
##########
@@ -30,18 +30,26 @@
 import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
 import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
 import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.util.StructProjection;
 
 class StaticDataTask implements DataTask {
 
-  static <T> DataTask of(InputFile metadata, Iterable<T> values, Function<T, Row> transform) {
+  static <T> DataTask of(InputFile metadata, Iterable<T> values, Function<T, Row> transform,
+      Schema original, Schema projected) {

Review comment:
       Nit: I'm not sure if it's just me, but I'd normally place lambda function arguments at the end of the list. Since this is internal, we can move these just after `InputFile`.
   
   Also, is original always the table schema? If so, maybe we should use `tableSchema` instead?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#issuecomment-893566096


   Thanks @rdblue for the discussion and review :) Onto the next one, https://github.com/apache/iceberg/pull/1744


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#discussion_r680278607



##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/RowDataReader.java
##########
@@ -186,28 +186,29 @@ protected Schema tableSchema() {
   }
 
   private CloseableIterable<InternalRow> newDataIterable(DataTask task, Schema readSchema) {
-    StructInternalRow row = new StructInternalRow(tableSchema.asStruct());
+    Schema taskSchema = task.schema() == null ? tableSchema : task.schema();
+    StructInternalRow row = new StructInternalRow(taskSchema.asStruct());

Review comment:
       I think I see what's going on. For the entries table, Spark will push the projection into the scan and because we are reading manifests as the data files, we actually apply that projection when reading in the data task (the `data_file` schema is passed into each `ManifestReadTask`).
   
   In theory, we should be able to use `expectedSchema` here instead of `tableSchema` to handle this because the expected schema should match the schema that gets pushed down by Spark. But in practice there are two problems:
   1. `ManifestReader` will only accept a file projection because it needs to return live entries and so it always projects all fields of `manifest_entry`
   2. Some tables use this Spark projection to avoid needing to project rows in Iceberg. For example, rows in the history table are never projected because we didn't want to implement a projection in Iceberg when it was built.
   
   I see how this is a reasonable work-around, but I think we should fix some of the debt instead of moving ahead with it. We should make sure that tasks produce the `expectedSchema` instead of trying to figure out what schema the task produces.
   
   I would solve this by using `StructProjection` to project rows in tables like the history table that return full rows. And I would also use it to prune out the additional top-level fields of `manifest_entry`. I think if you do that, then there will be no need to add a task-specific schema. *And*, we should be able to remove the Spark projection here, which exists because of the history table setup. Now that we have an Iceberg projection there is no need to continue doing that.
   
   Does that make sense?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer merged pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

RussellSpitzer merged pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#discussion_r682932438



##########
File path: api/src/main/java/org/apache/iceberg/util/StructProjection.java
##########
@@ -82,8 +85,31 @@ private StructProjection(StructType structType, StructType projection) {
                   dataField.type().asStructType(), projectedField.type().asStructType());
               break;
             case MAP:
+              MapType projectedMap = projectedField.type().asMapType();
+              MapType originalMap = dataField.type().asMapType();
+
+              boolean keyProjectable = !projectedMap.keyType().isStructType() ||
+                  projectedMap.keyType().equals(originalMap.keyType());
+              boolean valueProjectable = !projectedMap.valueType().isStructType() ||

Review comment:
       Swapped to !isNestedType




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a change in pull request #2877: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

szehon-ho commented on a change in pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#discussion_r677733110



##########
File path: spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java
##########
@@ -144,6 +144,30 @@ public void testEntriesTable() throws Exception {
   }
 
   @Test
+  public void testEntriesTableDataFilePrune() throws Exception {
+    TableIdentifier tableIdentifier = TableIdentifier.of("db", "entries_test");
+    Table table = createTable(tableIdentifier, SCHEMA, PartitionSpec.unpartitioned());
+
+    List<SimpleRecord> records = Lists.newArrayList(new SimpleRecord(1, "1"));
+
+    Dataset<Row> inputDf = spark.createDataFrame(records, SimpleRecord.class);
+    inputDf.select("id", "data").write()
+        .format("iceberg")
+        .mode("append")
+        .save(loadLocation(tableIdentifier));
+
+    table.refresh();
+
+    List<Row> actual = spark.read()
+        .format("iceberg")
+        .load(loadLocation(tableIdentifier, "entries"))
+        .select("data_file.file_path")
+        .collectAsList();
+
+    Assert.assertEquals("Should have a single entry", 1,  actual.size());
+    Assert.assertEquals("Should only have file_path", 0, actual.get(0).fieldIndex("file_path"));

Review comment:
       Nit/Opt: Not sure if this message matches the assert logic.
   
   Maybe :  
   assertEquals("Should select one field", actual.get(0).schema().fieldNames().length, 1)
   assertEquals("Selected field should be file_path", actual.get(0).schema().fieldNames()[0], "file_path")
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#discussion_r681323979



##########
File path: api/src/main/java/org/apache/iceberg/util/StructProjection.java
##########
@@ -83,7 +83,7 @@ private StructProjection(StructType structType, StructType projection) {
               break;
             case MAP:
             case LIST:
-              throw new IllegalArgumentException(String.format("Cannot project list or map field: %s", projectedField));
+              // TODO Figure this out

Review comment:
       What about allowing the projection if the the fields are primitives or if the entire struct is projected? That would cover the cases that are currently supported and avoid introducing a new pruning bug to replace the one you're fixing (where nested structs don't match the requested struct schema).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#discussion_r680282436



##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/RowDataReader.java
##########
@@ -186,28 +186,29 @@ protected Schema tableSchema() {
   }
 
   private CloseableIterable<InternalRow> newDataIterable(DataTask task, Schema readSchema) {
-    StructInternalRow row = new StructInternalRow(tableSchema.asStruct());
+    Schema taskSchema = task.schema() == null ? tableSchema : task.schema();
+    StructInternalRow row = new StructInternalRow(taskSchema.asStruct());

Review comment:
       I think that's a good solution as well, I was worried about redoing the whole setup for every data task here but that makes sense to me. Basically we will only be projecting within our original row construction and not doing the projection in spark.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2877: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#discussion_r677621959



##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/RowDataReader.java
##########
@@ -194,20 +195,32 @@ protected Schema tableSchema() {
   }
 
   private static UnsafeProjection projection(Schema finalSchema, Schema readSchema) {
-    StructType struct = SparkSchemaUtil.convert(readSchema);
+    StructType readStruct = SparkSchemaUtil.convert(readSchema);

Review comment:
       Renamed these variables because too many things were called "struct, or ref" and I was getting confused which was which




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#issuecomment-887973321


   @rdblue Attempted to fix this from the other direction, I don't like this because we have to muck about with the DataTask api. But our underlying issue is a DataTask may read an element with a struct layout different than the table it originated from based on
   scan time projection. In this example the entry table when scanned produces tasks with a different schema (because of datafile pruning) than it originally reports as a table itself.
   
   If you have another approach I'm all ears, but I think we will run into this again if we want to allow other metadata tables like the
   pure files table to prune in a similar way.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#issuecomment-893595776


   Awesome work here! It's great to get rid of that ugly reflection call to create a Spark projection!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#discussion_r682918017



##########
File path: api/src/main/java/org/apache/iceberg/util/StructProjection.java
##########
@@ -82,8 +85,31 @@ private StructProjection(StructType structType, StructType projection) {
                   dataField.type().asStructType(), projectedField.type().asStructType());
               break;
             case MAP:
+              MapType projectedMap = projectedField.type().asMapType();
+              MapType originalMap = dataField.type().asMapType();
+
+              boolean keyProjectable = !projectedMap.keyType().isStructType() ||
+                  projectedMap.keyType().equals(originalMap.keyType());
+              boolean valueProjectable = !projectedMap.valueType().isStructType() ||

Review comment:
       It looks like this will support things like `map<string, map<string, int>>`. I don't think that will be a problem.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a change in pull request #2877: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

szehon-ho commented on a change in pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#discussion_r677733110



##########
File path: spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java
##########
@@ -144,6 +144,30 @@ public void testEntriesTable() throws Exception {
   }
 
   @Test
+  public void testEntriesTableDataFilePrune() throws Exception {
+    TableIdentifier tableIdentifier = TableIdentifier.of("db", "entries_test");
+    Table table = createTable(tableIdentifier, SCHEMA, PartitionSpec.unpartitioned());
+
+    List<SimpleRecord> records = Lists.newArrayList(new SimpleRecord(1, "1"));
+
+    Dataset<Row> inputDf = spark.createDataFrame(records, SimpleRecord.class);
+    inputDf.select("id", "data").write()
+        .format("iceberg")
+        .mode("append")
+        .save(loadLocation(tableIdentifier));
+
+    table.refresh();
+
+    List<Row> actual = spark.read()
+        .format("iceberg")
+        .load(loadLocation(tableIdentifier, "entries"))
+        .select("data_file.file_path")
+        .collectAsList();
+
+    Assert.assertEquals("Should have a single entry", 1,  actual.size());
+    Assert.assertEquals("Should only have file_path", 0, actual.get(0).fieldIndex("file_path"));

Review comment:
       Opt: Not sure if this message matches the assert logic.
   
   Maybe :  
   assertEquals("Should select one field", actual.get(0).schema().fieldNames().length, 1)
   assertEquals("Selected field should be file_path", actual.get(0).schema().fieldNames()[0], "file_path")
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on pull request #2877: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#issuecomment-887776855


   @RussellSpitzer, can you update the description with a high-level summary of what you're changing?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#discussion_r677811751



##########
File path: spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java
##########
@@ -144,6 +144,30 @@ public void testEntriesTable() throws Exception {
   }
 
   @Test
+  public void testEntriesTableDataFilePrune() throws Exception {
+    TableIdentifier tableIdentifier = TableIdentifier.of("db", "entries_test");
+    Table table = createTable(tableIdentifier, SCHEMA, PartitionSpec.unpartitioned());
+
+    List<SimpleRecord> records = Lists.newArrayList(new SimpleRecord(1, "1"));
+
+    Dataset<Row> inputDf = spark.createDataFrame(records, SimpleRecord.class);
+    inputDf.select("id", "data").write()
+        .format("iceberg")
+        .mode("append")
+        .save(loadLocation(tableIdentifier));
+
+    table.refresh();
+
+    List<Row> actual = spark.read()
+        .format("iceberg")
+        .load(loadLocation(tableIdentifier, "entries"))
+        .select("data_file.file_path")
+        .collectAsList();
+
+    Assert.assertEquals("Should have a single entry", 1,  actual.size());
+    Assert.assertEquals("Should only have file_path", 0, actual.get(0).fieldIndex("file_path"));

Review comment:
       SGTM, I really just wanted a check to make sure this didn't crash. Maybe I should add some additional pruning tests?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#discussion_r682918017



##########
File path: api/src/main/java/org/apache/iceberg/util/StructProjection.java
##########
@@ -82,8 +85,31 @@ private StructProjection(StructType structType, StructType projection) {
                   dataField.type().asStructType(), projectedField.type().asStructType());
               break;
             case MAP:
+              MapType projectedMap = projectedField.type().asMapType();
+              MapType originalMap = dataField.type().asMapType();
+
+              boolean keyProjectable = !projectedMap.keyType().isStructType() ||
+                  projectedMap.keyType().equals(originalMap.keyType());
+              boolean valueProjectable = !projectedMap.valueType().isStructType() ||

Review comment:
       It looks like this will support things like `map<string, map<string, int>>`. I don't think that will be a problem.

##########
File path: api/src/main/java/org/apache/iceberg/util/StructProjection.java
##########
@@ -82,8 +85,31 @@ private StructProjection(StructType structType, StructType projection) {
                   dataField.type().asStructType(), projectedField.type().asStructType());
               break;
             case MAP:
+              MapType projectedMap = projectedField.type().asMapType();
+              MapType originalMap = dataField.type().asMapType();
+
+              boolean keyProjectable = !projectedMap.keyType().isStructType() ||
+                  projectedMap.keyType().equals(originalMap.keyType());
+              boolean valueProjectable = !projectedMap.valueType().isStructType() ||
+                  projectedMap.valueType().equals(originalMap.valueType());
+
+              Preconditions.checkArgument(keyProjectable && valueProjectable,
+                  "Cannot perform a projection of a map unless key and value types are primitive or a " +
+                      "struct which is fully projected. Trying to project %s out of %s", projectedField, dataField);
+              nestedProjections[pos] = null;
+              break;
             case LIST:
-              throw new IllegalArgumentException(String.format("Cannot project list or map field: %s", projectedField));
+              ListType projectedList = projectedField.type().asListType();
+              ListType originalList = dataField.type().asListType();
+
+              boolean elementProjectable = !projectedList.elementType().isStructType() ||
+                  projectedList.elementType().equals(originalList.elementType());
+
+              Preconditions.checkArgument(elementProjectable,
+                  "Cannot perform a projection of a list unless it's element is a primitive or a struct which is " +

Review comment:
       What about something shorter, like "Cannot project a partial list element struct: %s from %s"?

##########
File path: spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java
##########
@@ -724,6 +875,64 @@ public void testManifestsTable() {
     TestHelpers.assertEqualsSafe(manifestTable.schema().asStruct(), expected.get(0), actual.get(0));
   }
 
+  @Test
+  public void testPruneManifestsTable() {
+    TableIdentifier tableIdentifier = TableIdentifier.of("db", "manifests_test");
+    Table table = createTable(tableIdentifier, SCHEMA, PartitionSpec.builderFor(SCHEMA).identity("id").build());
+    Table manifestTable = loadTable(tableIdentifier, "manifests");
+    Dataset<Row> df1 = spark.createDataFrame(
+        Lists.newArrayList(new SimpleRecord(1, "a"), new SimpleRecord(null, "b")), SimpleRecord.class);
+
+    df1.select("id", "data").write()
+        .format("iceberg")
+        .mode("append")
+        .save(loadLocation(tableIdentifier));
+
+    AssertHelpers.assertThrows("Can't prune struct inside list", SparkException.class,
+        "Cannot perform a projection of a list",
+        () -> spark.read()
+            .format("iceberg")
+            .load(loadLocation(tableIdentifier, "manifests"))
+            .select("partition_spec_id", "path", "partition_summaries.contains_null")
+            .collectAsList());
+
+    Dataset<Row> actualDf = spark.read()
+        .format("iceberg")
+        .load(loadLocation(tableIdentifier, "manifests"))
+        .select("partition_spec_id", "path", "partition_summaries");
+
+    Schema projectedSchema = SparkSchemaUtil.convert(actualDf.schema());
+
+    List<Row> actual = spark.read()
+        .format("iceberg")
+        .load(loadLocation(tableIdentifier, "manifests"))
+        .select("partition_spec_id", "path", "partition_summaries")
+        .collectAsList();
+
+    table.refresh();
+
+    GenericRecordBuilder builder = new GenericRecordBuilder(AvroSchemaUtil.convert(projectedSchema.asStruct()));
+    GenericRecordBuilder summaryBuilder = new GenericRecordBuilder(AvroSchemaUtil.convert(
+        projectedSchema.findType("partition_summaries.element").asStructType(), "partition_summary"));
+    List<GenericData.Record> expected = Lists.transform(table.currentSnapshot().allManifests(), manifest ->
+        builder.set("partition_spec_id", manifest.partitionSpecId())
+            .set("path", manifest.path())
+            .set("partition_summaries", Lists.transform(manifest.partitions(), partition ->
+                summaryBuilder
+                    .set("contains_null", true)
+                    .set("contains_nan", false)
+                    .set("lower_bound", "1")
+                    .set("upper_bound", "1")
+                    .build()
+            ))
+            .build()
+    );
+
+    Assert.assertEquals("Manifests table should have one manifest row", 1, actual.size());
+    TestHelpers.assertEqualsSafe(projectedSchema.asStruct(), expected.get(0), actual.get(0));
+  }
+
+

Review comment:
       Nit: unnecessary newline.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#issuecomment-887947654


   Ok so trying to fix this from the Source side, the issue here for Entries table is although it reports a schema of
   ```
   status, snapshot_id, sequence_number, data_file <Struct with 15 fields>
   ```
   
   The manifest reader is allowed to project within data file which means the actual GenericManifestFiles it creates have a schema of
   ```
   status, snapshot_id, sequence_number, data_file < pruned columns>
   ```
   
   This means the table schema as set in the read tasks is incorrect and does not match what is actually in the read data.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer edited a comment on pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

RussellSpitzer edited a comment on pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#issuecomment-887882956


   I added another test and found an issue with my fix. We create a StructInternalRow with a "type" and it's data "struct"
   When we actually prune the projected record's "type" does not match the struct layout, this is an issue for Map, Array, Struct types.
   
   For example, if we make a BaseFile with a projection we end up with a mapping of ordinal to physical entry in the row, but the StructInternal row does *not* do the same mapping when looking up types. So although my fix works for all non parameterized lookups, it fails on parameterized types since the parameterized types are looked up based on the original layout and not the pruned one.
   
   Ie
   StructInternalRow(TableLayout) stores
   ```
   Types = ( int, string, int, list)
   data = ([1,2], "hello") with Map ( 0 -> 3, 2 -> 1)
   ```
   Since the type lookup doesn't know about the projection it is incorrect ... trying to figure out a fix without breaking everything
   
   This isn't an issue for setters or getters which know their type, since they never touch the "Types" struct.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer edited a comment on pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

RussellSpitzer edited a comment on pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#issuecomment-887947654


   Ok so trying to fix this from the Source side, the issue here for Entries table is although it reports a schema of
   ```
   status, snapshot_id, sequence_number, data_file <Struct with 15 fields>
   ```
   
   The manifest reader is allowed to project within data file which means the actual GenericManifestFiles it creates have a schema of
   ```
   status, snapshot_id, sequence_number, data_file < pruned columns>
   ```
   
   This means the table schema as set in the read tasks is incorrect and does not match what is actually in the read data.
   
   Creating GenericManfiestFile with projection of data file column
   
   https://github.com/apache/iceberg/blob/83ebd4ed57254822ca26ef9b7a5ea6f528da8b34/core/src/main/java/org/apache/iceberg/ManifestEntriesTable.java#L141-L142
   
   Creating Spark StructInternalRow representation using incorrect schema (full table schema not projected schema used in GenericManfiestFile)
   
   https://github.com/apache/iceberg/blob/c69da8a8c1c2f99de3a1b826514775f0f07bde72/spark/src/main/java/org/apache/iceberg/spark/source/RowDataReader.java#L189


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#issuecomment-887845285


   > @RussellSpitzer, can you update the description with a high-level summary of what you're changing?
   
   Added! Sorry I have been posting notes in a lot of different places. The other option for the solution is to figure out our read schema correctly the first time based on what is being pruned rather than just using the table schema but I think this is a bit simpler.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a change in pull request #2877: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

szehon-ho commented on a change in pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#discussion_r677733110



##########
File path: spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java
##########
@@ -144,6 +144,30 @@ public void testEntriesTable() throws Exception {
   }
 
   @Test
+  public void testEntriesTableDataFilePrune() throws Exception {
+    TableIdentifier tableIdentifier = TableIdentifier.of("db", "entries_test");
+    Table table = createTable(tableIdentifier, SCHEMA, PartitionSpec.unpartitioned());
+
+    List<SimpleRecord> records = Lists.newArrayList(new SimpleRecord(1, "1"));
+
+    Dataset<Row> inputDf = spark.createDataFrame(records, SimpleRecord.class);
+    inputDf.select("id", "data").write()
+        .format("iceberg")
+        .mode("append")
+        .save(loadLocation(tableIdentifier));
+
+    table.refresh();
+
+    List<Row> actual = spark.read()
+        .format("iceberg")
+        .load(loadLocation(tableIdentifier, "entries"))
+        .select("data_file.file_path")
+        .collectAsList();
+
+    Assert.assertEquals("Should have a single entry", 1,  actual.size());
+    Assert.assertEquals("Should only have file_path", 0, actual.get(0).fieldIndex("file_path"));

Review comment:
       Opt: Not sure if this message matches the assert logic.
   Nit/Opt:
   Maybe :  
   assertEquals("Should select one field", actual.get(0).schema().fieldNames().length, 1)
   assertEquals("Selected field should be file_path", actual.get(0).schema().fieldNames()[0], "file_path")
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a change in pull request #2877: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

szehon-ho commented on a change in pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#discussion_r677735298



##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/RowDataReader.java
##########
@@ -194,20 +194,32 @@ protected Schema tableSchema() {
   }
 
   private static UnsafeProjection projection(Schema finalSchema, Schema readSchema) {
-    StructType struct = SparkSchemaUtil.convert(readSchema);
+    StructType readStruct = SparkSchemaUtil.convert(readSchema);
 
-    List<AttributeReference> refs = JavaConverters.seqAsJavaListConverter(struct.toAttributes()).asJava();
-    List<Attribute> attrs = Lists.newArrayListWithExpectedSize(struct.fields().length);
+    List<AttributeReference> readReferences = JavaConverters.seqAsJavaListConverter(readStruct.toAttributes()).asJava();
+    List<Attribute> attrs = Lists.newArrayListWithExpectedSize(readStruct.fields().length);
     List<org.apache.spark.sql.catalyst.expressions.Expression> exprs =
-        Lists.newArrayListWithExpectedSize(struct.fields().length);
+        Lists.newArrayListWithExpectedSize(readStruct.fields().length);
 
-    for (AttributeReference ref : refs) {
+    for (AttributeReference ref : readReferences) {
       attrs.add(ref.toAttribute());
     }
 
     for (Types.NestedField field : finalSchema.columns()) {
-      int indexInReadSchema = struct.fieldIndex(field.name());
-      exprs.add(refs.get(indexInReadSchema));
+      int indexInReadSchema = readStruct.fieldIndex(field.name());
+      if (field.type().isStructType()) {
+        // We may need to prune this attribute to only refer to our expected schema

Review comment:
       Curious question: probably don't have this in our case, but if we have another layer of nesting does it capture it?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#discussion_r682922220



##########
File path: spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java
##########
@@ -724,6 +875,64 @@ public void testManifestsTable() {
     TestHelpers.assertEqualsSafe(manifestTable.schema().asStruct(), expected.get(0), actual.get(0));
   }
 
+  @Test
+  public void testPruneManifestsTable() {
+    TableIdentifier tableIdentifier = TableIdentifier.of("db", "manifests_test");
+    Table table = createTable(tableIdentifier, SCHEMA, PartitionSpec.builderFor(SCHEMA).identity("id").build());
+    Table manifestTable = loadTable(tableIdentifier, "manifests");
+    Dataset<Row> df1 = spark.createDataFrame(
+        Lists.newArrayList(new SimpleRecord(1, "a"), new SimpleRecord(null, "b")), SimpleRecord.class);
+
+    df1.select("id", "data").write()
+        .format("iceberg")
+        .mode("append")
+        .save(loadLocation(tableIdentifier));
+
+    AssertHelpers.assertThrows("Can't prune struct inside list", SparkException.class,
+        "Cannot perform a projection of a list",
+        () -> spark.read()
+            .format("iceberg")
+            .load(loadLocation(tableIdentifier, "manifests"))
+            .select("partition_spec_id", "path", "partition_summaries.contains_null")
+            .collectAsList());
+
+    Dataset<Row> actualDf = spark.read()
+        .format("iceberg")
+        .load(loadLocation(tableIdentifier, "manifests"))
+        .select("partition_spec_id", "path", "partition_summaries");
+
+    Schema projectedSchema = SparkSchemaUtil.convert(actualDf.schema());
+
+    List<Row> actual = spark.read()
+        .format("iceberg")
+        .load(loadLocation(tableIdentifier, "manifests"))
+        .select("partition_spec_id", "path", "partition_summaries")
+        .collectAsList();
+
+    table.refresh();
+
+    GenericRecordBuilder builder = new GenericRecordBuilder(AvroSchemaUtil.convert(projectedSchema.asStruct()));
+    GenericRecordBuilder summaryBuilder = new GenericRecordBuilder(AvroSchemaUtil.convert(
+        projectedSchema.findType("partition_summaries.element").asStructType(), "partition_summary"));
+    List<GenericData.Record> expected = Lists.transform(table.currentSnapshot().allManifests(), manifest ->
+        builder.set("partition_spec_id", manifest.partitionSpecId())
+            .set("path", manifest.path())
+            .set("partition_summaries", Lists.transform(manifest.partitions(), partition ->
+                summaryBuilder
+                    .set("contains_null", true)
+                    .set("contains_nan", false)
+                    .set("lower_bound", "1")
+                    .set("upper_bound", "1")
+                    .build()
+            ))
+            .build()
+    );
+
+    Assert.assertEquals("Manifests table should have one manifest row", 1, actual.size());
+    TestHelpers.assertEqualsSafe(projectedSchema.asStruct(), expected.get(0), actual.get(0));
+  }
+
+

Review comment:
       Nit: unnecessary newline.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on pull request #2877: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#issuecomment-887662619


   Solves #2783 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2877: Spark: Fix nested struct pruning

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2877:
URL: https://github.com/apache/iceberg/pull/2877#discussion_r681323342



##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/RowDataReader.java
##########
@@ -186,28 +186,27 @@ protected Schema tableSchema() {
   }
 
   private CloseableIterable<InternalRow> newDataIterable(DataTask task, Schema readSchema) {
-    StructInternalRow row = new StructInternalRow(tableSchema.asStruct());
+    StructInternalRow row = new StructInternalRow(readSchema.asStruct());
     CloseableIterable<InternalRow> asSparkRows = CloseableIterable.transform(
         task.asDataTask().rows(), row::setStruct);
-    return CloseableIterable.transform(
-        asSparkRows, APPLY_PROJECTION.bind(projection(readSchema, tableSchema))::invoke);
+    return asSparkRows;

Review comment:
       I think you can also remove `APPLY_PROJECTION`. This is the only place it is used.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org