
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/07/05 07:17:56 UTC

[GitHub] [iceberg] szehon-ho opened a new issue #2783: Metadata Table Empty Projection -Unknown type for int field. Type name: java.lang.String

szehon-ho opened a new issue #2783:
URL: https://github.com/apache/iceberg/issues/2783


   As of Spark 3, it seems more predicates are pushed down.  For example, take this join:
   
   ```
   Dataset<Row> stringDs = spark.createDataset(Arrays.asList("my_path"), Encoders.STRING())
       .toDF("file_path");
   
   // Load the "entries" metadata table of the test table through the Spark catalog.
   SparkCatalog catalog = (SparkCatalog) spark.sessionState().catalogManager().catalog(catalogName);
   String[] tableIdentifiers = tableName.split("\\.");
   Identifier metaId = Identifier.of(
       new String[]{tableIdentifiers[1], tableIdentifiers[2]}, "entries");
   SparkTable metaTable = catalog.loadTable(metaId);
   Dataset<Row> entriesDs = Dataset.ofRows(spark, DataSourceV2Relation.create(metaTable, Some.apply(catalog), Some.apply(
       metaId)));
   
   // Join on a nested field of the data_file struct; no row matches "my_path".
   Column joinCond = entriesDs.col("data_file.file_path").equalTo(stringDs.col("file_path"));
   Dataset<Row> res = entriesDs.join(stringDs, joinCond);
   boolean empty = res.isEmpty();
   Assert.assertTrue(empty);
   ```
   
   It results in the following IllegalStateException:
   ```
   Driver stacktrace:
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 5) (10.0.0.45 executor driver): java.lang.IllegalStateException: Unknown type for int field. Type name: java.lang.String
   	at org.apache.iceberg.spark.source.StructInternalRow.getInt(StructInternalRow.java:131)
   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_0$(Unknown Source)
   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   	at org.apache.iceberg.common.DynMethods$UnboundMethod.invokeChecked(DynMethods.java:65)
   	at org.apache.iceberg.common.DynMethods$UnboundMethod.invoke(DynMethods.java:77)
   	at org.apache.iceberg.common.DynMethods$BoundMethod.invoke(DynMethods.java:180)
   	at org.apache.iceberg.spark.source.RowDataReader.lambda$newDataIterable$3(RowDataReader.java:193)
   	at org.apache.iceberg.io.CloseableIterable$4$1.next(CloseableIterable.java:113)
   	at org.apache.iceberg.io.FilterIterator.advance(FilterIterator.java:66)
   	at org.apache.iceberg.io.FilterIterator.hasNext(FilterIterator.java:50)
   	at org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:87)
   	at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
   	at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
   	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
   	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
   	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
   	at org.apache.spark.scheduler.Task.run(Task.scala:131)
   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
   	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
   	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer closed issue #2783: Metadata Table Empty Projection -Unknown type for int field. Type name: java.lang.String

Posted by GitBox <gi...@apache.org>.

RussellSpitzer closed issue #2783:
URL: https://github.com/apache/iceberg/issues/2783


   



[GitHub] [iceberg] szehon-ho edited a comment on issue #2783: Metadata Table Empty Projection -Unknown type for int field. Type name: java.lang.String

Posted by GitBox <gi...@apache.org>.

szehon-ho edited a comment on issue #2783:
URL: https://github.com/apache/iceberg/issues/2783#issuecomment-873875087







[GitHub] [iceberg] szehon-ho commented on issue #2783: Metadata Table Empty Projection -Unknown type for int field. Type name: java.lang.String

Posted by GitBox <gi...@apache.org>.

szehon-ho commented on issue #2783:
URL: https://github.com/apache/iceberg/issues/2783#issuecomment-873869272


   Adding a reproduction test (can run in spark3-extensions test case, for example):
   ```
   package org.apache.iceberg.spark.extensions;
   
   import com.google.common.collect.Lists;
   import org.apache.iceberg.spark.SparkCatalog;
   import org.apache.iceberg.spark.source.SimpleRecord;
   import org.apache.iceberg.spark.source.SparkTable;
   import org.apache.spark.sql.Column;
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Encoders;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.connector.catalog.Identifier;
   import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation;
   import org.junit.After;
   import org.junit.Assert;
   import org.junit.Test;
   import scala.Some;
   
   import java.util.Arrays;
   import java.util.List;
   import java.util.Map;
   
   public class TestSparkMetadataTable extends SparkExtensionsTestBase {
   
     public TestSparkMetadataTable(String catalogName, String implementation, Map<String, String> config) {
       super(catalogName, implementation, config);
     }
   
     @Test
     public void testCountEntriesPartitionedTable() throws Exception {
       // init load
       List<SimpleRecord> records = Lists.newArrayList(new SimpleRecord(1, "1"));
       Dataset<Row> inputDf = spark.createDataFrame(records, SimpleRecord.class);
       inputDf.writeTo(tableName).create();
   
   
       Dataset<Row> stringDs = spark.createDataset(Arrays.asList("my_path"), Encoders.STRING())
           .toDF("file_path");
   
       SparkCatalog catalog = (SparkCatalog) spark.sessionState().catalogManager().catalog(catalogName);
       String[] tableIdentifiers = tableName.split("\\.");
       Identifier metaId = Identifier.of(
           new String[]{tableIdentifiers[1], tableIdentifiers[2]}, "entries");
       SparkTable metaTable = catalog.loadTable(metaId);
       Dataset<Row> entriesDs = Dataset.ofRows(spark, DataSourceV2Relation.create(metaTable, Some.apply(catalog), Some.apply(
           metaId)));
   
       Column joinCond = entriesDs.col("data_file.file_path").equalTo(stringDs.col("file_path"));
       Dataset<Row> res = entriesDs.join(stringDs, joinCond);
       boolean empty = res.isEmpty();
       Assert.assertTrue(empty);
     }
   
     @After
     public void dropTables() {
       sql("DROP TABLE IF EXISTS %s", tableName);
     }
   }
   ```



[GitHub] [iceberg] szehon-ho commented on issue #2783: Metadata Table Empty Projection -Unknown type for int field. Type name: java.lang.String

Posted by GitBox <gi...@apache.org>.

szehon-ho commented on issue #2783:
URL: https://github.com/apache/iceberg/issues/2783#issuecomment-873885380


   @rdblue @RussellSpitzer FYI, in case you have any ideas about this.
   
   My first thought is to change this method: https://github.com/apache/iceberg/blob/master/spark/src/main/java/org/apache/iceberg/spark/source/StructInternalRow.java#L78 so that, if the struct's schema does not match the type, it sets a flag on the struct to skip projection.
   
   We would need to add a method (StructLike.useProjection(boolean)) to the interface and have BaseFile implement it, so this may not be the best approach. I don't have much context in this area, so the idea may not be a good one.
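
To make the idea above concrete, here is a rough, self-contained sketch of a struct that can bypass its projection mapping. It is not Iceberg code: `SketchStruct` is a hypothetical stand-in for BaseFile, the sample values are made up, and `useProjection(boolean)` is only the method proposed in this comment, which does not exist on StructLike.

```java
import java.util.Arrays;
import java.util.List;

public class UseProjectionSketch {
  // Hypothetical stand-in for BaseFile; not an Iceberg class.
  static class SketchStruct {
    private final List<Object> values;       // values in full-schema order
    private final int[] fromProjectionPos;   // projected position -> full-schema position
    private boolean useProjection = true;

    SketchStruct(List<Object> values, int[] fromProjectionPos) {
      this.values = values;
      this.fromProjectionPos = fromProjectionPos;
    }

    // The method proposed in this comment (an assumption, not an existing API).
    void useProjection(boolean enabled) {
      this.useProjection = enabled;
    }

    // Translates the position through the mapping only while projection is enabled.
    Object get(int pos) {
      int actual = useProjection ? fromProjectionPos[pos] : pos;
      return values.get(actual);
    }
  }

  public static void main(String[] args) {
    SketchStruct struct = new SketchStruct(
        Arrays.asList(0, "data/file.parquet", "PARQUET"), new int[] {1, 2});

    // With projection on, position 0 resolves to full-schema position 1 (a string).
    System.out.println(struct.get(0).getClass().getSimpleName()); // prints "String"

    // With projection off, position 0 is the full-schema position 0 (an int).
    struct.useProjection(false);
    System.out.println(struct.get(0).getClass().getSimpleName()); // prints "Integer"
  }
}
```

The flag keeps the mapping intact for callers that pass projected positions, while a caller that addresses fields by their original positions can switch it off.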



[GitHub] [iceberg] szehon-ho edited a comment on issue #2783: Metadata Table Empty Projection -Unknown type for int field. Type name: java.lang.String

Posted by GitBox <gi...@apache.org>.

szehon-ho edited a comment on issue #2783:
URL: https://github.com/apache/iceberg/issues/2783#issuecomment-873875087


   Preliminary analysis: The BaseFile class uses pruning logic, keeping a mapping called 'fromProjectionPos' that maps indexes in a projection (a subset of its columns) to the real indexes in its column list.
   
   I.e., if the projected array has three elements, which are fields 1, 2, and 10 of the original schema respectively, fromProjectionPos becomes {1, 2, 10}.
   
   This is usually fine, but the code generated by Spark 3 uses the original index to get all the fields. Its entry point is StructInternalRow: https://github.com/apache/iceberg/blob/master/spark/src/main/java/org/apache/iceberg/spark/source/StructInternalRow.java#L124 (here 'struct' is the BaseDataFile with the pruning mapping).
   
   It calls getInt(0) expecting an int, as per the real column at index 0, but due to the mapping we return the value at index 1, which is actually a string, hence the error.




[GitHub] [iceberg] szehon-ho edited a comment on issue #2783: Metadata Table Empty Projection -Unknown type for int field. Type name: java.lang.String

Posted by GitBox <gi...@apache.org>.

szehon-ho edited a comment on issue #2783:
URL: https://github.com/apache/iceberg/issues/2783#issuecomment-873875087


   Preliminary analysis: The BaseFile class uses pruning logic, keeping a mapping called 'fromProjectionPos' that maps indexes in a projection (a subset of its columns) to the real indexes in its column list.
   
   I.e., if the projected array has three elements, which are fields 1, 2, and 10 of the original schema respectively, fromProjectionPos becomes {1, 2, 10}.
   
   This is usually fine, but the code generated by Spark 3 uses the original index to get all the fields. Its entry point is StructInternalRow: https://github.com/apache/iceberg/blob/master/spark/src/main/java/org/apache/iceberg/spark/source/StructInternalRow.java#L124 (here 'struct' is the BaseDataFile with the pruning mapping, BUT Spark initializes 'type' as the original (unpruned) list of fields).
   
   The call getInt(0) expects an int, as per the original schema at index 0, but due to the mapping we return the value at index 1, which is actually a string, hence the error.
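
The mismatch described above can be reproduced in miniature without Spark or Iceberg. The sketch below is only an illustration under stated assumptions: `ProjectionMismatchSketch`, `FULL_VALUES`, and the sample values are hypothetical, and `FROM_PROJECTION_POS` merely plays the role of 'fromProjectionPos'. It is not the BaseFile or StructInternalRow implementation.

```java
import java.util.Arrays;
import java.util.List;

public class ProjectionMismatchSketch {
  // Hypothetical full (unpruned) struct: position 0 holds an int, position 1 a string.
  static final List<Object> FULL_VALUES =
      Arrays.asList(0, "data/file.parquet", "PARQUET");

  // Plays the role of fromProjectionPos: projected position -> full-schema position.
  // Here the projection keeps only full-schema fields 1 and 2.
  static final int[] FROM_PROJECTION_POS = {1, 2};

  // The pruned struct interprets the caller's position as a projected position.
  static Object getProjected(int pos) {
    return FULL_VALUES.get(FROM_PROJECTION_POS[pos]);
  }

  // A typed accessor that trusts the caller's unpruned type information for pos.
  static int getInt(int pos) {
    Object value = getProjected(pos);
    if (value instanceof Integer) {
      return (Integer) value;
    }
    throw new IllegalStateException(
        "Unknown type for int field. Type name: " + value.getClass().getName());
  }

  public static void main(String[] args) {
    try {
      // The caller means full-schema position 0 (an int), but the mapping
      // redirects it to full-schema position 1 (a string).
      getInt(0);
    } catch (IllegalStateException e) {
      // prints "Unknown type for int field. Type name: java.lang.String"
      System.out.println(e.getMessage());
    }
  }
}
```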



[GitHub] [iceberg] szehon-ho edited a comment on issue #2783: Metadata Table Empty Projection -Unknown type for int field. Type name: java.lang.String

Posted by GitBox <gi...@apache.org>.

szehon-ho edited a comment on issue #2783:
URL: https://github.com/apache/iceberg/issues/2783#issuecomment-873875087


   Preliminary analysis: The BaseFile class uses pruning logic, keeping a mapping called 'fromProjectionPos' that maps indexes in a projection (a subset of its columns) to the real indexes in its column list.
   
   I.e., if the projected array has three elements, which are fields 1, 2, and 10 of the original schema respectively, fromProjectionPos becomes {1, 2, 10}.
   
   This is usually fine, but the code generated by Spark 3 uses the original index to get all the fields. Its entry point is StructInternalRow: https://github.com/apache/iceberg/blob/master/spark/src/main/java/org/apache/iceberg/spark/source/StructInternalRow.java#L124 (here struct is the BaseDataFile). It calls getInt(0) expecting an int, as per the real column at index 0, but due to the mapping we return the value at index 1, which is actually a string, hence the error.



[GitHub] [iceberg] szehon-ho edited a comment on issue #2783: Metadata Table Empty Projection -Unknown type for int field. Type name: java.lang.String

Posted by GitBox <gi...@apache.org>.

szehon-ho edited a comment on issue #2783:
URL: https://github.com/apache/iceberg/issues/2783#issuecomment-873869272


   Adding a reproduction test (can run in spark3-extensions test case, for example):
   ```
   package org.apache.iceberg.spark.extensions;
   
   import com.google.common.collect.Lists;
   import org.apache.iceberg.spark.SparkCatalog;
   import org.apache.iceberg.spark.source.SimpleRecord;
   import org.apache.iceberg.spark.source.SparkTable;
   import org.apache.spark.sql.Column;
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Encoders;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.connector.catalog.Identifier;
   import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation;
   import org.junit.After;
   import org.junit.Assert;
   import org.junit.Test;
   import scala.Some;
   
   import java.util.Arrays;
   import java.util.List;
   import java.util.Map;
   
   import static org.apache.spark.sql.functions.max;
   
   public class TestSparkMetadataTable extends SparkExtensionsTestBase {
   
     public TestSparkMetadataTable(String catalogName, String implementation, Map<String, String> config) {
       super(catalogName, implementation, config);
     }
   
     @Test
     public void testDataFileProjectionError1() throws Exception {
       // init load
       List<SimpleRecord> records = Lists.newArrayList(new SimpleRecord(1, "1"));
       Dataset<Row> inputDf = spark.createDataFrame(records, SimpleRecord.class);
       inputDf.writeTo(tableName).create();
   
   
       SparkCatalog catalog = (SparkCatalog) spark.sessionState().catalogManager().catalog(catalogName);
       String[] tableIdentifiers = tableName.split("\\.");
       Identifier metaId = Identifier.of(
           new String[]{tableIdentifiers[1], tableIdentifiers[2]}, "entries");
       SparkTable metaTable = catalog.loadTable(metaId);
       Dataset<Row> entriesDs = Dataset.ofRows(spark, DataSourceV2Relation.create(metaTable, Some.apply(catalog), Some.apply(
           metaId)));
       Column aggCol = entriesDs.col("data_file.record_count");
       Dataset<Row> agg = entriesDs.agg(max(aggCol));
       Assert.assertFalse(agg.collectAsList().isEmpty());
     }
   
     @Test
     public void testDataFileProjectionError2() throws Exception {
       // init load
       List<SimpleRecord> records = Lists.newArrayList(new SimpleRecord(1, "1"));
       Dataset<Row> inputDf = spark.createDataFrame(records, SimpleRecord.class);
       inputDf.writeTo(tableName).create();
   
   
       Dataset<Row> stringDs = spark.createDataset(Arrays.asList("my_path"), Encoders.STRING())
           .toDF("file_path");
   
       SparkCatalog catalog = (SparkCatalog) spark.sessionState().catalogManager().catalog(catalogName);
       String[] tableIdentifiers = tableName.split("\\.");
       Identifier metaId = Identifier.of(
           new String[]{tableIdentifiers[1], tableIdentifiers[2]}, "entries");
       SparkTable metaTable = catalog.loadTable(metaId);
       Dataset<Row> entriesDs = Dataset.ofRows(spark, DataSourceV2Relation.create(metaTable, Some.apply(catalog), Some.apply(
           metaId)));
   
       Column joinCond = entriesDs.col("data_file.file_path").equalTo(stringDs.col("file_path"));
       Dataset<Row> res = entriesDs.join(stringDs, joinCond);
       boolean empty = res.isEmpty();
       Assert.assertTrue(empty);
     }
   
     @After
     public void dropTables() {
       sql("DROP TABLE IF EXISTS %s", tableName);
     }
   }
   ```
   
   Side note: I use the "data_file" field to reproduce it; if I do not, I hit another error: https://github.com/apache/iceberg/issues/1378 and https://github.com/apache/iceberg/issues/1735 (same underlying error).
   
   There are two tests; they show that even a simple aggregation now fails, even with the workaround of using "data_file".



[GitHub] [iceberg] RussellSpitzer commented on issue #2783: Metadata Table Empty Projection -Unknown type for int field. Type name: java.lang.String

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on issue #2783:
URL: https://github.com/apache/iceberg/issues/2783#issuecomment-887059118


   @karuppayya has a temporary workaround: set
   ```
       sql("set spark.sql.optimizer.nestedSchemaPruning.enabled=false");
   ```



[GitHub] [iceberg] RussellSpitzer commented on issue #2783: Metadata Table Empty Projection -Unknown type for int field. Type name: java.lang.String

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on issue #2783:
URL: https://github.com/apache/iceberg/issues/2783#issuecomment-887662870


   PR posted ^ for those interested.



[GitHub] [iceberg] szehon-ho edited a comment on issue #2783: Metadata Table Empty Projection -Unknown type for int field. Type name: java.lang.String

Posted by GitBox <gi...@apache.org>.

szehon-ho edited a comment on issue #2783:
URL: https://github.com/apache/iceberg/issues/2783#issuecomment-873869272


   Adding a reproduction test (can run in spark3-extensions test case, for example):
   ```
   package org.apache.iceberg.spark.extensions;
   
   import com.google.common.collect.Lists;
   import org.apache.iceberg.spark.SparkCatalog;
   import org.apache.iceberg.spark.source.SimpleRecord;
   import org.apache.iceberg.spark.source.SparkTable;
   import org.apache.spark.sql.Column;
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Encoders;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.connector.catalog.Identifier;
   import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation;
   import org.junit.After;
   import org.junit.Assert;
   import org.junit.Test;
   import scala.Some;
   
   import java.util.Arrays;
   import java.util.List;
   import java.util.Map;
   
   public class TestSparkMetadataTable extends SparkExtensionsTestBase {
   
     public TestSparkMetadataTable(String catalogName, String implementation, Map<String, String> config) {
       super(catalogName, implementation, config);
     }
   
     @Test
     public void testDataFileProjectionError() throws Exception {
       // init load
       List<SimpleRecord> records = Lists.newArrayList(new SimpleRecord(1, "1"));
       Dataset<Row> inputDf = spark.createDataFrame(records, SimpleRecord.class);
       inputDf.writeTo(tableName).create();
   
   
       Dataset<Row> stringDs = spark.createDataset(Arrays.asList("my_path"), Encoders.STRING())
           .toDF("file_path");
   
       SparkCatalog catalog = (SparkCatalog) spark.sessionState().catalogManager().catalog(catalogName);
       String[] tableIdentifiers = tableName.split("\\.");
       Identifier metaId = Identifier.of(
           new String[]{tableIdentifiers[1], tableIdentifiers[2]}, "entries");
       SparkTable metaTable = catalog.loadTable(metaId);
       Dataset<Row> entriesDs = Dataset.ofRows(spark, DataSourceV2Relation.create(metaTable, Some.apply(catalog), Some.apply(
           metaId)));
   
       Column joinCond = entriesDs.col("data_file.file_path").equalTo(stringDs.col("file_path"));
       Dataset<Row> res = entriesDs.join(stringDs, joinCond);
       boolean empty = res.isEmpty();
       Assert.assertTrue(empty);
     }
   
     @After
     public void dropTables() {
       sql("DROP TABLE IF EXISTS %s", tableName);
     }
   }
   ```
   
   Side note: I could not use a simple aggregation to reproduce it because of another error (two issues have been filed for the same error): https://github.com/apache/iceberg/issues/1378 and https://github.com/apache/iceberg/issues/1735




[GitHub] [iceberg] szehon-ho edited a comment on issue #2783: Metadata Table Empty Projection -Unknown type for int field. Type name: java.lang.String

Posted by GitBox <gi...@apache.org>.

szehon-ho edited a comment on issue #2783:
URL: https://github.com/apache/iceberg/issues/2783#issuecomment-873875087


   Preliminary analysis: The BaseFile class has some pruning logic, keeping a mapping called 'fromProjectionPos' that maps indexes in a projection (a subset of its columns) to the real indexes in its column list.
   
   I.e., if the projected array has three elements, which are fields 1, 2, and 10 of the original schema respectively, fromProjectionPos becomes {1, 2, 10}.
   
   This is fine, but the code generated by Spark 3 uses the original index to get all the fields. Its entry point is StructInternalRow: https://github.com/apache/iceberg/blob/master/spark/src/main/java/org/apache/iceberg/spark/source/StructInternalRow.java#L124 (here struct is the BaseDataFile). It calls getInt(0) expecting an int, as per the real column at index 0, but due to the mapping we return the value at index 1, which is actually a string, hence the error.



[GitHub] [iceberg] szehon-ho commented on issue #2783: Metadata Table Empty Projection -Unknown type for int field. Type name: java.lang.String

Posted by GitBox <gi...@apache.org>.

szehon-ho commented on issue #2783:
URL: https://github.com/apache/iceberg/issues/2783#issuecomment-874899341


   One more note: this issue happens only after upgrading to Spark 3.1.



[GitHub] [iceberg] szehon-ho edited a comment on issue #2783: Metadata Table Empty Projection -Unknown type for int field. Type name: java.lang.String

Posted by GitBox <gi...@apache.org>.

szehon-ho edited a comment on issue #2783:
URL: https://github.com/apache/iceberg/issues/2783#issuecomment-873875087


   Preliminary analysis:  The BaseDataFile class has some pruning logic, keeping a mapping called 'fromProjectionPos' that maps indexes in a projection (a subset of its columns) to real indexes in its column list.
   
   I.e., if the projected array has three elements, which are fields 1, 2, and 10 respectively, this becomes {1, 2, 10}.
   
   This is fine, but the Spark-generated code in Spark 3 uses the original index to get all the fields. Its entry point is StructInternalRow: https://github.com/apache/iceberg/blob/master/spark/src/main/java/org/apache/iceberg/spark/source/StructInternalRow.java#L124 (here struct is the BaseDataFile). It calls getInt(0) expecting an int, as per the real column at index 0, but due to the mapping we return the value at index 1, which is actually a string, hence the error.



[GitHub] [iceberg] szehon-ho edited a comment on issue #2783: Metadata Table Empty Projection -Unknown type for int field. Type name: java.lang.String

Posted by GitBox <gi...@apache.org>.

szehon-ho edited a comment on issue #2783:
URL: https://github.com/apache/iceberg/issues/2783#issuecomment-873875087


   Preliminary analysis:  The BaseFile class uses some pruning logic, keeping a mapping called 'fromProjectionPos' that maps indexes in a projection (a subset of its columns) to real indexes in its column list.
   
   I.e., if the projected array has three elements, which are fields 1, 2, and 10 of the original schema respectively, fromProjectionPos becomes {1, 2, 10}.
   
   This is usually fine, but the Spark-generated code in Spark 3 uses the original index to get all the fields. Its entry point is StructInternalRow: https://github.com/apache/iceberg/blob/master/spark/src/main/java/org/apache/iceberg/spark/source/StructInternalRow.java#L124 (here 'struct' is the BaseDataFile with the pruning mapping, but 'type' is the original Type).
   
   It calls getInt(0) expecting an int, as per the real column at index 0, but due to the mapping we return the value at index 1, which is actually a string, hence the error.
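
The mismatch described above can be reproduced in isolation. The following is a minimal sketch, not Iceberg's actual code: ProjectionMismatchSketch, ProjectedStruct, sample() and the eleven-column row are hypothetical stand-ins, and only the name fromProjectionPos and the error text come from this issue. It assumes the struct translates every position through the projection mapping, while the caller passes a full-schema index.

```java
// Hypothetical sketch; these are not Iceberg's real classes.
final class ProjectionMismatchSketch {

  // Stand-in for a pruned struct such as BaseFile: values are stored in
  // full-schema order, and get() translates a projected position.
  static final class ProjectedStruct {
    private final Object[] fullRow;
    private final int[] fromProjectionPos;

    ProjectedStruct(Object[] fullRow, int[] fromProjectionPos) {
      this.fullRow = fullRow;
      this.fromProjectionPos = fromProjectionPos;
    }

    // Expects a position in the projection, not in the full schema.
    Object get(int projectedPos) {
      return fullRow[fromProjectionPos[projectedPos]];
    }
  }

  // Mimics a typed accessor that trusts the caller's idea of the column type.
  static int getInt(ProjectedStruct struct, int pos) {
    Object value = struct.get(pos);
    if (value instanceof Integer) {
      return (Integer) value;
    }
    throw new IllegalStateException(
        "Unknown type for int field. Type name: " + value.getClass().getName());
  }

  // Eleven illustrative columns: an int at 0, strings at 1 and 2, a long at 10.
  static ProjectedStruct sample() {
    Object[] fullRow = new Object[11];
    fullRow[0] = 0;
    fullRow[1] = "my_path";
    fullRow[2] = "PARQUET";
    fullRow[10] = 1024L;
    return new ProjectedStruct(fullRow, new int[] {1, 2, 10});
  }

  public static void main(String[] args) {
    ProjectedStruct struct = sample();
    // Projected position 0 resolves to full-schema field 1.
    System.out.println(struct.get(0));
    try {
      // The caller means full-schema index 0 (an int), but the mapping
      // redirects the read to field 1 (a string).
      getInt(struct, 0);
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

In this sketch, the fix would be for both sides to agree on one index space: either the caller uses projected positions, or the struct is wrapped so that it accepts full-schema positions.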



[GitHub] [iceberg] RussellSpitzer edited a comment on issue #2783: Metadata Table Empty Projection -Unknown type for int field. Type name: java.lang.String

Posted by GitBox <gi...@apache.org>.

RussellSpitzer edited a comment on issue #2783:
URL: https://github.com/apache/iceberg/issues/2783#issuecomment-886952345


   So I think I tracked this down: the basic issue is that Spark 3.1 correctly prunes nested structs and Spark 3.0 does not. You may wonder, if Spark 3.1 correctly prunes nested structs, why is this an issue?
   
   The issue is that we end up reading only 2 fields out of our metadata tables and correctly present them. But our code that creates the UnsafeProjection assumes that if a nested struct is read, then all of its fields are read, so we end up building a projection that requires all columns rather than just the ones we have actually extracted. This means we build a broken projection.
   
   See the RowDataReader projection, which only does top-level pruning.
   https://github.com/apache/iceberg/blob/a79de571860a290f6e96ac562d616c9c6be2071e/spark/src/main/java/org/apache/iceberg/spark/source/RowDataReader.java#L208-L211
   
   If we never prune columns out of the struct, this is fine; if we do, we have a problem.
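
One way to picture why top-level-only pruning breaks is that a nested field's ordinal depends on which struct layout it is resolved against. The sketch below is an illustration under assumptions, not the RowDataReader code: NestedPruningSketch and ordinals() are hypothetical, and the three field names are an illustrative subset rather than the full data_file schema.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not Iceberg's or Spark's real API.
final class NestedPruningSketch {

  // Resolves each requested field name to its ordinal within the given struct layout.
  static int[] ordinals(List<String> structFields, List<String> requested) {
    int[] result = new int[requested.size()];
    for (int i = 0; i < requested.size(); i++) {
      int pos = structFields.indexOf(requested.get(i));
      if (pos < 0) {
        throw new IllegalArgumentException("Field not in struct: " + requested.get(i));
      }
      result[i] = pos;
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> fullStruct = Arrays.asList("content", "file_path", "file_format");
    List<String> prunedStruct = Arrays.asList("file_path");
    List<String> wanted = Arrays.asList("file_path");

    // Ordinal assumed when the projection is built from the unpruned struct type.
    System.out.println(Arrays.toString(ordinals(fullStruct, wanted)));
    // Ordinal in the struct that was actually read after nested pruning.
    System.out.println(Arrays.toString(ordinals(prunedStruct, wanted)));
  }
}
```

Here the first call yields ordinal 1 and the second yields ordinal 0, so a projection built from the full struct type reads the wrong slot of a row that contains only the pruned fields.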



[GitHub] [iceberg] RussellSpitzer commented on issue #2783: Metadata Table Empty Projection -Unknown type for int field. Type name: java.lang.String

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on issue #2783:
URL: https://github.com/apache/iceberg/issues/2783#issuecomment-886952345


   So I think I tracked this down: the basic issue is that Spark 3.1 correctly prunes nested structs and Spark 3.0 does not. You may wonder, if Spark 3.1 correctly prunes nested structs, why is this an issue?
   
   The issue is that we end up reading only 2 fields out of our metadata tables and correctly present them. But our code that creates the UnsafeProjection assumes that if a nested struct is read, then all of its fields are read, so we end up building a projection that requires all columns rather than just the ones we have actually extracted. This means we build a broken projection.
   
   See the RowDataReader projection, which only does top-level pruning.



[GitHub] [iceberg] szehon-ho commented on issue #2783: Metadata Table Empty Projection -Unknown type for int field. Type name: java.lang.String

Posted by GitBox <gi...@apache.org>.

szehon-ho commented on issue #2783:
URL: https://github.com/apache/iceberg/issues/2783#issuecomment-873875087


   Preliminary analysis:  The BaseDataFile class has some pruning logic, keeping a mapping called 'fromProjectionPos' that maps indexes in a projection (a subset of its columns) to real indexes in its column list.
   
   I.e., if the projected array has three elements, which are fields 1, 2, and 10 respectively, this becomes {1, 2, 10}.
   
   This is fine, but the Spark-generated code in Spark 3 uses the original index to get all the fields. Its entry point is StructInternalRow: https://github.com/apache/iceberg/blob/master/spark/src/main/java/org/apache/iceberg/spark/source/StructInternalRow.java#L124 (here struct is the BaseDataFile). It calls getInt(0), but due to the mapping we return the value at index 1, which is actually a string, hence the error.



[GitHub] [iceberg] szehon-ho edited a comment on issue #2783: Metadata Table Empty Projection -Unknown type for int field. Type name: java.lang.String

Posted by GitBox <gi...@apache.org>.

szehon-ho edited a comment on issue #2783:
URL: https://github.com/apache/iceberg/issues/2783#issuecomment-873875087


   Preliminary analysis:  The BaseFile class has some pruning logic, keeping a mapping called 'fromProjectionPos' that maps indexes in a projection (a subset of its columns) to real indexes in its column list.
   
   I.e., if the projected array has three elements, which are fields 1, 2, and 10 respectively, this becomes {1, 2, 10}.
   
   This is fine, but the Spark-generated code in Spark 3 uses the original index to get all the fields. Its entry point is StructInternalRow: https://github.com/apache/iceberg/blob/master/spark/src/main/java/org/apache/iceberg/spark/source/StructInternalRow.java#L124 (here struct is the BaseDataFile). It calls getInt(0) expecting an int, as per the real column at index 0, but due to the mapping we return the value at index 1, which is actually a string, hence the error.

