You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/12/19 20:24:28 UTC

[GitHub] [iceberg] SinghAsDev opened a new pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

SinghAsDev opened a new pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774


   Add support to read Parquet files written with old 2-level list structures. This should resolve https://github.com/apache/iceberg/issues/3759.
   
   Will rebase on top of https://github.com/apache/iceberg/pull/3773 once it is merged. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#issuecomment-1001831880


   > 3773 is merged, could you rebase?
   
   @jackye1995 done


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r789323849



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetTypeVisitor.java
##########
@@ -66,17 +66,21 @@
     Preconditions.checkArgument(list.getFieldCount() == 1,
         "Invalid list: does not contain single repeated field: %s", list);
 
-    GroupType repeatedElement = list.getFields().get(0).asGroupType();
+    Type repeatedElement = list.getFields().get(0);

Review comment:
       Here as well, I'd rather just have a util method that checks for a 2-level list and returns the correct element type.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r790077162



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestCreateActions.java
##########
@@ -610,7 +623,71 @@ public void testStructOfThreeLevelLists() throws Exception {
     structOfThreeLevelLists(false);
   }
 
-  public void threeLevelList(boolean useLegacyMode) throws Exception {
+  @Test
+  public void testTwoLevelList() throws IOException {
+    spark.conf().set("spark.sql.parquet.writeLegacyFormat", true);
+
+    String tableName = sourceName("testTwoLevelList");
+    File location = temp.newFolder();
+
+    StructType sparkSchema =
+        new StructType(
+            new StructField[]{
+                new StructField(
+                        "col1", new ArrayType(
+                            new StructType(
+                                new StructField[]{
+                                    new StructField(
+                                        "col2",
+                                        DataTypes.IntegerType,
+                                        false,
+                                        Metadata.empty())
+                                }), false), true, Metadata.empty())});
+    String expectedParquetSchema =
+        "message spark_schema {\n" +
+            "  optional group col1 (LIST) {\n" +
+            "    repeated group array {\n" +
+            "      required int32 col2;\n" +
+            "    }\n" +
+            "  }\n" +
+            "}\n";
+
+
+    // generate parquet file with required schema
+    List<String> testData = Collections.singletonList("{\"col1\": [{\"col2\": 1}]}");
+    spark.read().schema(sparkSchema).json(
+            JavaSparkContext.fromSparkContext(spark.sparkContext()).parallelize(testData))
+        .coalesce(1).write().format("parquet").mode(SaveMode.Append).save(location.getPath());
+
+    File parquetFile = Arrays.stream(Objects.requireNonNull(location.listFiles(new FilenameFilter() {
+      @Override
+      public boolean accept(File dir, String name) {
+        return name.endsWith("parquet");
+      }
+    }))).findAny().get();
+
+    // verify generated parquet file has expected schema
+    ParquetFileReader pqReader = ParquetFileReader.open(HadoopInputFile.fromPath(new Path(parquetFile.getPath()),
+        new Configuration()));

Review comment:
       This shouldn't call `new Configuration()`. Instead, use the one from Spark.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r791321420



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetTypeVisitor.java
##########
@@ -66,17 +66,16 @@
     Preconditions.checkArgument(list.getFieldCount() == 1,
         "Invalid list: does not contain single repeated field: %s", list);
 
-    GroupType repeatedElement = list.getFields().get(0).asGroupType();
+    Type repeatedElement = list.getFields().get(0);
     Preconditions.checkArgument(repeatedElement.isRepetition(Type.Repetition.REPEATED),
         "Invalid list: inner group is not repeated");
-    Preconditions.checkArgument(repeatedElement.getFieldCount() <= 1,
-        "Invalid list: repeated group is not a single field: %s", list);
+
+    Type elementField = ParquetSchemaUtil.getListElementType(list);
 
     visitor.beforeRepeatedElement(repeatedElement);
     try {
       T elementResult = null;
-      if (repeatedElement.getFieldCount() > 0) {
-        Type elementField = repeatedElement.getType(0);
+      if (repeatedElement.isPrimitive() || repeatedElement.asGroupType().getFieldCount() > 0) {

Review comment:
       The changes to this class are inconsistent with the changes to the `TypeWithSchemaVisitor`. Here, the repeated element is always visited (`beforeRepeatedElement` call above) and may be processed again as the element. The other avoids pushing the name on the stack. If `beforeRepeatedElement` were used to track names, I think it would get a duplicate name for the repeated group.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r791304657



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReader.java
##########
@@ -169,6 +178,60 @@ public void testInt96TimestampProducedBySparkIsReadCorrectly() throws IOExceptio
     }
   }
 
+  @Test
+  public void testWriteReadAvroBinary() throws IOException {
+    String schema = "{" +
+        "\"type\":\"record\"," +
+        "\"name\":\"DbRecord\"," +
+        "\"namespace\":\"com.iceberg\"," +
+        "\"fields\":[" +
+        "{\"name\":\"arraybytes\", " +
+        "\"type\":[ \"null\", { \"type\":\"array\", \"items\":\"bytes\"}], \"default\":null}," +
+        "{\"name\":\"topbytes\", \"type\":\"bytes\"}" +
+        "]" +
+        "}";
+
+    org.apache.avro.Schema.Parser parser = new org.apache.avro.Schema.Parser();
+    org.apache.avro.Schema avroSchema = parser.parse(schema);
+    AvroSchemaConverter converter = new AvroSchemaConverter();
+    MessageType parquetScehma = converter.convert(avroSchema);
+    Schema icebergSchema = ParquetSchemaUtil.convert(parquetScehma);
+
+    File testFile = temp.newFile();
+    Assert.assertTrue(testFile.delete());
+
+    ParquetWriter<GenericRecord> writer = AvroParquetWriter.<GenericRecord>builder(new Path(testFile.toURI()))
+        .withDataModel(GenericData.get())
+        .withSchema(avroSchema)
+        .config("parquet.avro.add-list-element-records", "true")
+        .config("parquet.avro.write-old-list-structure", "true")
+        .build();
+
+    GenericRecordBuilder recordBuilder = new GenericRecordBuilder(avroSchema);
+    List<ByteBuffer> expectedByteList = new ArrayList();
+    byte[] expectedByte = {0x00, 0x01};
+    expectedByteList.add(ByteBuffer.wrap(expectedByte));
+
+    recordBuilder.set("arraybytes", expectedByteList);
+    recordBuilder.set("topbytes", ByteBuffer.wrap(expectedByte));
+    GenericData.Record record = recordBuilder.build();
+    writer.write(record);
+    writer.close();
+
+    List<InternalRow> rows;
+    try (CloseableIterable<InternalRow> reader =
+             Parquet.read(Files.localInput(testFile))
+                 .project(icebergSchema)
+                 .createReaderFunc(type -> SparkParquetReaders.buildReader(icebergSchema, type))
+                 .build()) {

Review comment:
       `ParquetSchema.convert(MessageType)` assigns fallback ids to top-level fields and so the fields were being read.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r791305217



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/SparkParquetReaders.java
##########
@@ -187,13 +187,14 @@ private SparkParquetReaders() {
     @Override
     public ParquetValueReader<?> list(Types.ListType expectedList, GroupType array,
                                       ParquetValueReader<?> elementReader) {
-      GroupType repeated = array.getFields().get(0).asGroupType();
+      Type repeated = array.getFields().get(0);
+      boolean isOldListElementType = ParquetSchemaUtil.isOldListElementType(repeated, array.getName());
       String[] repeatedPath = currentPath();
 
       int repeatedD = type.getMaxDefinitionLevel(repeatedPath) - 1;
       int repeatedR = type.getMaxRepetitionLevel(repeatedPath) - 1;
 
-      Type elementType = repeated.getType(0);
+      Type elementType = isOldListElementType ? repeated : repeated.asGroupType().getType(0);

Review comment:
       Added. As we discussed, I will backport to older versions of Spark and flink in a separate diff.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#issuecomment-1009232620


   @jackye1995 @RussellSpitzer @kbendick @rdblue can I get a review on this, thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r795267302



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/TypeWithSchemaVisitor.java
##########
@@ -149,6 +142,29 @@
     }
   }
 
+  private static <T> T visitTwoLevelList(Types.ListType iListType, Types.NestedField iListElement, GroupType pListType,
+      Type pListElement, TypeWithSchemaVisitor<T> visitor) {
+    T elementResult = visitField(iListElement, pListElement, visitor);
+

Review comment:
       It should already be fixed. Is the update still has style issue?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#issuecomment-1023774877


   @SinghAsDev, I did another round of review. This also still needs to update [Iceberg generics](https://github.com/apache/iceberg/blob/master/parquet/src/main/java/org/apache/iceberg/data/parquet/GenericParquetReaders.java), I think.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r790076008



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReader.java
##########
@@ -169,6 +178,60 @@ public void testInt96TimestampProducedBySparkIsReadCorrectly() throws IOExceptio
     }
   }
 
+  @Test
+  public void testWriteReadAvroBinary() throws IOException {
+    String schema = "{" +
+        "\"type\":\"record\"," +
+        "\"name\":\"DbRecord\"," +
+        "\"namespace\":\"com.iceberg\"," +
+        "\"fields\":[" +
+        "{\"name\":\"arraybytes\", " +
+        "\"type\":[ \"null\", { \"type\":\"array\", \"items\":\"bytes\"}], \"default\":null}," +
+        "{\"name\":\"topbytes\", \"type\":\"bytes\"}" +
+        "]" +
+        "}";
+
+    org.apache.avro.Schema.Parser parser = new org.apache.avro.Schema.Parser();
+    org.apache.avro.Schema avroSchema = parser.parse(schema);
+    AvroSchemaConverter converter = new AvroSchemaConverter();
+    MessageType parquetScehma = converter.convert(avroSchema);
+    Schema icebergSchema = ParquetSchemaUtil.convert(parquetScehma);
+
+    File testFile = temp.newFile();
+    Assert.assertTrue(testFile.delete());
+
+    ParquetWriter<GenericRecord> writer = AvroParquetWriter.<GenericRecord>builder(new Path(testFile.toURI()))
+        .withDataModel(GenericData.get())
+        .withSchema(avroSchema)
+        .config("parquet.avro.add-list-element-records", "true")
+        .config("parquet.avro.write-old-list-structure", "true")
+        .build();
+
+    GenericRecordBuilder recordBuilder = new GenericRecordBuilder(avroSchema);
+    List<ByteBuffer> expectedByteList = new ArrayList();
+    byte[] expectedByte = {0x00, 0x01};
+    expectedByteList.add(ByteBuffer.wrap(expectedByte));
+
+    recordBuilder.set("arraybytes", expectedByteList);
+    recordBuilder.set("topbytes", ByteBuffer.wrap(expectedByte));
+    GenericData.Record record = recordBuilder.build();
+    writer.write(record);
+    writer.close();
+
+    List<InternalRow> rows;
+    try (CloseableIterable<InternalRow> reader =
+             Parquet.read(Files.localInput(testFile))

Review comment:
       Indentation is off.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r789322696



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/MessageTypeToType.java
##########
@@ -96,11 +96,13 @@ public Type struct(GroupType struct, List<Type> fieldTypes) {
 
   @Override
   public Type list(GroupType array, Type elementType) {
-    GroupType repeated = array.getType(0).asGroupType();
-    org.apache.parquet.schema.Type element = repeated.getType(0);
+    org.apache.parquet.schema.Type repeated = array.getType(0);
+    org.apache.parquet.schema.Type repeatedElement = array.getFields().get(0);
+    boolean isElementType = ParquetSchemaUtil.isListElementType(repeatedElement, array.getName());
+    org.apache.parquet.schema.Type element = isElementType ? repeated : repeated.asGroupType().getType(0);
 
     Preconditions.checkArgument(
-        !element.isRepetition(Repetition.REPEATED),
+        isElementType || !element.isRepetition(Repetition.REPEATED),
         "Elements cannot have repetition REPEATED: %s", element);

Review comment:
       I don't know that this is useful anymore. I think it was actually checking to make sure we didn't try to read 2-level lists.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] wizardxz commented on pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
wizardxz commented on pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#issuecomment-1005109596


   LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r784199120



##########
File path: parquet/src/test/java/org/apache/iceberg/parquet/TestParquetSchemaUtil.java
##########
@@ -204,6 +204,55 @@ public void testSchemaConversionForHiveStyleLists() {
     Assert.assertEquals("Schema must match", expectedSchema.asStruct(), actualSchema.asStruct());
   }
 
+  @Test
+  public void testSchemaConversionForHiveStyleTwoLevelList() {

Review comment:
       Should we have tests for the other backwards compatibility rules? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r790078373



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/SparkParquetReaders.java
##########
@@ -187,13 +187,14 @@ private SparkParquetReaders() {
     @Override
     public ParquetValueReader<?> list(Types.ListType expectedList, GroupType array,
                                       ParquetValueReader<?> elementReader) {
-      GroupType repeated = array.getFields().get(0).asGroupType();
+      Type repeated = array.getFields().get(0);
+      boolean isOldListElementType = ParquetSchemaUtil.isOldListElementType(repeated, array.getName());
       String[] repeatedPath = currentPath();
 
       int repeatedD = type.getMaxDefinitionLevel(repeatedPath) - 1;
       int repeatedR = type.getMaxRepetitionLevel(repeatedPath) - 1;
 
-      Type elementType = repeated.getType(0);
+      Type elementType = isOldListElementType ? repeated : repeated.asGroupType().getType(0);

Review comment:
       Looks like this doesn't fix Flink or Generics?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r790075806



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReader.java
##########
@@ -169,6 +178,60 @@ public void testInt96TimestampProducedBySparkIsReadCorrectly() throws IOExceptio
     }
   }
 
+  @Test
+  public void testWriteReadAvroBinary() throws IOException {

Review comment:
       I think this should have a better name if it is intended to test 2-level lists created by parquet-avro.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r791316278



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java
##########
@@ -164,4 +165,62 @@ public Boolean primitive(PrimitiveType primitive) {
     }
   }
 
+  public static Type getListElementType(GroupType array) {

Review comment:
       Iceberg doesn't use `get` in method names. There is probably a more specific verb you should use here instead, like `determine` or `find`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r791316387



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java
##########
@@ -164,4 +165,62 @@ public Boolean primitive(PrimitiveType primitive) {
     }
   }
 
+  public static Type getListElementType(GroupType array) {
+    Type repeated = array.getFields().get(0);
+    boolean isOldListElementType = ParquetSchemaUtil.isOldListElementType(array);

Review comment:
       Nit: this uses the current class as a prefix for a static method call.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r794110511



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetTypeVisitor.java
##########
@@ -66,32 +66,52 @@
     Preconditions.checkArgument(list.getFieldCount() == 1,
         "Invalid list: does not contain single repeated field: %s", list);
 
-    GroupType repeatedElement = list.getFields().get(0).asGroupType();
+    Type repeatedElement = list.getFields().get(0);
     Preconditions.checkArgument(repeatedElement.isRepetition(Type.Repetition.REPEATED),
         "Invalid list: inner group is not repeated");
-    Preconditions.checkArgument(repeatedElement.getFieldCount() <= 1,
-        "Invalid list: repeated group is not a single field: %s", list);
+
+    Type listElement = ParquetSchemaUtil.determineListElementType(list);
+    if (listElement.isRepetition(Type.Repetition.REPEATED)) {
+      return visitTwoLevelList(list, listElement, visitor);
+    } else {
+      return visitThreeLevelList(list, listElement, visitor);
+    }
+  }
+
+  private static <T> T visitTwoLevelList(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    T elementResult = visitListElement(list, listElement, visitor);
+    return visitor.list(list, elementResult);
+  }
+
+  private static <T> T visitThreeLevelList(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    Type repeatedElement = list.getFields().get(0);
 
     visitor.beforeRepeatedElement(repeatedElement);
     try {
-      T elementResult = null;
-      if (repeatedElement.getFieldCount() > 0) {
-        Type elementField = repeatedElement.getType(0);
-        visitor.beforeElementField(elementField);
-        try {
-          elementResult = visit(elementField, visitor);
-        } finally {
-          visitor.afterElementField(elementField);
-        }
-      }
+      T elementResult = visitListElement(list, listElement, visitor);
 
       return visitor.list(list, elementResult);
-
     } finally {
       visitor.afterRepeatedElement(repeatedElement);
     }
   }
 
+  private static <T> T visitListElement(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    Type repeatedElement = list.getFields().get(0);
+    T elementResult = null;
+
+    if (repeatedElement.isPrimitive() || repeatedElement.asGroupType().getFieldCount() > 0) {

Review comment:
       As I noted in the visitor, you can remove this check entirely. There are lots of places that aren't as careful and call `getType(0)` without checking the field count like this does.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#issuecomment-999897025


   3773 is merged, could you rebase?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r790076249



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReader.java
##########
@@ -169,6 +178,60 @@ public void testInt96TimestampProducedBySparkIsReadCorrectly() throws IOExceptio
     }
   }
 
+  @Test
+  public void testWriteReadAvroBinary() throws IOException {
+    String schema = "{" +
+        "\"type\":\"record\"," +
+        "\"name\":\"DbRecord\"," +
+        "\"namespace\":\"com.iceberg\"," +
+        "\"fields\":[" +
+        "{\"name\":\"arraybytes\", " +
+        "\"type\":[ \"null\", { \"type\":\"array\", \"items\":\"bytes\"}], \"default\":null}," +
+        "{\"name\":\"topbytes\", \"type\":\"bytes\"}" +
+        "]" +
+        "}";
+
+    org.apache.avro.Schema.Parser parser = new org.apache.avro.Schema.Parser();
+    org.apache.avro.Schema avroSchema = parser.parse(schema);
+    AvroSchemaConverter converter = new AvroSchemaConverter();
+    MessageType parquetScehma = converter.convert(avroSchema);
+    Schema icebergSchema = ParquetSchemaUtil.convert(parquetScehma);
+
+    File testFile = temp.newFile();
+    Assert.assertTrue(testFile.delete());
+
+    ParquetWriter<GenericRecord> writer = AvroParquetWriter.<GenericRecord>builder(new Path(testFile.toURI()))
+        .withDataModel(GenericData.get())
+        .withSchema(avroSchema)
+        .config("parquet.avro.add-list-element-records", "true")
+        .config("parquet.avro.write-old-list-structure", "true")
+        .build();
+
+    GenericRecordBuilder recordBuilder = new GenericRecordBuilder(avroSchema);
+    List<ByteBuffer> expectedByteList = new ArrayList();
+    byte[] expectedByte = {0x00, 0x01};
+    expectedByteList.add(ByteBuffer.wrap(expectedByte));
+
+    recordBuilder.set("arraybytes", expectedByteList);
+    recordBuilder.set("topbytes", ByteBuffer.wrap(expectedByte));
+    GenericData.Record record = recordBuilder.build();
+    writer.write(record);
+    writer.close();
+
+    List<InternalRow> rows;
+    try (CloseableIterable<InternalRow> reader =
+             Parquet.read(Files.localInput(testFile))
+                 .project(icebergSchema)
+                 .createReaderFunc(type -> SparkParquetReaders.buildReader(icebergSchema, type))
+                 .build()) {
+      rows = Lists.newArrayList(reader);
+    }
+
+    InternalRow row = rows.get(0);

Review comment:
       I don't see a reason for this test to be in Spark. This isn't related to Spark, it is testing parquet-avro files. Can you move this to the Parquet tests instead?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r791318678



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/TypeWithSchemaVisitor.java
##########
@@ -63,29 +63,14 @@
             Preconditions.checkArgument(group.getFieldCount() == 1,
                 "Invalid list: does not contain single repeated field: %s", group);
 
-            GroupType repeatedElement = group.getFields().get(0).asGroupType();
+            Type repeatedElement = group.getFields().get(0);
             Preconditions.checkArgument(repeatedElement.isRepetition(Type.Repetition.REPEATED),
                 "Invalid list: inner group is not repeated");
-            Preconditions.checkArgument(repeatedElement.getFieldCount() <= 1,
-                "Invalid list: repeated group is not a single field: %s", group);
 
-            Types.ListType list = null;
-            Types.NestedField element = null;
-            if (iType != null) {
-              list = iType.asListType();
-              element = list.fields().get(0);
-            }
-
-            visitor.fieldNames.push(repeatedElement.getName());
-            try {
-              T elementResult = null;
-              if (repeatedElement.getFieldCount() > 0) {
-                elementResult = visitField(element, repeatedElement.getType(0), visitor);
-              }
-
-              return visitor.list(list, group, elementResult);
-            } finally {
-              visitor.fieldNames.pop();
+            if (ParquetSchemaUtil.isOldListElementType(group)) {

Review comment:
       Check for REPEATED and pass the element?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r791316930



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ApplyNameMapping.java
##########
@@ -67,11 +67,17 @@ public Type list(GroupType list, Type elementType) {
     Preconditions.checkArgument(elementType != null,
         "List type must have element field");
 
+    boolean isOldListElementType = ParquetSchemaUtil.isOldListElementType(list);
     MappedField field = nameMapping.find(currentPath());
-    Type listType = Types.buildGroup(list.getRepetition())
-        .as(LogicalTypeAnnotation.listType())
-        .repeatedGroup().addFields(elementType).named(list.getFieldName(0))
-        .named(list.getName());
+
+    Types.GroupBuilder<GroupType> listBuilder = Types.buildGroup(list.getRepetition())
+        .as(LogicalTypeAnnotation.listType());
+    if (isOldListElementType) {

Review comment:
       I think this can check whether the repetition level of the element is `REPEATED` to determine whether it is a 2-level list. I think that would be better than allowing the `isOldListElementType` check in multiple places.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r791088471



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/MessageTypeToType.java
##########
@@ -96,11 +96,13 @@ public Type struct(GroupType struct, List<Type> fieldTypes) {
 
   @Override
   public Type list(GroupType array, Type elementType) {
-    GroupType repeated = array.getType(0).asGroupType();
-    org.apache.parquet.schema.Type element = repeated.getType(0);
+    org.apache.parquet.schema.Type repeated = array.getType(0);
+    org.apache.parquet.schema.Type repeatedElement = array.getFields().get(0);
+    boolean isElementType = ParquetSchemaUtil.isListElementType(repeatedElement, array.getName());
+    org.apache.parquet.schema.Type element = isElementType ? repeated : repeated.asGroupType().getType(0);
 
     Preconditions.checkArgument(
-        !element.isRepetition(Repetition.REPEATED),
+        isElementType || !element.isRepetition(Repetition.REPEATED),
         "Elements cannot have repetition REPEATED: %s", element);

Review comment:
       Yea, forgot to drop this in earlier update.  Dropping it.

##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java
##########
@@ -164,4 +164,48 @@ public Boolean primitive(PrimitiveType primitive) {
     }
   }
 
+  // Parquet LIST backwards-compatibility rules.
+  // https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules
+  public static boolean isOldListElementType(Type repeatedType, String parentName) {

Review comment:
       This is also used outside of iceberg-parquet, like iceberg-spark. Good point on making method to accept list directly, updating.

##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/MessageTypeToType.java
##########
@@ -96,11 +96,13 @@ public Type struct(GroupType struct, List<Type> fieldTypes) {
 
   @Override
   public Type list(GroupType array, Type elementType) {
-    GroupType repeated = array.getType(0).asGroupType();
-    org.apache.parquet.schema.Type element = repeated.getType(0);
+    org.apache.parquet.schema.Type repeated = array.getType(0);
+    org.apache.parquet.schema.Type repeatedElement = array.getFields().get(0);

Review comment:
       Likely a copy paste error, we only need one of them. Updating.

##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/PruneColumns.java
##########
@@ -108,15 +108,16 @@ public Type struct(GroupType struct, List<Type> fields) {
 
   @Override
   public Type list(GroupType list, Type element) {
-    GroupType repeated = list.getType(0).asGroupType();
-    Type originalElement = repeated.getType(0);
+    Type repeated = list.getType(0);
+    boolean isOldListElementType = ParquetSchemaUtil.isOldListElementType(repeated, list.getName());
+    Type originalElement = isOldListElementType ? repeated : repeated.asGroupType().getType(0);
     Integer elementId = getId(originalElement);
 
     if (elementId != null && selectedIds.contains(elementId)) {
       return list;
     } else if (element != null) {
       if (!Objects.equal(element, originalElement)) {
-        return list.withNewFields(repeated.withNewFields(element));
+        return list.withNewFields(repeated.asGroupType().withNewFields(element));

Review comment:
       Thanks for catching this.

##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReader.java
##########
@@ -169,6 +178,60 @@ public void testInt96TimestampProducedBySparkIsReadCorrectly() throws IOExceptio
     }
   }
 
+  @Test
+  public void testWriteReadAvroBinary() throws IOException {
+    String schema = "{" +
+        "\"type\":\"record\"," +
+        "\"name\":\"DbRecord\"," +
+        "\"namespace\":\"com.iceberg\"," +
+        "\"fields\":[" +
+        "{\"name\":\"arraybytes\", " +
+        "\"type\":[ \"null\", { \"type\":\"array\", \"items\":\"bytes\"}], \"default\":null}," +
+        "{\"name\":\"topbytes\", \"type\":\"bytes\"}" +
+        "]" +
+        "}";
+
+    org.apache.avro.Schema.Parser parser = new org.apache.avro.Schema.Parser();
+    org.apache.avro.Schema avroSchema = parser.parse(schema);
+    AvroSchemaConverter converter = new AvroSchemaConverter();
+    MessageType parquetScehma = converter.convert(avroSchema);
+    Schema icebergSchema = ParquetSchemaUtil.convert(parquetScehma);
+
+    File testFile = temp.newFile();
+    Assert.assertTrue(testFile.delete());
+
+    ParquetWriter<GenericRecord> writer = AvroParquetWriter.<GenericRecord>builder(new Path(testFile.toURI()))
+        .withDataModel(GenericData.get())
+        .withSchema(avroSchema)
+        .config("parquet.avro.add-list-element-records", "true")
+        .config("parquet.avro.write-old-list-structure", "true")
+        .build();
+
+    GenericRecordBuilder recordBuilder = new GenericRecordBuilder(avroSchema);
+    List<ByteBuffer> expectedByteList = new ArrayList();
+    byte[] expectedByte = {0x00, 0x01};
+    expectedByteList.add(ByteBuffer.wrap(expectedByte));
+
+    recordBuilder.set("arraybytes", expectedByteList);
+    recordBuilder.set("topbytes", ByteBuffer.wrap(expectedByte));
+    GenericData.Record record = recordBuilder.build();
+    writer.write(record);
+    writer.close();
+
+    List<InternalRow> rows;
+    try (CloseableIterable<InternalRow> reader =
+             Parquet.read(Files.localInput(testFile))
+                 .project(icebergSchema)
+                 .createReaderFunc(type -> SparkParquetReaders.buildReader(icebergSchema, type))
+                 .build()) {

Review comment:
       `ParquetSchema.convert(MessageType)` assigns fallback ids to top-level fields and so the fields were being read.

##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestCreateActions.java
##########
@@ -610,7 +623,71 @@ public void testStructOfThreeLevelLists() throws Exception {
     structOfThreeLevelLists(false);
   }
 
-  public void threeLevelList(boolean useLegacyMode) throws Exception {
+  @Test
+  public void testTwoLevelList() throws IOException {
+    spark.conf().set("spark.sql.parquet.writeLegacyFormat", true);
+
+    String tableName = sourceName("testTwoLevelList");
+    File location = temp.newFolder();
+
+    StructType sparkSchema =
+        new StructType(
+            new StructField[]{
+                new StructField(
+                        "col1", new ArrayType(
+                            new StructType(
+                                new StructField[]{
+                                    new StructField(
+                                        "col2",
+                                        DataTypes.IntegerType,
+                                        false,
+                                        Metadata.empty())
+                                }), false), true, Metadata.empty())});
+    String expectedParquetSchema =
+        "message spark_schema {\n" +
+            "  optional group col1 (LIST) {\n" +
+            "    repeated group array {\n" +
+            "      required int32 col2;\n" +
+            "    }\n" +
+            "  }\n" +
+            "}\n";
+
+
+    // generate parquet file with required schema
+    List<String> testData = Collections.singletonList("{\"col1\": [{\"col2\": 1}]}");
+    spark.read().schema(sparkSchema).json(
+            JavaSparkContext.fromSparkContext(spark.sparkContext()).parallelize(testData))
+        .coalesce(1).write().format("parquet").mode(SaveMode.Append).save(location.getPath());
+
+    File parquetFile = Arrays.stream(Objects.requireNonNull(location.listFiles(new FilenameFilter() {
+      @Override
+      public boolean accept(File dir, String name) {
+        return name.endsWith("parquet");
+      }
+    }))).findAny().get();
+
+    // verify generated parquet file has expected schema
+    ParquetFileReader pqReader = ParquetFileReader.open(HadoopInputFile.fromPath(new Path(parquetFile.getPath()),
+        new Configuration()));
+    MessageType schema = pqReader.getFooter().getFileMetaData().getSchema();
+    Assert.assertEquals(expectedParquetSchema, schema.toString());

Review comment:
       I did not understand this, I am already comparing parquet schemas here.

##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/SparkParquetReaders.java
##########
@@ -187,13 +187,14 @@ private SparkParquetReaders() {
     @Override
     public ParquetValueReader<?> list(Types.ListType expectedList, GroupType array,
                                       ParquetValueReader<?> elementReader) {
-      GroupType repeated = array.getFields().get(0).asGroupType();
+      Type repeated = array.getFields().get(0);
+      boolean isOldListElementType = ParquetSchemaUtil.isOldListElementType(repeated, array.getName());
       String[] repeatedPath = currentPath();
 
       int repeatedD = type.getMaxDefinitionLevel(repeatedPath) - 1;
       int repeatedR = type.getMaxRepetitionLevel(repeatedPath) - 1;
 
-      Type elementType = repeated.getType(0);
+      Type elementType = isOldListElementType ? repeated : repeated.asGroupType().getType(0);

Review comment:
       Added. As we discussed, I will backport to older versions of Spark and flink in a separate diff.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#issuecomment-997466182


   @rdblue @RussellSpitzer can you help review this? Thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r795249134



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetTypeVisitor.java
##########
@@ -66,32 +66,52 @@
     Preconditions.checkArgument(list.getFieldCount() == 1,
         "Invalid list: does not contain single repeated field: %s", list);
 
-    GroupType repeatedElement = list.getFields().get(0).asGroupType();
+    Type repeatedElement = list.getFields().get(0);
     Preconditions.checkArgument(repeatedElement.isRepetition(Type.Repetition.REPEATED),
         "Invalid list: inner group is not repeated");
-    Preconditions.checkArgument(repeatedElement.getFieldCount() <= 1,
-        "Invalid list: repeated group is not a single field: %s", list);
+
+    Type listElement = ParquetSchemaUtil.determineListElementType(list);
+    if (listElement.isRepetition(Type.Repetition.REPEATED)) {
+      return visitTwoLevelList(list, listElement, visitor);
+    } else {
+      return visitThreeLevelList(list, listElement, visitor);
+    }
+  }
+
+  private static <T> T visitTwoLevelList(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    T elementResult = visitListElement(list, listElement, visitor);
+    return visitor.list(list, elementResult);
+  }
+
+  private static <T> T visitThreeLevelList(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    Type repeatedElement = list.getFields().get(0);
 
     visitor.beforeRepeatedElement(repeatedElement);
     try {
-      T elementResult = null;
-      if (repeatedElement.getFieldCount() > 0) {
-        Type elementField = repeatedElement.getType(0);
-        visitor.beforeElementField(elementField);
-        try {
-          elementResult = visit(elementField, visitor);
-        } finally {
-          visitor.afterElementField(elementField);
-        }
-      }
+      T elementResult = visitListElement(list, listElement, visitor);
 
       return visitor.list(list, elementResult);
-
     } finally {
       visitor.afterRepeatedElement(repeatedElement);
     }
   }
 
+  private static <T> T visitListElement(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    Type repeatedElement = list.getFields().get(0);
+    T elementResult = null;
+
+    if (repeatedElement.isPrimitive() || repeatedElement.asGroupType().getFieldCount() > 0) {

Review comment:
       Good point, updated




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r794107948



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetTypeVisitor.java
##########
@@ -66,32 +66,52 @@
     Preconditions.checkArgument(list.getFieldCount() == 1,
         "Invalid list: does not contain single repeated field: %s", list);
 
-    GroupType repeatedElement = list.getFields().get(0).asGroupType();
+    Type repeatedElement = list.getFields().get(0);
     Preconditions.checkArgument(repeatedElement.isRepetition(Type.Repetition.REPEATED),
         "Invalid list: inner group is not repeated");
-    Preconditions.checkArgument(repeatedElement.getFieldCount() <= 1,
-        "Invalid list: repeated group is not a single field: %s", list);
+
+    Type listElement = ParquetSchemaUtil.determineListElementType(list);
+    if (listElement.isRepetition(Type.Repetition.REPEATED)) {
+      return visitTwoLevelList(list, listElement, visitor);
+    } else {
+      return visitThreeLevelList(list, listElement, visitor);
+    }
+  }
+
+  private static <T> T visitTwoLevelList(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    T elementResult = visitListElement(list, listElement, visitor);
+    return visitor.list(list, elementResult);
+  }
+
+  private static <T> T visitThreeLevelList(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    Type repeatedElement = list.getFields().get(0);
 
     visitor.beforeRepeatedElement(repeatedElement);
     try {
-      T elementResult = null;
-      if (repeatedElement.getFieldCount() > 0) {
-        Type elementField = repeatedElement.getType(0);
-        visitor.beforeElementField(elementField);
-        try {
-          elementResult = visit(elementField, visitor);
-        } finally {
-          visitor.afterElementField(elementField);
-        }
-      }
+      T elementResult = visitListElement(list, listElement, visitor);
 
       return visitor.list(list, elementResult);
-
     } finally {
       visitor.afterRepeatedElement(repeatedElement);
     }
   }
 
+  private static <T> T visitListElement(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    Type repeatedElement = list.getFields().get(0);
+    T elementResult = null;
+
+    if (repeatedElement.isPrimitive() || repeatedElement.asGroupType().getFieldCount() > 0) {

Review comment:
       When would this be false?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r794107605



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetTypeVisitor.java
##########
@@ -66,32 +66,52 @@
     Preconditions.checkArgument(list.getFieldCount() == 1,
         "Invalid list: does not contain single repeated field: %s", list);
 
-    GroupType repeatedElement = list.getFields().get(0).asGroupType();
+    Type repeatedElement = list.getFields().get(0);
     Preconditions.checkArgument(repeatedElement.isRepetition(Type.Repetition.REPEATED),
         "Invalid list: inner group is not repeated");
-    Preconditions.checkArgument(repeatedElement.getFieldCount() <= 1,
-        "Invalid list: repeated group is not a single field: %s", list);
+
+    Type listElement = ParquetSchemaUtil.determineListElementType(list);
+    if (listElement.isRepetition(Type.Repetition.REPEATED)) {
+      return visitTwoLevelList(list, listElement, visitor);
+    } else {
+      return visitThreeLevelList(list, listElement, visitor);
+    }
+  }
+
+  private static <T> T visitTwoLevelList(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    T elementResult = visitListElement(list, listElement, visitor);
+    return visitor.list(list, elementResult);
+  }
+
+  private static <T> T visitThreeLevelList(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    Type repeatedElement = list.getFields().get(0);
 
     visitor.beforeRepeatedElement(repeatedElement);
     try {
-      T elementResult = null;
-      if (repeatedElement.getFieldCount() > 0) {
-        Type elementField = repeatedElement.getType(0);
-        visitor.beforeElementField(elementField);
-        try {
-          elementResult = visit(elementField, visitor);
-        } finally {
-          visitor.afterElementField(elementField);
-        }
-      }
+      T elementResult = visitListElement(list, listElement, visitor);
 
       return visitor.list(list, elementResult);
-
     } finally {
       visitor.afterRepeatedElement(repeatedElement);
     }
   }
 
+  private static <T> T visitListElement(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    Type repeatedElement = list.getFields().get(0);

Review comment:
       Same thing here. We don't need to find the repeated element 3 times.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r786219905



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java
##########
@@ -164,4 +164,48 @@ public Boolean primitive(PrimitiveType primitive) {
     }
   }
 
+  // Parquet LIST backwards-compatibility rules.
+  // https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules
+  public static boolean isListElementType(Type repeatedType, String parentName) {

Review comment:
       Good point, updating to `isOldListElementType`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r790076374



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReader.java
##########
@@ -169,6 +178,60 @@ public void testInt96TimestampProducedBySparkIsReadCorrectly() throws IOExceptio
     }
   }
 
+  @Test
+  public void testWriteReadAvroBinary() throws IOException {
+    String schema = "{" +
+        "\"type\":\"record\"," +
+        "\"name\":\"DbRecord\"," +
+        "\"namespace\":\"com.iceberg\"," +
+        "\"fields\":[" +
+        "{\"name\":\"arraybytes\", " +
+        "\"type\":[ \"null\", { \"type\":\"array\", \"items\":\"bytes\"}], \"default\":null}," +
+        "{\"name\":\"topbytes\", \"type\":\"bytes\"}" +
+        "]" +
+        "}";
+
+    org.apache.avro.Schema.Parser parser = new org.apache.avro.Schema.Parser();
+    org.apache.avro.Schema avroSchema = parser.parse(schema);
+    AvroSchemaConverter converter = new AvroSchemaConverter();
+    MessageType parquetScehma = converter.convert(avroSchema);
+    Schema icebergSchema = ParquetSchemaUtil.convert(parquetScehma);
+
+    File testFile = temp.newFile();
+    Assert.assertTrue(testFile.delete());
+
+    ParquetWriter<GenericRecord> writer = AvroParquetWriter.<GenericRecord>builder(new Path(testFile.toURI()))
+        .withDataModel(GenericData.get())
+        .withSchema(avroSchema)
+        .config("parquet.avro.add-list-element-records", "true")
+        .config("parquet.avro.write-old-list-structure", "true")
+        .build();
+
+    GenericRecordBuilder recordBuilder = new GenericRecordBuilder(avroSchema);
+    List<ByteBuffer> expectedByteList = new ArrayList();
+    byte[] expectedByte = {0x00, 0x01};
+    expectedByteList.add(ByteBuffer.wrap(expectedByte));
+
+    recordBuilder.set("arraybytes", expectedByteList);
+    recordBuilder.set("topbytes", ByteBuffer.wrap(expectedByte));
+    GenericData.Record record = recordBuilder.build();
+    writer.write(record);
+    writer.close();
+
+    List<InternalRow> rows;
+    try (CloseableIterable<InternalRow> reader =
+             Parquet.read(Files.localInput(testFile))
+                 .project(icebergSchema)
+                 .createReaderFunc(type -> SparkParquetReaders.buildReader(icebergSchema, type))
+                 .build()) {

Review comment:
       Why does this work? The file doesn't contain field IDs, so I don't think that any of the fields should be read by Iceberg readers.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r790077355



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestCreateActions.java
##########
@@ -610,7 +623,71 @@ public void testStructOfThreeLevelLists() throws Exception {
     structOfThreeLevelLists(false);
   }
 
-  public void threeLevelList(boolean useLegacyMode) throws Exception {
+  @Test
+  public void testTwoLevelList() throws IOException {
+    spark.conf().set("spark.sql.parquet.writeLegacyFormat", true);
+
+    String tableName = sourceName("testTwoLevelList");
+    File location = temp.newFolder();
+
+    StructType sparkSchema =
+        new StructType(
+            new StructField[]{
+                new StructField(
+                        "col1", new ArrayType(
+                            new StructType(
+                                new StructField[]{
+                                    new StructField(
+                                        "col2",
+                                        DataTypes.IntegerType,
+                                        false,
+                                        Metadata.empty())
+                                }), false), true, Metadata.empty())});
+    String expectedParquetSchema =
+        "message spark_schema {\n" +
+            "  optional group col1 (LIST) {\n" +
+            "    repeated group array {\n" +
+            "      required int32 col2;\n" +

Review comment:
       This is a 3-level list, not a 2-level list. Is this testing what you intended?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r790075607



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestCreateActions.java
##########
@@ -701,7 +778,6 @@ private void structOfThreeLevelLists(boolean useLegacyMode) throws Exception {
     assertEquals("Output must match", expected, results);
   }
 
-

Review comment:
       Nit: unnecessary whitespace change.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r794108564



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetTypeVisitor.java
##########
@@ -66,32 +66,52 @@
     Preconditions.checkArgument(list.getFieldCount() == 1,
         "Invalid list: does not contain single repeated field: %s", list);
 
-    GroupType repeatedElement = list.getFields().get(0).asGroupType();
+    Type repeatedElement = list.getFields().get(0);
     Preconditions.checkArgument(repeatedElement.isRepetition(Type.Repetition.REPEATED),
         "Invalid list: inner group is not repeated");
-    Preconditions.checkArgument(repeatedElement.getFieldCount() <= 1,
-        "Invalid list: repeated group is not a single field: %s", list);
+
+    Type listElement = ParquetSchemaUtil.determineListElementType(list);
+    if (listElement.isRepetition(Type.Repetition.REPEATED)) {
+      return visitTwoLevelList(list, listElement, visitor);
+    } else {
+      return visitThreeLevelList(list, listElement, visitor);
+    }
+  }
+
+  private static <T> T visitTwoLevelList(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    T elementResult = visitListElement(list, listElement, visitor);
+    return visitor.list(list, elementResult);
+  }
+
+  private static <T> T visitThreeLevelList(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    Type repeatedElement = list.getFields().get(0);
 
     visitor.beforeRepeatedElement(repeatedElement);
     try {
-      T elementResult = null;
-      if (repeatedElement.getFieldCount() > 0) {
-        Type elementField = repeatedElement.getType(0);
-        visitor.beforeElementField(elementField);
-        try {
-          elementResult = visit(elementField, visitor);
-        } finally {
-          visitor.afterElementField(elementField);
-        }
-      }
+      T elementResult = visitListElement(list, listElement, visitor);
 
       return visitor.list(list, elementResult);
-
     } finally {
       visitor.afterRepeatedElement(repeatedElement);
     }
   }
 
+  private static <T> T visitListElement(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    Type repeatedElement = list.getFields().get(0);

Review comment:
       Looks like this is the only time that the `list` is used, so I think this is worth a refactor.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r794110239



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java
##########
@@ -164,4 +165,62 @@ public Boolean primitive(PrimitiveType primitive) {
     }
   }
 
+  public static Type determineListElementType(GroupType array) {
+    Type repeated = array.getFields().get(0);
+    boolean isOldListElementType = isOldListElementType(array);
+
+    Preconditions.checkArgument(isOldListElementType ||
+            repeated.asGroupType().getFieldCount() <= 1,
+        "Invalid list: repeated group is not a single field: %s", array);
+
+    return isOldListElementType ? repeated : repeated.asGroupType().getType(0);

Review comment:
       Before, the visitor checked that the repeated group had at least 1 field. That prevented this from throwing `NoSuchElementException`.
   
   However, there are plenty of other places that just assumed that there would be a field and would call `getType(0)` without checking, so I think that leaving it out is okay. That means that we no longer need the check to make sure there is at least one field in the visitor.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r794112634



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/TypeWithSchemaVisitor.java
##########
@@ -149,6 +142,29 @@
     }
   }
 
+  private static <T> T visitTwoLevelList(Types.ListType iListType, Types.NestedField iListElement, GroupType pListType,
+      Type pListElement, TypeWithSchemaVisitor<T> visitor) {
+    T elementResult = visitField(iListElement, pListElement, visitor);
+
+    return visitor.list(iListType, pListType, elementResult);
+  }
+
+  private static <T> T visitThreeLevelList(Types.ListType iListType, Types.NestedField iListElement,
+      GroupType pListType, Type pListElement, TypeWithSchemaVisitor<T> visitor) {
+    visitor.fieldNames.push(pListType.getFieldName(0));
+
+    try {
+      T elementResult = null;
+      if (pListElement != null) {

Review comment:
       The `determineListElementType` method doesn't perform the field count check and return null, so this is always true. I think you may have thought that `List.get` would return null?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r795265536



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetTypeVisitor.java
##########
@@ -66,32 +66,52 @@
     Preconditions.checkArgument(list.getFieldCount() == 1,
         "Invalid list: does not contain single repeated field: %s", list);
 
-    GroupType repeatedElement = list.getFields().get(0).asGroupType();
+    Type repeatedElement = list.getFields().get(0);
     Preconditions.checkArgument(repeatedElement.isRepetition(Type.Repetition.REPEATED),
         "Invalid list: inner group is not repeated");
-    Preconditions.checkArgument(repeatedElement.getFieldCount() <= 1,
-        "Invalid list: repeated group is not a single field: %s", list);
+
+    Type listElement = ParquetSchemaUtil.determineListElementType(list);
+    if (listElement.isRepetition(Type.Repetition.REPEATED)) {
+      return visitTwoLevelList(list, listElement, visitor);
+    } else {
+      return visitThreeLevelList(list, listElement, visitor);
+    }
+  }
+
+  private static <T> T visitTwoLevelList(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    T elementResult = visitListElement(list, listElement, visitor);
+    return visitor.list(list, elementResult);
+  }
+
+  private static <T> T visitThreeLevelList(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    Type repeatedElement = list.getFields().get(0);

Review comment:
       Looks like the only place that `list` is used is to retrieve the repeated element and to visit the list. Visiting the list is done in both `visitTwoLevelList` and `visitThreeLevelList`. So I think the cleanest structure is to keep `visitListElement` and change these methods to visit just the element. `visitTwoLevelList` is not needed, the original method should call `visitListElement` directly. Then this function should accept `GroupType repeated, Type listElement, ParquetTypeVisitor<T> visitor` and return the element result.
   
   Then you can move `visitor.list(list, elementResult)` into the original method. That's pretty clean.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#issuecomment-1027244729


   Thanks, @SinghAsDev!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r791088471



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/MessageTypeToType.java
##########
@@ -96,11 +96,13 @@ public Type struct(GroupType struct, List<Type> fieldTypes) {
 
   @Override
   public Type list(GroupType array, Type elementType) {
-    GroupType repeated = array.getType(0).asGroupType();
-    org.apache.parquet.schema.Type element = repeated.getType(0);
+    org.apache.parquet.schema.Type repeated = array.getType(0);
+    org.apache.parquet.schema.Type repeatedElement = array.getFields().get(0);
+    boolean isElementType = ParquetSchemaUtil.isListElementType(repeatedElement, array.getName());
+    org.apache.parquet.schema.Type element = isElementType ? repeated : repeated.asGroupType().getType(0);
 
     Preconditions.checkArgument(
-        !element.isRepetition(Repetition.REPEATED),
+        isElementType || !element.isRepetition(Repetition.REPEATED),
         "Elements cannot have repetition REPEATED: %s", element);

Review comment:
       Yea, forgot to drop this in earlier update.  Dropping it.

##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java
##########
@@ -164,4 +164,48 @@ public Boolean primitive(PrimitiveType primitive) {
     }
   }
 
+  // Parquet LIST backwards-compatibility rules.
+  // https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules
+  public static boolean isOldListElementType(Type repeatedType, String parentName) {

Review comment:
       This is also used outside of iceberg-parquet, like iceberg-spark. Good point on making method to accept list directly, updating.

##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/MessageTypeToType.java
##########
@@ -96,11 +96,13 @@ public Type struct(GroupType struct, List<Type> fieldTypes) {
 
   @Override
   public Type list(GroupType array, Type elementType) {
-    GroupType repeated = array.getType(0).asGroupType();
-    org.apache.parquet.schema.Type element = repeated.getType(0);
+    org.apache.parquet.schema.Type repeated = array.getType(0);
+    org.apache.parquet.schema.Type repeatedElement = array.getFields().get(0);

Review comment:
       Likely a copy paste error, we only need one of them. Updating.

##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/PruneColumns.java
##########
@@ -108,15 +108,16 @@ public Type struct(GroupType struct, List<Type> fields) {
 
   @Override
   public Type list(GroupType list, Type element) {
-    GroupType repeated = list.getType(0).asGroupType();
-    Type originalElement = repeated.getType(0);
+    Type repeated = list.getType(0);
+    boolean isOldListElementType = ParquetSchemaUtil.isOldListElementType(repeated, list.getName());
+    Type originalElement = isOldListElementType ? repeated : repeated.asGroupType().getType(0);
     Integer elementId = getId(originalElement);
 
     if (elementId != null && selectedIds.contains(elementId)) {
       return list;
     } else if (element != null) {
       if (!Objects.equal(element, originalElement)) {
-        return list.withNewFields(repeated.withNewFields(element));
+        return list.withNewFields(repeated.asGroupType().withNewFields(element));

Review comment:
       Thanks for catching this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r791304880



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestCreateActions.java
##########
@@ -610,7 +623,71 @@ public void testStructOfThreeLevelLists() throws Exception {
     structOfThreeLevelLists(false);
   }
 
-  public void threeLevelList(boolean useLegacyMode) throws Exception {
+  @Test
+  public void testTwoLevelList() throws IOException {
+    spark.conf().set("spark.sql.parquet.writeLegacyFormat", true);
+
+    String tableName = sourceName("testTwoLevelList");
+    File location = temp.newFolder();
+
+    StructType sparkSchema =
+        new StructType(
+            new StructField[]{
+                new StructField(
+                        "col1", new ArrayType(
+                            new StructType(
+                                new StructField[]{
+                                    new StructField(
+                                        "col2",
+                                        DataTypes.IntegerType,
+                                        false,
+                                        Metadata.empty())
+                                }), false), true, Metadata.empty())});
+    String expectedParquetSchema =
+        "message spark_schema {\n" +
+            "  optional group col1 (LIST) {\n" +
+            "    repeated group array {\n" +
+            "      required int32 col2;\n" +
+            "    }\n" +
+            "  }\n" +
+            "}\n";
+
+
+    // generate parquet file with required schema
+    List<String> testData = Collections.singletonList("{\"col1\": [{\"col2\": 1}]}");
+    spark.read().schema(sparkSchema).json(
+            JavaSparkContext.fromSparkContext(spark.sparkContext()).parallelize(testData))
+        .coalesce(1).write().format("parquet").mode(SaveMode.Append).save(location.getPath());
+
+    File parquetFile = Arrays.stream(Objects.requireNonNull(location.listFiles(new FilenameFilter() {
+      @Override
+      public boolean accept(File dir, String name) {
+        return name.endsWith("parquet");
+      }
+    }))).findAny().get();
+
+    // verify generated parquet file has expected schema
+    ParquetFileReader pqReader = ParquetFileReader.open(HadoopInputFile.fromPath(new Path(parquetFile.getPath()),
+        new Configuration()));
+    MessageType schema = pqReader.getFooter().getFileMetaData().getSchema();
+    Assert.assertEquals(expectedParquetSchema, schema.toString());

Review comment:
       I did not understand this, I am already comparing parquet schemas here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r791320410



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/TypeWithSchemaVisitor.java
##########
@@ -149,6 +134,50 @@
     }
   }
 
+  private static <T> T visitTwoLevelList(
+      org.apache.iceberg.types.Type iType,
+      TypeWithSchemaVisitor<T> visitor,
+      GroupType group,
+      Type repeatedElement) {
+    Types.ListType list = null;
+    Types.NestedField element = null;
+    if (iType != null) {
+      list = iType.asListType();

Review comment:
       This can be pushed up to the method call. `iType.asListType()` passed in here makes the signature of this method simpler, `Types.ListType`. Same with the other method.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r784201889



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/MessageTypeToType.java
##########
@@ -96,11 +96,13 @@ public Type struct(GroupType struct, List<Type> fieldTypes) {
 
   @Override
   public Type list(GroupType array, Type elementType) {
-    GroupType repeated = array.getType(0).asGroupType();
-    org.apache.parquet.schema.Type element = repeated.getType(0);
+    org.apache.parquet.schema.Type repeated = array.getType(0);
+    org.apache.parquet.schema.Type repeatedElement = array.getFields().get(0);
+    boolean isElementType = ParquetSchemaUtil.isListElementType(repeatedElement, array.getName());
+    org.apache.parquet.schema.Type element = isElementType ? repeated : repeated.asGroupType().getType(0);
 
     Preconditions.checkArgument(
-        !element.isRepetition(Repetition.REPEATED),
+        isElementType || !element.isRepetition(Repetition.REPEATED),
         "Elements cannot have repetition REPEATED: %s", element);

Review comment:
       Should we change this message as well? Since now elements can be repeated but only in backwards compatibility situations?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r784204444



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/MessageTypeToType.java
##########
@@ -96,11 +96,13 @@ public Type struct(GroupType struct, List<Type> fieldTypes) {
 
   @Override
   public Type list(GroupType array, Type elementType) {
-    GroupType repeated = array.getType(0).asGroupType();
-    org.apache.parquet.schema.Type element = repeated.getType(0);
+    org.apache.parquet.schema.Type repeated = array.getType(0);
+    org.apache.parquet.schema.Type repeatedElement = array.getFields().get(0);
+    boolean isElementType = ParquetSchemaUtil.isListElementType(repeatedElement, array.getName());

Review comment:
       Is elementType here really is "isBackwardCompatibleListElement" or something like that? I think we should rename this to make it more clear. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r784203495



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java
##########
@@ -164,4 +164,48 @@ public Boolean primitive(PrimitiveType primitive) {
     }
   }
 
+  // Parquet LIST backwards-compatibility rules.
+  // https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules
+  public static boolean isListElementType(Type repeatedType, String parentName) {

Review comment:
       Perhaps this should be isOldListElementType or something to indicate this is for backwards compatibility in the name, we also switch the name to "isElementType" later which I think we should also rename to show this is the backwards compatibility path.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r791319244



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestCreateActions.java
##########
@@ -610,7 +623,71 @@ public void testStructOfThreeLevelLists() throws Exception {
     structOfThreeLevelLists(false);
   }
 
-  public void threeLevelList(boolean useLegacyMode) throws Exception {
+  @Test
+  public void testTwoLevelList() throws IOException {
+    spark.conf().set("spark.sql.parquet.writeLegacyFormat", true);
+
+    String tableName = sourceName("testTwoLevelList");
+    File location = temp.newFolder();
+
+    StructType sparkSchema =
+        new StructType(
+            new StructField[]{
+                new StructField(
+                        "col1", new ArrayType(
+                            new StructType(
+                                new StructField[]{
+                                    new StructField(
+                                        "col2",
+                                        DataTypes.IntegerType,
+                                        false,
+                                        Metadata.empty())
+                                }), false), true, Metadata.empty())});
+    String expectedParquetSchema =
+        "message spark_schema {\n" +
+            "  optional group col1 (LIST) {\n" +
+            "    repeated group array {\n" +
+            "      required int32 col2;\n" +
+            "    }\n" +
+            "  }\n" +
+            "}\n";
+
+
+    // generate parquet file with required schema
+    List<String> testData = Collections.singletonList("{\"col1\": [{\"col2\": 1}]}");
+    spark.read().schema(sparkSchema).json(
+            JavaSparkContext.fromSparkContext(spark.sparkContext()).parallelize(testData))
+        .coalesce(1).write().format("parquet").mode(SaveMode.Append).save(location.getPath());
+
+    File parquetFile = Arrays.stream(Objects.requireNonNull(location.listFiles(new FilenameFilter() {
+      @Override
+      public boolean accept(File dir, String name) {
+        return name.endsWith("parquet");
+      }
+    }))).findAny().get();
+
+    // verify generated parquet file has expected schema
+    ParquetFileReader pqReader = ParquetFileReader.open(HadoopInputFile.fromPath(new Path(parquetFile.getPath()),
+        new Configuration()));
+    MessageType schema = pqReader.getFooter().getFileMetaData().getSchema();
+    Assert.assertEquals(expectedParquetSchema, schema.toString());

Review comment:
       You're converting them to String and then checking, which makes this brittle. Can you compare the schemas without converting to String?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r791316278



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java
##########
@@ -164,4 +165,62 @@ public Boolean primitive(PrimitiveType primitive) {
     }
   }
 
+  public static Type getListElementType(GroupType array) {

Review comment:
       Iceberg doesn't use `get` in method names. There is probably a more specific verb you should use here instead, like `determine` or `find`.

##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java
##########
@@ -164,4 +165,62 @@ public Boolean primitive(PrimitiveType primitive) {
     }
   }
 
+  public static Type getListElementType(GroupType array) {
+    Type repeated = array.getFields().get(0);
+    boolean isOldListElementType = ParquetSchemaUtil.isOldListElementType(array);

Review comment:
       Nit: this uses the current class as a prefix for a static method call.

##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ApplyNameMapping.java
##########
@@ -67,11 +67,17 @@ public Type list(GroupType list, Type elementType) {
     Preconditions.checkArgument(elementType != null,
         "List type must have element field");
 
+    boolean isOldListElementType = ParquetSchemaUtil.isOldListElementType(list);
     MappedField field = nameMapping.find(currentPath());
-    Type listType = Types.buildGroup(list.getRepetition())
-        .as(LogicalTypeAnnotation.listType())
-        .repeatedGroup().addFields(elementType).named(list.getFieldName(0))
-        .named(list.getName());
+
+    Types.GroupBuilder<GroupType> listBuilder = Types.buildGroup(list.getRepetition())
+        .as(LogicalTypeAnnotation.listType());
+    if (isOldListElementType) {

Review comment:
       I think this can check whether the repetition level of the element is `REPEATED` to determine whether it is a 2-level list. I think that would be better than allowing the `isOldListElementType` check in multiple places.

##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/PruneColumns.java
##########
@@ -108,15 +108,20 @@ public Type struct(GroupType struct, List<Type> fields) {
 
   @Override
   public Type list(GroupType list, Type element) {
-    GroupType repeated = list.getType(0).asGroupType();
-    Type originalElement = repeated.getType(0);
+    Type repeated = list.getType(0);
+    boolean isOldListElementType = ParquetSchemaUtil.isOldListElementType(list);
+    Type originalElement = ParquetSchemaUtil.getListElementType(list);
     Integer elementId = getId(originalElement);
 
     if (elementId != null && selectedIds.contains(elementId)) {
       return list;
     } else if (element != null) {
       if (!Objects.equal(element, originalElement)) {
-        return list.withNewFields(repeated.withNewFields(element));
+        if (isOldListElementType) {

Review comment:
       I think this should be a check for whether the element is `REPEATED` instead.

##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/TypeWithSchemaVisitor.java
##########
@@ -63,29 +63,14 @@
             Preconditions.checkArgument(group.getFieldCount() == 1,
                 "Invalid list: does not contain single repeated field: %s", group);
 
-            GroupType repeatedElement = group.getFields().get(0).asGroupType();
+            Type repeatedElement = group.getFields().get(0);
             Preconditions.checkArgument(repeatedElement.isRepetition(Type.Repetition.REPEATED),
                 "Invalid list: inner group is not repeated");
-            Preconditions.checkArgument(repeatedElement.getFieldCount() <= 1,
-                "Invalid list: repeated group is not a single field: %s", group);
 
-            Types.ListType list = null;
-            Types.NestedField element = null;
-            if (iType != null) {
-              list = iType.asListType();
-              element = list.fields().get(0);
-            }
-
-            visitor.fieldNames.push(repeatedElement.getName());
-            try {
-              T elementResult = null;
-              if (repeatedElement.getFieldCount() > 0) {
-                elementResult = visitField(element, repeatedElement.getType(0), visitor);
-              }
-
-              return visitor.list(list, group, elementResult);
-            } finally {
-              visitor.fieldNames.pop();
+            if (ParquetSchemaUtil.isOldListElementType(group)) {

Review comment:
       Check for REPEATED and pass the element?

##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestCreateActions.java
##########
@@ -610,7 +623,71 @@ public void testStructOfThreeLevelLists() throws Exception {
     structOfThreeLevelLists(false);
   }
 
-  public void threeLevelList(boolean useLegacyMode) throws Exception {
+  @Test
+  public void testTwoLevelList() throws IOException {
+    spark.conf().set("spark.sql.parquet.writeLegacyFormat", true);
+
+    String tableName = sourceName("testTwoLevelList");
+    File location = temp.newFolder();
+
+    StructType sparkSchema =
+        new StructType(
+            new StructField[]{
+                new StructField(
+                        "col1", new ArrayType(
+                            new StructType(
+                                new StructField[]{
+                                    new StructField(
+                                        "col2",
+                                        DataTypes.IntegerType,
+                                        false,
+                                        Metadata.empty())
+                                }), false), true, Metadata.empty())});
+    String expectedParquetSchema =
+        "message spark_schema {\n" +
+            "  optional group col1 (LIST) {\n" +
+            "    repeated group array {\n" +
+            "      required int32 col2;\n" +
+            "    }\n" +
+            "  }\n" +
+            "}\n";
+
+
+    // generate parquet file with required schema
+    List<String> testData = Collections.singletonList("{\"col1\": [{\"col2\": 1}]}");
+    spark.read().schema(sparkSchema).json(
+            JavaSparkContext.fromSparkContext(spark.sparkContext()).parallelize(testData))
+        .coalesce(1).write().format("parquet").mode(SaveMode.Append).save(location.getPath());
+
+    File parquetFile = Arrays.stream(Objects.requireNonNull(location.listFiles(new FilenameFilter() {
+      @Override
+      public boolean accept(File dir, String name) {
+        return name.endsWith("parquet");
+      }
+    }))).findAny().get();
+
+    // verify generated parquet file has expected schema
+    ParquetFileReader pqReader = ParquetFileReader.open(HadoopInputFile.fromPath(new Path(parquetFile.getPath()),
+        new Configuration()));
+    MessageType schema = pqReader.getFooter().getFileMetaData().getSchema();
+    Assert.assertEquals(expectedParquetSchema, schema.toString());

Review comment:
       You're converting them to String and then checking, which makes this brittle. Can you compare the schemas without converting to String?

##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/TypeWithSchemaVisitor.java
##########
@@ -149,6 +134,50 @@
     }
   }
 
+  private static <T> T visitTwoLevelList(
+      org.apache.iceberg.types.Type iType,
+      TypeWithSchemaVisitor<T> visitor,
+      GroupType group,
+      Type repeatedElement) {

Review comment:
       Iceberg doesn't use one line per argument. Can you wrap these at 120 chars?

##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/TypeWithSchemaVisitor.java
##########
@@ -149,6 +134,50 @@
     }
   }
 
+  private static <T> T visitTwoLevelList(
+      org.apache.iceberg.types.Type iType,
+      TypeWithSchemaVisitor<T> visitor,
+      GroupType group,
+      Type repeatedElement) {
+    Types.ListType list = null;
+    Types.NestedField element = null;
+    if (iType != null) {
+      list = iType.asListType();

Review comment:
       This can be pushed up to the method call. `iType.asListType()` passed in here makes the signature of this method simpler, `Types.ListType`. Same with the other method.

##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetTypeVisitor.java
##########
@@ -66,17 +66,16 @@
     Preconditions.checkArgument(list.getFieldCount() == 1,
         "Invalid list: does not contain single repeated field: %s", list);
 
-    GroupType repeatedElement = list.getFields().get(0).asGroupType();
+    Type repeatedElement = list.getFields().get(0);
     Preconditions.checkArgument(repeatedElement.isRepetition(Type.Repetition.REPEATED),
         "Invalid list: inner group is not repeated");
-    Preconditions.checkArgument(repeatedElement.getFieldCount() <= 1,
-        "Invalid list: repeated group is not a single field: %s", list);
+
+    Type elementField = ParquetSchemaUtil.getListElementType(list);
 
     visitor.beforeRepeatedElement(repeatedElement);
     try {
       T elementResult = null;
-      if (repeatedElement.getFieldCount() > 0) {
-        Type elementField = repeatedElement.getType(0);
+      if (repeatedElement.isPrimitive() || repeatedElement.asGroupType().getFieldCount() > 0) {

Review comment:
       The changes to this class are inconsistent with the changes to the `TypeWithSchemaVisitor`. Here, the repeated element is always visited (`beforeRepeatedElement` call above) and may be processed again as the element. The other avoids pushing the name on the stack. If `beforeRepeatedElement` were used to track names, I think it would get a duplicate name for the repeated group.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r790075882



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReader.java
##########
@@ -169,6 +178,60 @@ public void testInt96TimestampProducedBySparkIsReadCorrectly() throws IOExceptio
     }
   }
 
+  @Test
+  public void testWriteReadAvroBinary() throws IOException {
+    String schema = "{" +

Review comment:
       Why not build this with the API?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r795249399



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/TypeWithSchemaVisitor.java
##########
@@ -149,6 +142,29 @@
     }
   }
 
+  private static <T> T visitTwoLevelList(Types.ListType iListType, Types.NestedField iListElement, GroupType pListType,
+      Type pListElement, TypeWithSchemaVisitor<T> visitor) {
+    T elementResult = visitField(iListElement, pListElement, visitor);
+
+    return visitor.list(iListType, pListType, elementResult);
+  }
+
+  private static <T> T visitThreeLevelList(Types.ListType iListType, Types.NestedField iListElement,
+      GroupType pListType, Type pListElement, TypeWithSchemaVisitor<T> visitor) {
+    visitor.fieldNames.push(pListType.getFieldName(0));
+
+    try {
+      T elementResult = null;
+      if (pListElement != null) {

Review comment:
       Sg, thanks




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r795266372



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/TypeWithSchemaVisitor.java
##########
@@ -149,6 +142,29 @@
     }
   }
 
+  private static <T> T visitTwoLevelList(Types.ListType iListType, Types.NestedField iListElement, GroupType pListType,
+      Type pListElement, TypeWithSchemaVisitor<T> visitor) {
+    T elementResult = visitField(iListElement, pListElement, visitor);
+

Review comment:
       Don't worry about where still is incorrect in other places. We'll eventually track those down and fix them, but we do want to keep style from diverging when it is caught by a review. So please do fix this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r794109173



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetTypeVisitor.java
##########
@@ -66,32 +66,52 @@
     Preconditions.checkArgument(list.getFieldCount() == 1,
         "Invalid list: does not contain single repeated field: %s", list);
 
-    GroupType repeatedElement = list.getFields().get(0).asGroupType();
+    Type repeatedElement = list.getFields().get(0);
     Preconditions.checkArgument(repeatedElement.isRepetition(Type.Repetition.REPEATED),
         "Invalid list: inner group is not repeated");
-    Preconditions.checkArgument(repeatedElement.getFieldCount() <= 1,
-        "Invalid list: repeated group is not a single field: %s", list);
+
+    Type listElement = ParquetSchemaUtil.determineListElementType(list);
+    if (listElement.isRepetition(Type.Repetition.REPEATED)) {
+      return visitTwoLevelList(list, listElement, visitor);
+    } else {
+      return visitThreeLevelList(list, listElement, visitor);
+    }
+  }
+
+  private static <T> T visitTwoLevelList(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    T elementResult = visitListElement(list, listElement, visitor);
+    return visitor.list(list, elementResult);
+  }
+
+  private static <T> T visitThreeLevelList(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    Type repeatedElement = list.getFields().get(0);
 
     visitor.beforeRepeatedElement(repeatedElement);
     try {
-      T elementResult = null;
-      if (repeatedElement.getFieldCount() > 0) {
-        Type elementField = repeatedElement.getType(0);
-        visitor.beforeElementField(elementField);
-        try {
-          elementResult = visit(elementField, visitor);
-        } finally {
-          visitor.afterElementField(elementField);
-        }
-      }
+      T elementResult = visitListElement(list, listElement, visitor);
 
       return visitor.list(list, elementResult);
-
     } finally {
       visitor.afterRepeatedElement(repeatedElement);
     }
   }
 
+  private static <T> T visitListElement(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    Type repeatedElement = list.getFields().get(0);
+    T elementResult = null;
+
+    if (repeatedElement.isPrimitive() || repeatedElement.asGroupType().getFieldCount() > 0) {

Review comment:
       The purpose of this check was to make sure that there is a list element. But this method already knows that there is one because it was passed in. So this check isn't appropriate at this level and should be removed. That's good because then this method only needs the `listElement` and nothing else.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r789322218



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ApplyNameMapping.java
##########
@@ -67,11 +67,18 @@ public Type list(GroupType list, Type elementType) {
     Preconditions.checkArgument(elementType != null,
         "List type must have element field");
 
+    Type repeatedType = list.getFields().get(0);
+    boolean isOldListElementType = ParquetSchemaUtil.isOldListElementType(repeatedType, list.getName());
     MappedField field = nameMapping.find(currentPath());
-    Type listType = Types.buildGroup(list.getRepetition())
-        .as(LogicalTypeAnnotation.listType())
-        .repeatedGroup().addFields(elementType).named(list.getFieldName(0))
-        .named(list.getName());
+
+    Types.GroupBuilder<GroupType> listBuilder = Types.buildGroup(list.getRepetition())
+        .as(LogicalTypeAnnotation.listType());
+    if (isOldListElementType) {
+      listBuilder.addFields(elementType);

Review comment:
       I think that this handling is correct for `elementType` when it was a 2-level list.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r789320918



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/MessageTypeToType.java
##########
@@ -96,11 +96,13 @@ public Type struct(GroupType struct, List<Type> fieldTypes) {
 
   @Override
   public Type list(GroupType array, Type elementType) {
-    GroupType repeated = array.getType(0).asGroupType();
-    org.apache.parquet.schema.Type element = repeated.getType(0);
+    org.apache.parquet.schema.Type repeated = array.getType(0);
+    org.apache.parquet.schema.Type repeatedElement = array.getFields().get(0);
+    boolean isOldListElementType = ParquetSchemaUtil.isOldListElementType(repeatedElement, array.getName());
+    org.apache.parquet.schema.Type element = isOldListElementType ? repeated : repeated.asGroupType().getType(0);

Review comment:
       I think it may be cleaner to have a util method that returns the list element `Type` instead of doing all the logic here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r792110439



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestCreateActions.java
##########
@@ -610,7 +623,71 @@ public void testStructOfThreeLevelLists() throws Exception {
     structOfThreeLevelLists(false);
   }
 
-  public void threeLevelList(boolean useLegacyMode) throws Exception {
+  @Test
+  public void testTwoLevelList() throws IOException {
+    spark.conf().set("spark.sql.parquet.writeLegacyFormat", true);
+
+    String tableName = sourceName("testTwoLevelList");
+    File location = temp.newFolder();
+
+    StructType sparkSchema =
+        new StructType(
+            new StructField[]{
+                new StructField(
+                        "col1", new ArrayType(
+                            new StructType(
+                                new StructField[]{
+                                    new StructField(
+                                        "col2",
+                                        DataTypes.IntegerType,
+                                        false,
+                                        Metadata.empty())
+                                }), false), true, Metadata.empty())});
+    String expectedParquetSchema =
+        "message spark_schema {\n" +
+            "  optional group col1 (LIST) {\n" +
+            "    repeated group array {\n" +
+            "      required int32 col2;\n" +
+            "    }\n" +
+            "  }\n" +
+            "}\n";
+
+
+    // generate parquet file with required schema
+    List<String> testData = Collections.singletonList("{\"col1\": [{\"col2\": 1}]}");
+    spark.read().schema(sparkSchema).json(
+            JavaSparkContext.fromSparkContext(spark.sparkContext()).parallelize(testData))
+        .coalesce(1).write().format("parquet").mode(SaveMode.Append).save(location.getPath());
+
+    File parquetFile = Arrays.stream(Objects.requireNonNull(location.listFiles(new FilenameFilter() {
+      @Override
+      public boolean accept(File dir, String name) {
+        return name.endsWith("parquet");
+      }
+    }))).findAny().get();
+
+    // verify generated parquet file has expected schema
+    ParquetFileReader pqReader = ParquetFileReader.open(HadoopInputFile.fromPath(new Path(parquetFile.getPath()),
+        new Configuration()));
+    MessageType schema = pqReader.getFooter().getFileMetaData().getSchema();
+    Assert.assertEquals(expectedParquetSchema, schema.toString());

Review comment:
       Got it, the string comparison in tests really helps in understanding the differences though. Updated to compare `MessageType` instead.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r792111562



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetTypeVisitor.java
##########
@@ -66,17 +66,16 @@
     Preconditions.checkArgument(list.getFieldCount() == 1,
         "Invalid list: does not contain single repeated field: %s", list);
 
-    GroupType repeatedElement = list.getFields().get(0).asGroupType();
+    Type repeatedElement = list.getFields().get(0);
     Preconditions.checkArgument(repeatedElement.isRepetition(Type.Repetition.REPEATED),
         "Invalid list: inner group is not repeated");
-    Preconditions.checkArgument(repeatedElement.getFieldCount() <= 1,
-        "Invalid list: repeated group is not a single field: %s", list);
+
+    Type elementField = ParquetSchemaUtil.getListElementType(list);
 
     visitor.beforeRepeatedElement(repeatedElement);
     try {
       T elementResult = null;
-      if (repeatedElement.getFieldCount() > 0) {
-        Type elementField = repeatedElement.getType(0);
+      if (repeatedElement.isPrimitive() || repeatedElement.asGroupType().getFieldCount() > 0) {

Review comment:
       Thanks, addressed this. It would be nice to have some tests to check this behavior. But, I don't think we need to block on that, unless you disagree.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r789324875



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/PruneColumns.java
##########
@@ -108,15 +108,16 @@ public Type struct(GroupType struct, List<Type> fields) {
 
   @Override
   public Type list(GroupType list, Type element) {
-    GroupType repeated = list.getType(0).asGroupType();
-    Type originalElement = repeated.getType(0);
+    Type repeated = list.getType(0);
+    boolean isOldListElementType = ParquetSchemaUtil.isOldListElementType(repeated, list.getName());
+    Type originalElement = isOldListElementType ? repeated : repeated.asGroupType().getType(0);
     Integer elementId = getId(originalElement);
 
     if (elementId != null && selectedIds.contains(elementId)) {
       return list;
     } else if (element != null) {
       if (!Objects.equal(element, originalElement)) {
-        return list.withNewFields(repeated.withNewFields(element));
+        return list.withNewFields(repeated.asGroupType().withNewFields(element));

Review comment:
       I think this is wrong. If this is a 2-level list, then `repeated` is the element type and could be primitive. So this needs to rebuild the list correctly if it is a 2-level list.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r790077261



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestCreateActions.java
##########
@@ -610,7 +623,71 @@ public void testStructOfThreeLevelLists() throws Exception {
     structOfThreeLevelLists(false);
   }
 
-  public void threeLevelList(boolean useLegacyMode) throws Exception {
+  @Test
+  public void testTwoLevelList() throws IOException {
+    spark.conf().set("spark.sql.parquet.writeLegacyFormat", true);
+
+    String tableName = sourceName("testTwoLevelList");
+    File location = temp.newFolder();
+
+    StructType sparkSchema =
+        new StructType(
+            new StructField[]{
+                new StructField(
+                        "col1", new ArrayType(
+                            new StructType(
+                                new StructField[]{
+                                    new StructField(
+                                        "col2",
+                                        DataTypes.IntegerType,
+                                        false,
+                                        Metadata.empty())
+                                }), false), true, Metadata.empty())});
+    String expectedParquetSchema =
+        "message spark_schema {\n" +
+            "  optional group col1 (LIST) {\n" +
+            "    repeated group array {\n" +
+            "      required int32 col2;\n" +
+            "    }\n" +
+            "  }\n" +
+            "}\n";
+
+
+    // generate parquet file with required schema
+    List<String> testData = Collections.singletonList("{\"col1\": [{\"col2\": 1}]}");
+    spark.read().schema(sparkSchema).json(
+            JavaSparkContext.fromSparkContext(spark.sparkContext()).parallelize(testData))
+        .coalesce(1).write().format("parquet").mode(SaveMode.Append).save(location.getPath());
+
+    File parquetFile = Arrays.stream(Objects.requireNonNull(location.listFiles(new FilenameFilter() {
+      @Override
+      public boolean accept(File dir, String name) {
+        return name.endsWith("parquet");
+      }
+    }))).findAny().get();
+
+    // verify generated parquet file has expected schema
+    ParquetFileReader pqReader = ParquetFileReader.open(HadoopInputFile.fromPath(new Path(parquetFile.getPath()),
+        new Configuration()));
+    MessageType schema = pqReader.getFooter().getFileMetaData().getSchema();
+    Assert.assertEquals(expectedParquetSchema, schema.toString());

Review comment:
       Can you not check equality with Parquet schemas?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r795248128



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/TypeWithSchemaVisitor.java
##########
@@ -149,6 +142,29 @@
     }
   }
 
+  private static <T> T visitTwoLevelList(Types.ListType iListType, Types.NestedField iListElement, GroupType pListType,
+      Type pListElement, TypeWithSchemaVisitor<T> visitor) {
+    T elementResult = visitField(iListElement, pListElement, visitor);
+

Review comment:
       I think the style used in various parts of code are different. For example, IIUC `ParquetReadSupport.prepareForRead` is different than what you are saying. Earlier you also had mentioned Iceberg does not use new param at new line pattern. Updating this part to keep the same level (align params start with previous line) and wrap.
   
   Let me know which style we should try to follow and I can try to update the intellij-style that we provide with Iceberg repo accordingly. I don't know if it is possible, but I can try.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r795248411



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java
##########
@@ -164,4 +165,62 @@ public Boolean primitive(PrimitiveType primitive) {
     }
   }
 
+  public static Type determineListElementType(GroupType array) {
+    Type repeated = array.getFields().get(0);
+    boolean isOldListElementType = isOldListElementType(array);
+
+    Preconditions.checkArgument(isOldListElementType ||
+            repeated.asGroupType().getFieldCount() <= 1,
+        "Invalid list: repeated group is not a single field: %s", array);

Review comment:
       Good point, I probably got this while refactoring `ParquetTypeVisitor`. Dropping, thanks.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r794114102



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/TypeWithSchemaVisitor.java
##########
@@ -149,6 +142,29 @@
     }
   }
 
+  private static <T> T visitTwoLevelList(Types.ListType iListType, Types.NestedField iListElement, GroupType pListType,
+      Type pListElement, TypeWithSchemaVisitor<T> visitor) {
+    T elementResult = visitField(iListElement, pListElement, visitor);
+

Review comment:
       Style: extra newline. Also, we either wrap argument lists at the same level or start all arguments on the next line at 2 indents.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r773398392



##########
File path: parquet/src/test/java/org/apache/iceberg/parquet/TestParquetSchemaUtil.java
##########
@@ -204,6 +204,54 @@ public void testSchemaConversionForHiveStyleLists() {
     Assert.assertEquals("Schema must match", expectedSchema.asStruct(), actualSchema.asStruct());
   }
 
+  @Test
+  public void testSchemaConversionForHiveStyleTwoLevelList() {
+    String messageType =
+            "message root {" +
+            "  required group f0 {" +
+            "    required group f00 (LIST) {" +
+            "      repeated group element {" +
+            "        required int32 f000;" +
+            "        optional int64 f001;" +
+            "      }" +
+            "    }" +
+            "  }" +
+            "}";
+
+    MessageType parquetScehma = MessageTypeParser.parseMessageType(messageType);
+    Schema expectedSchema = new Schema(
+            required(1, "f0", Types.StructType.of(

Review comment:
       Nit: A lot of this looks to be over-indented. Here, I would expect `required` to be indented 4 spaces after the start of `Schema` on the line above it.
   
   Do you use IntelliJ? If so, there's a style guide checked in you can use. It doesn't handle everything, but it will handle the most common concerns (indentation, etc).
   
   Instructions for Code formatter: https://iceberg.apache.org/#community/#configuring-code-formatter-for-intellij-idea




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r786228377



##########
File path: parquet/src/test/java/org/apache/iceberg/parquet/TestParquetSchemaUtil.java
##########
@@ -204,6 +204,55 @@ public void testSchemaConversionForHiveStyleLists() {
     Assert.assertEquals("Schema must match", expectedSchema.asStruct(), actualSchema.asStruct());
   }
 
+  @Test
+  public void testSchemaConversionForHiveStyleTwoLevelList() {

Review comment:
       The other two cases which are specific to avro and thrift are only different in how they name the element, which we don;t care about and is tested by other two level lists tests. However, it is definitely a good idea to explicitly add tests even if they are a bit redundant as it helps in understanding behavior expectations. Adding it shortly, thanks.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r791318209



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/PruneColumns.java
##########
@@ -108,15 +108,20 @@ public Type struct(GroupType struct, List<Type> fields) {
 
   @Override
   public Type list(GroupType list, Type element) {
-    GroupType repeated = list.getType(0).asGroupType();
-    Type originalElement = repeated.getType(0);
+    Type repeated = list.getType(0);
+    boolean isOldListElementType = ParquetSchemaUtil.isOldListElementType(list);
+    Type originalElement = ParquetSchemaUtil.getListElementType(list);
     Integer elementId = getId(originalElement);
 
     if (elementId != null && selectedIds.contains(elementId)) {
       return list;
     } else if (element != null) {
       if (!Objects.equal(element, originalElement)) {
-        return list.withNewFields(repeated.withNewFields(element));
+        if (isOldListElementType) {

Review comment:
       I think this should be a check for whether the element is `REPEATED` instead.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue edited a comment on pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue edited a comment on pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#issuecomment-1025256683


   @SinghAsDev, looks like it is missing a test for Iceberg generics. I see the update for it now. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#issuecomment-1025256683


   @SinghAsDev, looks like it is missing a test for Iceberg generics. I see the update for it now. THanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r794107201



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetTypeVisitor.java
##########
@@ -66,32 +66,52 @@
     Preconditions.checkArgument(list.getFieldCount() == 1,
         "Invalid list: does not contain single repeated field: %s", list);
 
-    GroupType repeatedElement = list.getFields().get(0).asGroupType();
+    Type repeatedElement = list.getFields().get(0);
     Preconditions.checkArgument(repeatedElement.isRepetition(Type.Repetition.REPEATED),
         "Invalid list: inner group is not repeated");
-    Preconditions.checkArgument(repeatedElement.getFieldCount() <= 1,
-        "Invalid list: repeated group is not a single field: %s", list);
+
+    Type listElement = ParquetSchemaUtil.determineListElementType(list);
+    if (listElement.isRepetition(Type.Repetition.REPEATED)) {
+      return visitTwoLevelList(list, listElement, visitor);
+    } else {
+      return visitThreeLevelList(list, listElement, visitor);
+    }
+  }
+
+  private static <T> T visitTwoLevelList(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    T elementResult = visitListElement(list, listElement, visitor);
+    return visitor.list(list, elementResult);
+  }
+
+  private static <T> T visitThreeLevelList(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    Type repeatedElement = list.getFields().get(0);

Review comment:
       The repeated element was already extracted in the caller. Why not pass it in?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r786221839



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/MessageTypeToType.java
##########
@@ -96,11 +96,13 @@ public Type struct(GroupType struct, List<Type> fieldTypes) {
 
   @Override
   public Type list(GroupType array, Type elementType) {
-    GroupType repeated = array.getType(0).asGroupType();
-    org.apache.parquet.schema.Type element = repeated.getType(0);
+    org.apache.parquet.schema.Type repeated = array.getType(0);
+    org.apache.parquet.schema.Type repeatedElement = array.getFields().get(0);
+    boolean isElementType = ParquetSchemaUtil.isListElementType(repeatedElement, array.getName());

Review comment:
       To be consistent with `IsOldListElementType` method name, I will rename this to `isOldListElementType`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r789318600



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/MessageTypeToType.java
##########
@@ -96,11 +96,13 @@ public Type struct(GroupType struct, List<Type> fieldTypes) {
 
   @Override
   public Type list(GroupType array, Type elementType) {
-    GroupType repeated = array.getType(0).asGroupType();
-    org.apache.parquet.schema.Type element = repeated.getType(0);
+    org.apache.parquet.schema.Type repeated = array.getType(0);
+    org.apache.parquet.schema.Type repeatedElement = array.getFields().get(0);

Review comment:
       From Parquet:
   
   ```java
     public Type getType(int index) {
       return fields.get(index);
     }
   
     public List<Type> getFields() {
       return fields;
     }
   ```
   
   So `g.getType(0)` and `g.getFields().get(0)` will return the same object. That makes me suspect what's happening here. What was your intent with both `repeated` and `repeatedElement`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r790075128



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/TypeWithSchemaVisitor.java
##########
@@ -76,16 +81,22 @@
               element = list.fields().get(0);
             }
 
-            visitor.fieldNames.push(repeatedElement.getName());
+            if (!isOldListElementType) {
+              visitor.fieldNames.push(repeatedElement.getName());
+            }

Review comment:
       This needs newlines between control flow blocks. It also warrants a comment about why this is skipping the name handling.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r789323342



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java
##########
@@ -164,4 +164,48 @@ public Boolean primitive(PrimitiveType primitive) {
     }
   }
 
+  // Parquet LIST backwards-compatibility rules.
+  // https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules
+  public static boolean isOldListElementType(Type repeatedType, String parentName) {

Review comment:
       It also seems like this is always used after extracting the repeated type. It might be easier just to pass in the list itself and return whether it is a 2-level list or not.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r791319575



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/TypeWithSchemaVisitor.java
##########
@@ -149,6 +134,50 @@
     }
   }
 
+  private static <T> T visitTwoLevelList(
+      org.apache.iceberg.types.Type iType,
+      TypeWithSchemaVisitor<T> visitor,
+      GroupType group,
+      Type repeatedElement) {

Review comment:
       Iceberg doesn't use one line per argument. Can you wrap these at 120 chars?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r790077667



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestCreateActions.java
##########
@@ -610,7 +623,71 @@ public void testStructOfThreeLevelLists() throws Exception {
     structOfThreeLevelLists(false);
   }
 
-  public void threeLevelList(boolean useLegacyMode) throws Exception {
+  @Test
+  public void testTwoLevelList() throws IOException {
+    spark.conf().set("spark.sql.parquet.writeLegacyFormat", true);
+
+    String tableName = sourceName("testTwoLevelList");
+    File location = temp.newFolder();
+
+    StructType sparkSchema =
+        new StructType(
+            new StructField[]{
+                new StructField(
+                        "col1", new ArrayType(
+                            new StructType(
+                                new StructField[]{
+                                    new StructField(
+                                        "col2",
+                                        DataTypes.IntegerType,
+                                        false,
+                                        Metadata.empty())
+                                }), false), true, Metadata.empty())});
+    String expectedParquetSchema =

Review comment:
       I think this test case warrants a comment that explains that this is not a 3-level list even though it looks like one. It is a 2-level list where the items are structs with 1 field.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r790076972



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestCreateActions.java
##########
@@ -610,7 +623,71 @@ public void testStructOfThreeLevelLists() throws Exception {
     structOfThreeLevelLists(false);
   }
 
-  public void threeLevelList(boolean useLegacyMode) throws Exception {
+  @Test
+  public void testTwoLevelList() throws IOException {
+    spark.conf().set("spark.sql.parquet.writeLegacyFormat", true);

Review comment:
       I think we should guarantee that this is not set for any other test cases.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r789316594



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java
##########
@@ -164,4 +164,48 @@ public Boolean primitive(PrimitiveType primitive) {
     }
   }
 
+  // Parquet LIST backwards-compatibility rules.
+  // https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules
+  public static boolean isOldListElementType(Type repeatedType, String parentName) {

Review comment:
       Does this need to be public? I don't see much value in Iceberg exposing it if we don't need to.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r794112634



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/TypeWithSchemaVisitor.java
##########
@@ -149,6 +142,29 @@
     }
   }
 
+  private static <T> T visitTwoLevelList(Types.ListType iListType, Types.NestedField iListElement, GroupType pListType,
+      Type pListElement, TypeWithSchemaVisitor<T> visitor) {
+    T elementResult = visitField(iListElement, pListElement, visitor);
+
+    return visitor.list(iListType, pListType, elementResult);
+  }
+
+  private static <T> T visitThreeLevelList(Types.ListType iListType, Types.NestedField iListElement,
+      GroupType pListType, Type pListElement, TypeWithSchemaVisitor<T> visitor) {
+    visitor.fieldNames.push(pListType.getFieldName(0));
+
+    try {
+      T elementResult = null;
+      if (pListElement != null) {

Review comment:
       The `determineListElementType` method doesn't perform the field count check and return null, so this is always true. I think you may have thought that `List.get` would return null? I think you can remove this null check and always visit the field.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r795249047



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetTypeVisitor.java
##########
@@ -66,32 +66,52 @@
     Preconditions.checkArgument(list.getFieldCount() == 1,
         "Invalid list: does not contain single repeated field: %s", list);
 
-    GroupType repeatedElement = list.getFields().get(0).asGroupType();
+    Type repeatedElement = list.getFields().get(0);
     Preconditions.checkArgument(repeatedElement.isRepetition(Type.Repetition.REPEATED),
         "Invalid list: inner group is not repeated");
-    Preconditions.checkArgument(repeatedElement.getFieldCount() <= 1,
-        "Invalid list: repeated group is not a single field: %s", list);
+
+    Type listElement = ParquetSchemaUtil.determineListElementType(list);
+    if (listElement.isRepetition(Type.Repetition.REPEATED)) {
+      return visitTwoLevelList(list, listElement, visitor);
+    } else {
+      return visitThreeLevelList(list, listElement, visitor);
+    }
+  }
+
+  private static <T> T visitTwoLevelList(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    T elementResult = visitListElement(list, listElement, visitor);
+    return visitor.list(list, elementResult);
+  }
+
+  private static <T> T visitThreeLevelList(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    Type repeatedElement = list.getFields().get(0);

Review comment:
       I was trying to keep the function signature simple. As the repeatedElement can be obtained from list easily, I didn't see a reason to add one more param. Leaving it as is for now, but let me know if you strongly feel we should pass the `repeatedElement` instead.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#issuecomment-1025235542


   > @SinghAsDev, I did another round of review. This also still needs to update [Iceberg generics](https://github.com/apache/iceberg/blob/master/parquet/src/main/java/org/apache/iceberg/data/parquet/GenericParquetReaders.java), I think.
   
   Thanks for thorough reviews @rdblue ! I thought I already covered the Iceberg generic with this [change](https://github.com/apache/iceberg/pull/3774/files#diff-b8e8443fcec3843e538dbc702d4c131ff58359cb83ccdb211d8679c1d77c16bd). Am I missing something?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r795265677



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetTypeVisitor.java
##########
@@ -66,32 +66,52 @@
     Preconditions.checkArgument(list.getFieldCount() == 1,
         "Invalid list: does not contain single repeated field: %s", list);
 
-    GroupType repeatedElement = list.getFields().get(0).asGroupType();
+    Type repeatedElement = list.getFields().get(0);
     Preconditions.checkArgument(repeatedElement.isRepetition(Type.Repetition.REPEATED),
         "Invalid list: inner group is not repeated");
-    Preconditions.checkArgument(repeatedElement.getFieldCount() <= 1,
-        "Invalid list: repeated group is not a single field: %s", list);
+
+    Type listElement = ParquetSchemaUtil.determineListElementType(list);
+    if (listElement.isRepetition(Type.Repetition.REPEATED)) {
+      return visitTwoLevelList(list, listElement, visitor);
+    } else {
+      return visitThreeLevelList(list, listElement, visitor);
+    }
+  }
+
+  private static <T> T visitTwoLevelList(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    T elementResult = visitListElement(list, listElement, visitor);
+    return visitor.list(list, elementResult);
+  }
+
+  private static <T> T visitThreeLevelList(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    Type repeatedElement = list.getFields().get(0);

Review comment:
       This is okay.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r790075128



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/TypeWithSchemaVisitor.java
##########
@@ -76,16 +81,22 @@
               element = list.fields().get(0);
             }
 
-            visitor.fieldNames.push(repeatedElement.getName());
+            if (!isOldListElementType) {
+              visitor.fieldNames.push(repeatedElement.getName());
+            }

Review comment:
       This needs newlines between control flow blocks. It also warrants a comment about why this is skipping the name handling.
   
   I think I would also handle the 2-level lists separately at a higher level, rather than checking `isOldListElementType` 3 times.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r795267302



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/TypeWithSchemaVisitor.java
##########
@@ -149,6 +142,29 @@
     }
   }
 
+  private static <T> T visitTwoLevelList(Types.ListType iListType, Types.NestedField iListElement, GroupType pListType,
+      Type pListElement, TypeWithSchemaVisitor<T> visitor) {
+    T elementResult = visitField(iListElement, pListElement, visitor);
+

Review comment:
       It should already be fixed. Does the update still have style issue?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r794106516



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetTypeVisitor.java
##########
@@ -66,32 +66,52 @@
     Preconditions.checkArgument(list.getFieldCount() == 1,
         "Invalid list: does not contain single repeated field: %s", list);
 
-    GroupType repeatedElement = list.getFields().get(0).asGroupType();
+    Type repeatedElement = list.getFields().get(0);
     Preconditions.checkArgument(repeatedElement.isRepetition(Type.Repetition.REPEATED),
         "Invalid list: inner group is not repeated");
-    Preconditions.checkArgument(repeatedElement.getFieldCount() <= 1,
-        "Invalid list: repeated group is not a single field: %s", list);
+
+    Type listElement = ParquetSchemaUtil.determineListElementType(list);
+    if (listElement.isRepetition(Type.Repetition.REPEATED)) {
+      return visitTwoLevelList(list, listElement, visitor);
+    } else {
+      return visitThreeLevelList(list, listElement, visitor);
+    }
+  }
+
+  private static <T> T visitTwoLevelList(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    T elementResult = visitListElement(list, listElement, visitor);
+    return visitor.list(list, elementResult);
+  }
+
+  private static <T> T visitThreeLevelList(GroupType list, Type listElement, ParquetTypeVisitor<T> visitor) {
+    Type repeatedElement = list.getFields().get(0);
 
     visitor.beforeRepeatedElement(repeatedElement);
     try {
-      T elementResult = null;
-      if (repeatedElement.getFieldCount() > 0) {
-        Type elementField = repeatedElement.getType(0);
-        visitor.beforeElementField(elementField);
-        try {
-          elementResult = visit(elementField, visitor);
-        } finally {
-          visitor.afterElementField(elementField);
-        }
-      }
+      T elementResult = visitListElement(list, listElement, visitor);
 
       return visitor.list(list, elementResult);
-

Review comment:
       Nit: unnecessary whitespace change.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r794106237



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java
##########
@@ -164,4 +165,62 @@ public Boolean primitive(PrimitiveType primitive) {
     }
   }
 
+  public static Type determineListElementType(GroupType array) {
+    Type repeated = array.getFields().get(0);
+    boolean isOldListElementType = isOldListElementType(array);
+
+    Preconditions.checkArgument(isOldListElementType ||
+            repeated.asGroupType().getFieldCount() <= 1,
+        "Invalid list: repeated group is not a single field: %s", array);

Review comment:
       This will never happen because if the field count is > 1, then `isOldListElementType` will be true.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3774: [Parquet] Add support to read Parquet files written with old 2-level list structures.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3774:
URL: https://github.com/apache/iceberg/pull/3774#discussion_r775695230



##########
File path: parquet/src/test/java/org/apache/iceberg/parquet/TestParquetSchemaUtil.java
##########
@@ -204,6 +204,54 @@ public void testSchemaConversionForHiveStyleLists() {
     Assert.assertEquals("Schema must match", expectedSchema.asStruct(), actualSchema.asStruct());
   }
 
+  @Test
+  public void testSchemaConversionForHiveStyleTwoLevelList() {
+    String messageType =
+            "message root {" +
+            "  required group f0 {" +
+            "    required group f00 (LIST) {" +
+            "      repeated group element {" +
+            "        required int32 f000;" +
+            "        optional int64 f001;" +
+            "      }" +
+            "    }" +
+            "  }" +
+            "}";
+
+    MessageType parquetScehma = MessageTypeParser.parseMessageType(messageType);
+    Schema expectedSchema = new Schema(
+            required(1, "f0", Types.StructType.of(

Review comment:
       Thanks @kbendick , I just did that and updated the PR as well.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org