You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/12/13 00:22:04 UTC

[GitHub] [iceberg] SinghAsDev opened a new pull request #3723: Demonstrate issue where Iceberg's parquet reader silently returns nulls

SinghAsDev opened a new pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723


   Demonstrate issue where Iceberg's parquet reader silently returns nulls for fields with 2-level lists. Also, update migrate tables documentation to let users know of the issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#issuecomment-996993351


   @RussellSpitzer @jackye1995 thanks for your reviews, the diff is updated to address your suggestions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on pull request #3723: Demonstrate issue where Iceberg's parquet reader silently returns nulls

Posted by GitBox <gi...@apache.org>.
kbendick commented on pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#issuecomment-992082883


   > "bag" is added by hive's parquet serde, ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java. https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java#L53. We use this serde from hive's 1.2.1 version. However, even latest versions still use "bag", https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java.
   > 
   > Another thing to note here is that even if we use `list` as name for repeated group of a list, I think Iceberg returns just empty list. I am working on a fix for this, along with backward compatibility support. Would love to get your reviews on that as well.
   
   I’d be happy to take a look when you have a fix. We applied a fix for Iceberg 0.12.1 where we stopped relying on the name provided in the parquet-schema for map fields.
   
   I think something like that would be the best route forward so that we aren’t imposing any restrictions on the parquet provided names as we’ve observed these change across systems and parquet versions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r770977781



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetReader.java
##########
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.File;
+import java.io.FilenameFilter;
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.Objects;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.iceberg.Files;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.mapping.MappingUtil;
+import org.apache.iceberg.parquet.Parquet;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.util.HadoopInputFile;
+import org.apache.parquet.schema.MessageType;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.SaveMode;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.ArrayType;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.junit.AfterClass;
+import org.junit.Assert;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+public class TestParquetReader {
+  private static SparkSession spark = null;
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  @BeforeClass
+  public static void startSpark() {
+    TestParquetReader.spark = SparkSession.builder().master("local[2]")
+            .config("spark.sql.parquet.writeLegacyFormat", true)
+            .getOrCreate();
+  }
+
+  @AfterClass
+  public static void stopSpark() {
+    SparkSession currentSpark = TestParquetReader.spark;
+    TestParquetReader.spark = null;
+    currentSpark.stop();
+  }
+
+  @Test
+  public void testHiveStyleThreeLevelList() throws IOException {
+    File location = new File(temp.getRoot(), "parquetReaderTest");
+    StructType sparkSchema =
+        new StructType(
+            new StructField[]{
+                new StructField(
+                    "col1", new ArrayType(
+                        new StructType(
+                            new StructField[]{
+                                new StructField(
+                                    "col2",
+                                    DataTypes.IntegerType,
+                                    false,
+                                    Metadata.empty())
+                            }), true), true, Metadata.empty())});
+
+    String expectedParquetSchema =
+        "message spark_schema {\n" +
+            "  optional group col1 (LIST) {\n" +
+            "    repeated group bag {\n" +
+            "      optional group array {\n" +
+            "        required int32 col2;\n" +
+            "      }\n" +
+            "    }\n" +
+            "  }\n" +
+            "}\n";
+
+
+    // generate parquet file with required schema
+    List<String> testData = Collections.singletonList("{\"col1\": [{\"col2\": 1}]}");
+    spark.read().schema(sparkSchema).json(

Review comment:
       Could we just run a SQL Insert statement here?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev edited a comment on pull request #3723: Demonstrate issue where Iceberg's parquet reader silently returns nulls

Posted by GitBox <gi...@apache.org>.
SinghAsDev edited a comment on pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#issuecomment-992079085


   "bag" is added by hive's parquet serde, ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java. https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java#L53. We use this serde from hive's 1.2.1 version. However, even latest versions still use "bag", https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java.
   
   Another thing to note here is that even if we use `list` as name for repeated group of a list, I think Iceberg returns just empty list. I am working on a fix for this, along with backward compatibility support. Would love to get your reviews on that as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on pull request #3723: Fix Iceberg's parquet writer returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
kbendick commented on pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#issuecomment-992812205


   > Hive still writes with same names, so I would say it is a problem for any parquet file generated by writing to a hive table with ParquetHiveSerDe.
   > -- - Ashis
   
   I suggest we consider going with the something similar to the map projection element  rename fix that Ryan applied for maps from Parquet 1.11.0 to 1.11.1, where we don't depend on the name at all.
   
   PR for context: https://github.com/apache/iceberg/pull/3309
   
   It might not work because that was for projection, but it looks like you're updating similar areas of the code.
   
   If we can get something like that where we don't assume names at all from parquet, just element structure and types, that would ideally allow for N levels of nesting etc. and further remove us from individual engines choice of element names for parquet (or changes within the parquet spec itself). cc @SinghAsDev 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r771552852



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetReader.java
##########
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.File;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.iceberg.spark.SparkTestBaseWithCatalog;
+import org.apache.iceberg.spark.actions.SparkActions;
+import org.apache.spark.sql.SparkSession;
+import org.junit.AfterClass;
+import org.junit.Assert;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+public class TestParquetReader extends SparkTestBaseWithCatalog {
+  private static SparkSession spark = null;
+  private static String catalogName = "spark_catalog";
+  private static String implementation = SparkSessionCatalog.class.getName();
+  private static Map<String, String> config = ImmutableMap.of(
+      "type", "hive",
+      "default-namespace", "default",
+      "parquet-enabled", "true",
+      "cache-enabled", "false"
+  );
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  public TestParquetReader() {
+    super(catalogName, implementation, config);
+    spark.sessionState().catalogManager().catalog(catalogName);
+  }
+
+  @BeforeClass
+  public static void startSpark() {
+    TestParquetReader.spark = SparkSession.builder().master("local[2]")
+            .config("spark.sql.parquet.writeLegacyFormat", true)
+            .getOrCreate();
+    spark.sessionState().catalogManager().catalog("spark_catalog");
+  }
+
+  @AfterClass
+  public static void stopSpark() {
+    SparkSession currentSpark = TestParquetReader.spark;
+    TestParquetReader.spark = null;
+    currentSpark.stop();
+  }
+
+  @Test

Review comment:
       I thought a bit about how to trick Spark into using the Old Map Red style serde. This is a little difficult because spark checks to see if "parquet" is in the name of the serde and then just uses it's own if it sees that.
   
   So step 1 Make a fake Serde that is the mapred parquet serde but isn't named that
   ```
   import org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe;
   import org.apache.hadoop.hive.serde.serdeConstants;
   import org.apache.hadoop.hive.serde2.SerDeSpec;
   import org.apache.parquet.hadoop.ParquetOutputFormat;
   
   @SerDeSpec(schemaProps = {serdeConstants.LIST_COLUMNS, serdeConstants.LIST_COLUMN_TYPES,
       ParquetOutputFormat.COMPRESSION})
   public class OldParSerde extends ParquetHiveSerDe {
   
   }
   ```
   
   Step 2 : Create and write to our new table using the old SERDE, *then* switch it back to the normal parquet one. This is important because Iceberg needs "parquet" to be in the format for migrate to work
   ```java
       File location = temp.newFolder();
       // Create the table using 
       sql("CREATE TABLE %s (col1 ARRAY<STRUCT<col2 INT>>)" +
           " ROW FORMAT SERDE '%s'" +
           "  STORED AS INPUTFORMAT" +
           " 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'" +
           "  OUTPUTFORMAT" +
           " 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'" +
           " LOCATION '%s'", tableName, OldParSerde.class.getName(), location);
   
       int testValue = 12345;
       sql(String.format("INSERT INTO %s VALUES (ARRAY(STRUCT(%s)))", tableName, testValue));
       List<Object[]> expected = sql(String.format("SELECT * FROM %s", tableName));
   
       sql("ALTER TABLE %s SET SERDE '%s'", tableName, ParquetHiveSerDe.class.getName());
   ```
   
   Step 3: Migrate the now normal hive table
   ```java
    // migrate table
       SparkActions.get().migrateTable(tableName).execute();
   
       // check migrated table is returning expected result
       List<Object[]> results = sql(String.format("SELECT * FROM %s", tableName));
       Assert.assertTrue(results.size() > 0);
       assertEquals("Output must match", expected, results);
   ```
   
   Output :
   ```
   java.lang.AssertionError: Output must match: row 1 col 1 contents should match expected:<[[12345]]> but was:<null>
   	at org.junit.Assert.fail(Assert.java:89)
   ```
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r771676587



##########
File path: site/docs/spark-procedures.md
##########
@@ -334,6 +339,13 @@ CALL catalog_name.system.rewrite_manifests('db.sample', false)
 
 The `snapshot` and `migrate` procedures help test and migrate existing Hive or Spark tables to Iceberg.
 
+!!! Note
+    Parquet files written with Parquet writers that use names other than `list` and `element` for repeated group
+    and element of the list respectively are **read incorrectly as nulls** by Iceberg upto 0.12.1 Iceberg versions.
+    Most commonly such files are written by follow writers. 

Review comment:
       nit: `the following writers:`
   
   btw, have you tried `mkdocs serve`? Not sure if the indentation would be correctly displayed, please verify that.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] uncleGen edited a comment on pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
uncleGen edited a comment on pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#issuecomment-998427166


   IIUC, this pr dose not resolve #3604, right?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r770980781



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetReader.java
##########
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.File;
+import java.io.FilenameFilter;
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.Objects;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.iceberg.Files;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.mapping.MappingUtil;
+import org.apache.iceberg.parquet.Parquet;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.util.HadoopInputFile;
+import org.apache.parquet.schema.MessageType;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.SaveMode;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.ArrayType;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.junit.AfterClass;
+import org.junit.Assert;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+public class TestParquetReader {
+  private static SparkSession spark = null;
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  @BeforeClass
+  public static void startSpark() {
+    TestParquetReader.spark = SparkSession.builder().master("local[2]")
+            .config("spark.sql.parquet.writeLegacyFormat", true)
+            .getOrCreate();
+  }
+
+  @AfterClass
+  public static void stopSpark() {
+    SparkSession currentSpark = TestParquetReader.spark;
+    TestParquetReader.spark = null;
+    currentSpark.stop();
+  }
+
+  @Test
+  public void testHiveStyleThreeLevelList() throws IOException {
+    File location = new File(temp.getRoot(), "parquetReaderTest");
+    StructType sparkSchema =

Review comment:
       So this is a different schema than would be created by
   
   CREATE HIVEFORMAT TABLE x ( col_1 : List<struct<col2: int>>)
   
   and
   
   CREATE TABLE x  ( col_1 : List<struct<col2: int>>) ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #3723: Demonstrate issue where Iceberg's parquet reader silently returns nulls

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r767371571



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetReader.java
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.IOException;
+import java.util.List;
+import org.apache.iceberg.Files;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.parquet.Parquet;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Types;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.junit.Assert;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+import static org.apache.iceberg.types.Types.NestedField.optional;
+
+public class TestParquetReader {
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  @Test
+  public void testParquet2LevelList() throws IOException {
+    Schema icebergSchema = new Schema(
+            optional(1, "key", Types.StringType.get()),
+            optional(2, "val", Types.ListType.ofRequired(3, Types.StructType.of(
+                              optional(4, "a1", Types.StringType.get()),
+                              optional(5, "a2", Types.StringType.get())
+                      )
+                    )));
+    List<InternalRow> rows;
+
+    /* Using a static file rather than generating test data in test, as parquet writers in Iceberg only supports
+     * three level lists. The twoLevelList.pq is a parquet file that contains following Parquet schema.
+     * message hive_schema {
+     *  optional binary key (STRING);
+     *  optional group val (LIST) {
+     *    repeated group bag {
+     *      optional group array_element {
+     *        optional binary a1 (STRING);
+     *        optional binary a2 (STRING);
+     *      }
+     *    }
+     *  }
+     * }
+     *
+     * It contains only one row. Below is the json dump of the file.
+     * {"key":"k1","val":{"bag":[{"array_element":{"a1":"a","a2":"b"}}]}}
+     */

Review comment:
       I think your schema is off vs your JSON element:
   
   It looks like `val` should be a list, but val in the JSON is a struct.
   
   I might have that wrong though. I'm still working on a reproduction.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #3723: Fix Iceberg's parquet writer returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r768280621



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ApplyNameMapping.java
##########
@@ -88,6 +96,31 @@ public Type primitive(PrimitiveType primitive) {
     return field == null ? primitive : primitive.withId(field.id());
   }
 
+  @Override
+  public void beforeElementField(Type element) {
+    super.beforeElementField(makeElement(element));
+  }
+
+  @Override
+  public void afterElementField(Type element) {
+    super.afterElementField(makeElement(element));
+  }
+
+  private Type makeElement(Type element) {
+    // List's element in 3-level lists can be named differently across different parquet writers.
+    // For example, hive names it "array_element", whereas new parquet writers names it as "element".
+    if (element.getName().equals("element") || element.isPrimitive()) {
+      return element;
+    }

Review comment:
       Yeah I'll absolutely give this another pass. Sorry I was head down in something else today.
   
   I'll also look for somebody to start the tests for you unless you'd like to wait. 😄 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r771503055



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetReader.java
##########
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.File;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.iceberg.spark.SparkTestBaseWithCatalog;
+import org.apache.iceberg.spark.actions.SparkActions;
+import org.apache.spark.sql.SparkSession;
+import org.junit.AfterClass;
+import org.junit.Assert;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+public class TestParquetReader extends SparkTestBaseWithCatalog {
+  private static SparkSession spark = null;
+  private static String catalogName = "spark_catalog";
+  private static String implementation = SparkSessionCatalog.class.getName();
+  private static Map<String, String> config = ImmutableMap.of(
+      "type", "hive",
+      "default-namespace", "default",
+      "parquet-enabled", "true",
+      "cache-enabled", "false"
+  );
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  public TestParquetReader() {
+    super(catalogName, implementation, config);
+    spark.sessionState().catalogManager().catalog(catalogName);
+  }
+
+  @BeforeClass
+  public static void startSpark() {
+    TestParquetReader.spark = SparkSession.builder().master("local[2]")
+            .config("spark.sql.parquet.writeLegacyFormat", true)
+            .getOrCreate();
+    spark.sessionState().catalogManager().catalog("spark_catalog");
+  }
+
+  @AfterClass
+  public static void stopSpark() {
+    SparkSession currentSpark = TestParquetReader.spark;
+    TestParquetReader.spark = null;
+    currentSpark.stop();
+  }
+
+  @Test

Review comment:
       I tried out these tests and they all pass without the above pr, so I think spark may not be creating the schema in the same way native hive is?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r771493825



##########
File path: parquet/src/test/java/org/apache/iceberg/parquet/TestParquetSchemaUtil.java
##########
@@ -181,6 +182,28 @@ public void testSchemaConversionWithoutAssigningIds() {
     Assert.assertEquals("Schema must match", expectedSchema.asStruct(), actualSchema.asStruct());
   }
 
+  @Test
+  public void testSchemaConversionForHiveStyleLists() {
+    String parquetSchemaString =
+        "message spark_schema {\n" +

Review comment:
       This schema cannot be constructed using the Parquet API directly? I was hoping this would look similar to the tests above that create the schema using Parquet directly.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r771588922



##########
File path: parquet/src/test/java/org/apache/iceberg/parquet/TestParquetSchemaUtil.java
##########
@@ -181,6 +182,28 @@ public void testSchemaConversionWithoutAssigningIds() {
     Assert.assertEquals("Schema must match", expectedSchema.asStruct(), actualSchema.asStruct());
   }
 
+  @Test
+  public void testSchemaConversionForHiveStyleLists() {
+    String parquetSchemaString =
+        "message spark_schema {\n" +

Review comment:
       Unfortunately no, as the moment we call list, it will use 3-level list with list and element names




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#issuecomment-997056355


   Thanks so much to @SinghAsDev for the PR and  @jackye1995 and @kbendick for the review


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer merged pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer merged pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r770981780



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetReader.java
##########
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.File;
+import java.io.FilenameFilter;
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.Objects;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.iceberg.Files;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.mapping.MappingUtil;
+import org.apache.iceberg.parquet.Parquet;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.util.HadoopInputFile;
+import org.apache.parquet.schema.MessageType;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.SaveMode;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.ArrayType;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.junit.AfterClass;
+import org.junit.Assert;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+public class TestParquetReader {
+  private static SparkSession spark = null;
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  @BeforeClass
+  public static void startSpark() {
+    TestParquetReader.spark = SparkSession.builder().master("local[2]")
+            .config("spark.sql.parquet.writeLegacyFormat", true)
+            .getOrCreate();
+  }
+
+  @AfterClass
+  public static void stopSpark() {
+    SparkSession currentSpark = TestParquetReader.spark;
+    TestParquetReader.spark = null;
+    currentSpark.stop();
+  }
+
+  @Test
+  public void testHiveStyleThreeLevelList() throws IOException {
+    File location = new File(temp.getRoot(), "parquetReaderTest");
+    StructType sparkSchema =
+        new StructType(
+            new StructField[]{
+                new StructField(
+                    "col1", new ArrayType(
+                        new StructType(
+                            new StructField[]{
+                                new StructField(
+                                    "col2",
+                                    DataTypes.IntegerType,
+                                    false,
+                                    Metadata.empty())
+                            }), true), true, Metadata.empty())});
+
+    String expectedParquetSchema =
+        "message spark_schema {\n" +
+            "  optional group col1 (LIST) {\n" +
+            "    repeated group bag {\n" +
+            "      optional group array {\n" +
+            "        required int32 col2;\n" +
+            "      }\n" +
+            "    }\n" +
+            "  }\n" +
+            "}\n";
+
+
+    // generate parquet file with required schema
+    List<String> testData = Collections.singletonList("{\"col1\": [{\"col2\": 1}]}");
+    spark.read().schema(sparkSchema).json(
+            JavaSparkContext.fromSparkContext(spark.sparkContext()).parallelize(testData))
+        .coalesce(1).write().format("parquet").mode(SaveMode.Append).save(location.getPath());
+
+    File parquetFile = Arrays.stream(Objects.requireNonNull(location.listFiles(new FilenameFilter() {
+      @Override
+      public boolean accept(File dir, String name) {
+        return name.endsWith("parquet");
+      }
+    }))).findAny().get();
+
+    // verify generated parquet file has expected schema
+    ParquetFileReader pqReader = ParquetFileReader.open(
+        HadoopInputFile.fromPath(new Path(parquetFile.getPath()), new Configuration()));
+    MessageType schema = pqReader.getFooter().getFileMetaData().getSchema();
+    Assert.assertEquals(expectedParquetSchema, schema.toString());
+
+    // read from Iceberg's parquet reader and ensure data is read correctly into Spark's internal row
+    Schema icebergSchema = SparkSchemaUtil.convert(sparkSchema);

Review comment:
       I think we should do a test to make sure files imported are read correctly, instead of using the internal methods doing an import files into an iceberg table and showing that the rows are correct? Part of add_files or Migrate tests?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev edited a comment on pull request #3723: Demonstrate issue where Iceberg's parquet reader silently returns nulls

Posted by GitBox <gi...@apache.org>.
SinghAsDev edited a comment on pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#issuecomment-992079085


   `bag` is added by hive's [schema converter](https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/HiveSchemaConverter.java#L125) (defined [here](https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java#L53). We use this serde from hive's 1.2.1 version. However, even latest versions still use `bag` [here](https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java).
   
   Another thing to note here is that even if we use `list` as name for repeated group of a list, I think Iceberg returns just empty list. I am working on a fix for this, along with backward compatibility support. Would love to get your reviews on that as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #3723: Demonstrate issue where Iceberg's parquet reader silently returns nulls

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r767372037



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetReader.java
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.IOException;
+import java.util.List;
+import org.apache.iceberg.Files;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.parquet.Parquet;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Types;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.junit.Assert;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+import static org.apache.iceberg.types.Types.NestedField.optional;
+
+public class TestParquetReader {
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  @Test
+  public void testParquet2LevelList() throws IOException {
+    Schema icebergSchema = new Schema(
+            optional(1, "key", Types.StringType.get()),
+            optional(2, "val", Types.ListType.ofRequired(3, Types.StructType.of(
+                              optional(4, "a1", Types.StringType.get()),
+                              optional(5, "a2", Types.StringType.get())
+                      )
+                    )));
+    List<InternalRow> rows;
+
+    /* Using a static file rather than generating test data in test, as parquet writers in Iceberg only supports
+     * three level lists. The twoLevelList.pq is a parquet file that contains following Parquet schema.
+     * message hive_schema {
+     *  optional binary key (STRING);
+     *  optional group val (LIST) {
+     *    repeated group bag {
+     *      optional group array_element {
+     *        optional binary a1 (STRING);
+     *        optional binary a2 (STRING);
+     *      }
+     *    }
+     *  }
+     * }
+     *
+     * It contains only one row. Below is the json dump of the file.
+     * {"key":"k1","val":{"bag":[{"array_element":{"a1":"a","a2":"b"}}]}}
+     */

Review comment:
       When I read in the element from the file, using spark, it doesn't have the same shape as the JSON that you showed.
   
   ```
   scala> val df = spark.read.parquet("./spark/v3.2/spark/src/test/resources/twoLevelList.pq")
   df: org.apache.spark.sql.DataFrame = [key: string, val: array<struct<a1:string,a2:string>>]
   
   scala> df.printSchema
   root
    |-- key: string (nullable = true)
    |-- val: array (nullable = true)
    |    |-- element: struct (containsNull = true)
    |    |    |-- a1: string (nullable = true)
    |    |    |-- a2: string (nullable = true)
   
   
   scala> df.toJSON.show(false)
   +----------------------------------------+
   |value                                   |
   +----------------------------------------+
   |{"key":"k1","val":[{"a1":"a","a2":"b"}]}|
   +----------------------------------------+
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on pull request #3723: Demonstrate issue where Iceberg's parquet reader silently returns nulls

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#issuecomment-992079085


   "bag" is added by hive's parquet serde, ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java. https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java#L53. We use this serde from hive's 1.2.1 version. However, even latest versions still use "bag", https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java.
   
   Another thing to note here is that even if this is list, I think Iceberg returns just empty list. I am working on a fix for this, along with backward compatibility support. Would love to get your reviews on that as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] uncleGen commented on pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
uncleGen commented on pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#issuecomment-998427166


   IUCC, this pr dose not resolve #3604, right?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r770983167



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetReader.java
##########
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.File;
+import java.io.FilenameFilter;
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.Objects;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.iceberg.Files;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.mapping.MappingUtil;
+import org.apache.iceberg.parquet.Parquet;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.util.HadoopInputFile;
+import org.apache.parquet.schema.MessageType;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.SaveMode;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.ArrayType;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.junit.AfterClass;
+import org.junit.Assert;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+public class TestParquetReader {
+  private static SparkSession spark = null;
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  @BeforeClass
+  public static void startSpark() {
+    TestParquetReader.spark = SparkSession.builder().master("local[2]")
+            .config("spark.sql.parquet.writeLegacyFormat", true)
+            .getOrCreate();
+  }
+
+  @AfterClass
+  public static void stopSpark() {
+    SparkSession currentSpark = TestParquetReader.spark;
+    TestParquetReader.spark = null;
+    currentSpark.stop();
+  }
+
+  @Test
+  public void testHiveStyleThreeLevelList() throws IOException {
+    File location = new File(temp.getRoot(), "parquetReaderTest");
+    StructType sparkSchema =
+        new StructType(
+            new StructField[]{
+                new StructField(
+                    "col1", new ArrayType(
+                        new StructType(
+                            new StructField[]{
+                                new StructField(
+                                    "col2",
+                                    DataTypes.IntegerType,
+                                    false,
+                                    Metadata.empty())
+                            }), true), true, Metadata.empty())});
+
+    String expectedParquetSchema =
+        "message spark_schema {\n" +
+            "  optional group col1 (LIST) {\n" +
+            "    repeated group bag {\n" +
+            "      optional group array {\n" +
+            "        required int32 col2;\n" +
+            "      }\n" +
+            "    }\n" +
+            "  }\n" +
+            "}\n";
+
+
+    // generate parquet file with required schema
+    List<String> testData = Collections.singletonList("{\"col1\": [{\"col2\": 1}]}");
+    spark.read().schema(sparkSchema).json(
+            JavaSparkContext.fromSparkContext(spark.sparkContext()).parallelize(testData))
+        .coalesce(1).write().format("parquet").mode(SaveMode.Append).save(location.getPath());
+
+    File parquetFile = Arrays.stream(Objects.requireNonNull(location.listFiles(new FilenameFilter() {
+      @Override
+      public boolean accept(File dir, String name) {
+        return name.endsWith("parquet");
+      }
+    }))).findAny().get();
+
+    // verify generated parquet file has expected schema
+    ParquetFileReader pqReader = ParquetFileReader.open(
+        HadoopInputFile.fromPath(new Path(parquetFile.getPath()), new Configuration()));
+    MessageType schema = pqReader.getFooter().getFileMetaData().getSchema();
+    Assert.assertEquals(expectedParquetSchema, schema.toString());
+
+    // read from Iceberg's parquet reader and ensure data is read correctly into Spark's internal row
+    Schema icebergSchema = SparkSchemaUtil.convert(sparkSchema);

Review comment:
       We should also have the low level tests in TestParquetSchemaUtil.java




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r771552852



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetReader.java
##########
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.File;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.iceberg.spark.SparkTestBaseWithCatalog;
+import org.apache.iceberg.spark.actions.SparkActions;
+import org.apache.spark.sql.SparkSession;
+import org.junit.AfterClass;
+import org.junit.Assert;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+public class TestParquetReader extends SparkTestBaseWithCatalog {
+  private static SparkSession spark = null;
+  private static String catalogName = "spark_catalog";
+  private static String implementation = SparkSessionCatalog.class.getName();
+  private static Map<String, String> config = ImmutableMap.of(
+      "type", "hive",
+      "default-namespace", "default",
+      "parquet-enabled", "true",
+      "cache-enabled", "false"
+  );
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  public TestParquetReader() {
+    super(catalogName, implementation, config);
+    spark.sessionState().catalogManager().catalog(catalogName);
+  }
+
+  @BeforeClass
+  public static void startSpark() {
+    TestParquetReader.spark = SparkSession.builder().master("local[2]")
+            .config("spark.sql.parquet.writeLegacyFormat", true)
+            .getOrCreate();
+    spark.sessionState().catalogManager().catalog("spark_catalog");
+  }
+
+  @AfterClass
+  public static void stopSpark() {
+    SparkSession currentSpark = TestParquetReader.spark;
+    TestParquetReader.spark = null;
+    currentSpark.stop();
+  }
+
+  @Test

Review comment:
       I thought a bit about how to trick Spark into using the Old Map Red style serde. This is a little difficult because spark checks to see if "parquet" is in the name of the serde and then just uses it's own if it sees that.
   
   So step 1 Make a fake Serde that is the mapred parquet serde but isn't named that
   ```
   import org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe;
   import org.apache.hadoop.hive.serde.serdeConstants;
   import org.apache.hadoop.hive.serde2.SerDeSpec;
   import org.apache.parquet.hadoop.ParquetOutputFormat;
   
   @SerDeSpec(schemaProps = {serdeConstants.LIST_COLUMNS, serdeConstants.LIST_COLUMN_TYPES,
       ParquetOutputFormat.COMPRESSION})
   public class OldParSerde extends ParquetHiveSerDe {
   
   }
   ```
   
   Step 2 : Create and write to our new table using the old SERDE, *then* switch it back to the normal parquet one. This is important because Iceberg needs "parquet" to be in the format for migrate to work
   ```java
       File location = temp.newFolder();
       // Create the table using 
       sql("CREATE TABLE %s (col1 ARRAY<STRUCT<col2 INT>>)" +
           " ROW FORMAT SERDE '%s'" +
           "  STORED AS INPUTFORMAT" +
           " 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'" +
           "  OUTPUTFORMAT" +
           " 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'" +
           " LOCATION '%s'", tableName, OldParSerde.class.getName(), location);
   
       int testValue = 12345;
       sql(String.format("INSERT INTO %s VALUES (ARRAY(STRUCT(%s)))", tableName, testValue));
       List<Object[]> expected = sql(String.format("SELECT * FROM %s", tableName));
   
       sql("ALTER TABLE %s SET SERDE '%s'", tableName, ParquetHiveSerDe.class.getName());
   ```
   
   Here is the file generated
   ```
   file:            file:/var/folders/yl/6cwgks7919s1td2mfdq86cbm0000gn/T/junit5001523939508400535/junit1467988604906296682/part-00000-24c39fe2-a264-4e77-b23e-6b89a5883986-c000
   creator:         parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94)
   
   file schema:     hive_schema
   --------------------------------------------------------------------------------
   col1:            OPTIONAL F:1
   .bag:            REPEATED F:1
   ..array_element: OPTIONAL F:1
   ...col2:         OPTIONAL INT32 R:1 D:4
   
   row group 1:     RC:1 TS:41 OFFSET:4
   --------------------------------------------------------------------------------
   col1:
   .bag:
   ..array_element:
   ...col2:          INT32 SNAPPY DO:0 FPO:4 SZ:43/41/0.95 VC:1 ENC:PLAIN,RLE ST:[min: 12345, max: 12345, num_nulls: 0]
   ```
   
   Step 3: Migrate the now normal hive table
   ```java
    // migrate table
       SparkActions.get().migrateTable(tableName).execute();
   
       // check migrated table is returning expected result
       List<Object[]> results = sql(String.format("SELECT * FROM %s", tableName));
       Assert.assertTrue(results.size() > 0);
       assertEquals("Output must match", expected, results);
   ```
   
   Output :
   ```
   java.lang.AssertionError: Output must match: row 1 col 1 contents should match expected:<[[12345]]> but was:<null>
   	at org.junit.Assert.fail(Assert.java:89)
   ```
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r771688135



##########
File path: site/docs/spark-procedures.md
##########
@@ -334,6 +339,13 @@ CALL catalog_name.system.rewrite_manifests('db.sample', false)
 
 The `snapshot` and `migrate` procedures help test and migrate existing Hive or Spark tables to Iceberg.
 
+!!! Note
+    Parquet files written with Parquet writers that use names other than `list` and `element` for repeated group
+    and element of the list respectively are **read incorrectly as nulls** by Iceberg upto 0.12.1 Iceberg versions.
+    Most commonly such files are written by follow writers. 

Review comment:
       Updated docs to take of nit comment and made sure it looks as expected with `mkdocs serve` as well.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r771589244



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetReader.java
##########
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.File;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.iceberg.spark.SparkTestBaseWithCatalog;
+import org.apache.iceberg.spark.actions.SparkActions;
+import org.apache.spark.sql.SparkSession;
+import org.junit.AfterClass;
+import org.junit.Assert;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+public class TestParquetReader extends SparkTestBaseWithCatalog {

Review comment:
       I see, I can move them.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev edited a comment on pull request #3723: Demonstrate issue where Iceberg's parquet reader silently returns nulls

Posted by GitBox <gi...@apache.org>.
SinghAsDev edited a comment on pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#issuecomment-992053535


   @kbendick thanks for quick review.
   
   > Does this also happen with OSS Iceberg?
   
   Yes.
   
   > It would be great if instead of using a checked in file, Spark was used to generate the data instead (although I know that doesn't capture every case).
   
   Yea, the issue is that this issue only happens for parquet files generated from old non iceberg parquet writer. I has also mentioned this as a comment in the test file.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on pull request #3723: Demonstrate issue where Iceberg's parquet reader silently returns nulls

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#issuecomment-992053535


   @kbendick thanks for quick review.
   
   > Does this also happen with OSS Iceberg?
   Yes.
   
   > It would be great if instead of using a checked in file, Spark was used to generate the data instead (although I know that doesn't capture every case).
   Yea, the issue is that this issue only happens for parquet files generated from old non iceberg parquet writer. I has also mentioned this as a comment in the test file.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #3723: Fix Iceberg's parquet writer returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r768281192



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ApplyNameMapping.java
##########
@@ -88,6 +96,31 @@ public Type primitive(PrimitiveType primitive) {
     return field == null ? primitive : primitive.withId(field.id());
   }
 
+  @Override
+  public void beforeElementField(Type element) {
+    super.beforeElementField(makeElement(element));
+  }
+
+  @Override
+  public void afterElementField(Type element) {
+    super.afterElementField(makeElement(element));
+  }
+
+  private Type makeElement(Type element) {
+    // List's element in 3-level lists can be named differently across different parquet writers.
+    // For example, hive names it "array_element", whereas new parquet writers names it as "element".
+    if (element.getName().equals("element") || element.isPrimitive()) {
+      return element;
+    }

Review comment:
       And yeah, understood the projection related changes, while similar in concept, don't necessarily apply here unfortunately.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3723: Fix Iceberg's parquet writer returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r768309087



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ApplyNameMapping.java
##########
@@ -88,6 +96,31 @@ public Type primitive(PrimitiveType primitive) {
     return field == null ? primitive : primitive.withId(field.id());
   }
 
+  @Override
+  public void beforeElementField(Type element) {
+    super.beforeElementField(makeElement(element));
+  }
+
+  @Override
+  public void afterElementField(Type element) {
+    super.afterElementField(makeElement(element));
+  }
+
+  private Type makeElement(Type element) {
+    // List's element in 3-level lists can be named differently across different parquet writers.
+    // For example, hive names it "array_element", whereas new parquet writers names it as "element".
+    if (element.getName().equals("element") || element.isPrimitive()) {
+      return element;
+    }

Review comment:
       Thanks @kbendick , getting the tests started will definitely be helpful. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #3723: Fix Iceberg's parquet writer returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r768068721



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ApplyNameMapping.java
##########
@@ -88,6 +96,31 @@ public Type primitive(PrimitiveType primitive) {
     return field == null ? primitive : primitive.withId(field.id());
   }
 
+  @Override
+  public void beforeElementField(Type element) {
+    super.beforeElementField(makeElement(element));
+  }
+
+  @Override
+  public void afterElementField(Type element) {
+    super.afterElementField(makeElement(element));
+  }
+
+  private Type makeElement(Type element) {
+    // List's element in 3-level lists can be named differently across different parquet writers.
+    // For example, hive names it "array_element", whereas new parquet writers names it as "element".
+    if (element.getName().equals("element") || element.isPrimitive()) {
+      return element;
+    }
+

Review comment:
       If it's po




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #3723: Fix Iceberg's parquet writer returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r769067559



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetReader.java
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.IOException;
+import java.util.List;
+import org.apache.iceberg.Files;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.parquet.Parquet;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Types;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.junit.Assert;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+import static org.apache.iceberg.types.Types.NestedField.optional;
+
+public class TestParquetReader {
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  @Test
+  public void testParquet2LevelList() throws IOException {
+    Schema icebergSchema = new Schema(
+            optional(1, "key", Types.StringType.get()),
+            optional(2, "val", Types.ListType.ofRequired(3, Types.StructType.of(
+                              optional(4, "a1", Types.StringType.get()),
+                              optional(5, "a2", Types.StringType.get())
+                      )
+                    )));
+    List<InternalRow> rows;
+
+    /* Using a static file rather than generating test data in test, as parquet writers in Iceberg only supports
+     * three level lists. The twoLevelList.pq is a parquet file that contains following Parquet schema.
+     * message hive_schema {
+     *  optional binary key (STRING);
+     *  optional group val (LIST) {
+     *    repeated group bag {
+     *      optional group array_element {
+     *        optional binary a1 (STRING);
+     *        optional binary a2 (STRING);
+     *      }
+     *    }
+     *  }
+     * }
+     *
+     * It contains only one row. Below is the json dump of the file.
+     * {"key":"k1","val":{"bag":[{"array_element":{"a1":"a","a2":"b"}}]}}
+     */

Review comment:
       Yeah that is correct. I did get the json you posted here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #3723: Fix Iceberg's parquet writer returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r769075621



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ApplyNameMapping.java
##########
@@ -88,6 +96,31 @@ public Type primitive(PrimitiveType primitive) {
     return field == null ? primitive : primitive.withId(field.id());
   }
 
+  @Override
+  public void beforeElementField(Type element) {
+    super.beforeElementField(makeElement(element));
+  }
+
+  @Override
+  public void afterElementField(Type element) {
+    super.afterElementField(makeElement(element));
+  }
+
+  private Type makeElement(Type element) {
+    // List's element in 3-level lists can be named differently across different parquet writers.
+    // For example, hive names it "array_element", whereas new parquet writers names it as "element".
+    if (element.getName().equals("element") || element.isPrimitive()) {
+      return element;
+    }

Review comment:
       The updated tests look great @SinghAsDev! Thank you!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3723: Fix Iceberg's parquet writer returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r768239122



##########
File path: site/docs/spark-procedures.md
##########
@@ -334,6 +334,10 @@ CALL catalog_name.system.rewrite_manifests('db.sample', false)
 
 The `snapshot` and `migrate` procedures help test and migrate existing Hive or Spark tables to Iceberg.
 
+**Note** Parquet files written with Parquet writers that use names other than `list` and `element` for repeated group
+and element of the list respectively are **read incorrectly** by Iceberg upto 0.12.1 Iceberg versions. Parquet files
+generated by Hive fall in this category.

Review comment:
       Yea, some details [here](https://github.com/apache/iceberg/pull/3723#issuecomment-992079085)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3723: Demonstrate issue where Iceberg's parquet reader silently returns nulls

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r767376318



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetReader.java
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.IOException;
+import java.util.List;
+import org.apache.iceberg.Files;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.parquet.Parquet;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Types;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.junit.Assert;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+import static org.apache.iceberg.types.Types.NestedField.optional;
+
+public class TestParquetReader {
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  @Test
+  public void testParquet2LevelList() throws IOException {
+    Schema icebergSchema = new Schema(
+            optional(1, "key", Types.StringType.get()),
+            optional(2, "val", Types.ListType.ofRequired(3, Types.StructType.of(
+                              optional(4, "a1", Types.StringType.get()),
+                              optional(5, "a2", Types.StringType.get())
+                      )
+                    )));
+    List<InternalRow> rows;
+
+    /* Using a static file rather than generating test data in test, as parquet writers in Iceberg only supports
+     * three level lists. The twoLevelList.pq is a parquet file that contains following Parquet schema.
+     * message hive_schema {
+     *  optional binary key (STRING);
+     *  optional group val (LIST) {
+     *    repeated group bag {
+     *      optional group array_element {
+     *        optional binary a1 (STRING);
+     *        optional binary a2 (STRING);
+     *      }
+     *    }
+     *  }
+     * }
+     *
+     * It contains only one row. Below is the json dump of the file.
+     * {"key":"k1","val":{"bag":[{"array_element":{"a1":"a","a2":"b"}}]}}
+     */

Review comment:
       Yea, it depends on how you are getting the json dump. If you use parquet-tools and use `cat --json` you would get the json I had pasted here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#issuecomment-996382892


   @RussellSpitzer mind giving it another pass, thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r771504231



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetReader.java
##########
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.File;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.iceberg.spark.SparkTestBaseWithCatalog;
+import org.apache.iceberg.spark.actions.SparkActions;
+import org.apache.spark.sql.SparkSession;
+import org.junit.AfterClass;
+import org.junit.Assert;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+public class TestParquetReader extends SparkTestBaseWithCatalog {
+  private static SparkSession spark = null;
+  private static String catalogName = "spark_catalog";
+  private static String implementation = SparkSessionCatalog.class.getName();
+  private static Map<String, String> config = ImmutableMap.of(
+      "type", "hive",
+      "default-namespace", "default",
+      "parquet-enabled", "true",
+      "cache-enabled", "false"
+  );
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  public TestParquetReader() {
+    super(catalogName, implementation, config);
+    spark.sessionState().catalogManager().catalog(catalogName);
+  }
+
+  @BeforeClass
+  public static void startSpark() {
+    TestParquetReader.spark = SparkSession.builder().master("local[2]")
+            .config("spark.sql.parquet.writeLegacyFormat", true)
+            .getOrCreate();
+    spark.sessionState().catalogManager().catalog("spark_catalog");
+  }
+
+  @AfterClass
+  public static void stopSpark() {
+    SparkSession currentSpark = TestParquetReader.spark;
+    TestParquetReader.spark = null;
+    currentSpark.stop();
+  }
+
+  @Test

Review comment:
       ```parquet-tools meta /var/folders/yl/6cwgks7919s1td2mfdq86cbm0000gn/T/junit5012408329488933688/junit3620601311176219234
   file:        file:/var/folders/yl/6cwgks7919s1td2mfdq86cbm0000gn/T/junit5012408329488933688/junit3620601311176219234/part-00000-14bc6af7-9778-42c9-9fb2-703938925eb9-c000.snappy.parquet
   creator:     parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94)
   extra:       org.apache.spark.version = 3.2.0
   extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"col1","type":{"type":"array","elementType":{"type":"struct","fields":[{"name":"col2","type":"integer","nullable":true,"metadata":{}}]},"containsNull":true},"nullable":false,"metadata":{}}]}
   
   file schema: spark_schema
   --------------------------------------------------------------------------------
   col1:        REQUIRED F:1
   .list:       REPEATED F:1
   ..element:   OPTIONAL F:1
   ...col2:     OPTIONAL INT32 R:1 D:3
   
   row group 1: RC:1 TS:40 OFFSET:4
   --------------------------------------------------------------------------------
   col1:
   .list:
   ..element:
   ...col2:      INT32 SNAPPY DO:0 FPO:4 SZ:42/40/0.95 VC:1 ENC:RLE,PLAIN ST:[min: 12345, max: 12345, num_nulls: 0]```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3723: Fix Iceberg's parquet writer returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r768241931



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ApplyNameMapping.java
##########
@@ -88,6 +96,31 @@ public Type primitive(PrimitiveType primitive) {
     return field == null ? primitive : primitive.withId(field.id());
   }
 
+  @Override
+  public void beforeElementField(Type element) {
+    super.beforeElementField(makeElement(element));
+  }
+
+  @Override
+  public void afterElementField(Type element) {
+    super.afterElementField(makeElement(element));
+  }
+
+  private Type makeElement(Type element) {
+    // List's element in 3-level lists can be named differently across different parquet writers.
+    // For example, hive names it "array_element", whereas new parquet writers names it as "element".
+    if (element.getName().equals("element") || element.isPrimitive()) {
+      return element;
+    }

Review comment:
       @kbendick I also updated test to dynamically generate parquet file in required format. Mind giving it another pass.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on pull request #3723: Fix Iceberg's parquet writer returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#issuecomment-992651581


   Hive still writes with same names, so I would say it is a problem for any
   parquet file generated by writing to a hive table with ParquetHiveSerDe.
   
   On Mon, Dec 13, 2021 at 6:02 AM Russell Spitzer ***@***.***>
   wrote:
   
   > ***@***.**** commented on this pull request.
   > ------------------------------
   >
   > In site/docs/spark-procedures.md
   > <https://github.com/apache/iceberg/pull/3723#discussion_r767783605>:
   >
   > > @@ -334,6 +334,10 @@ CALL catalog_name.system.rewrite_manifests('db.sample', false)
   >
   >  The `snapshot` and `migrate` procedures help test and migrate existing Hive or Spark tables to Iceberg.
   >
   > +**Note** Parquet files written with Parquet writers that use names other than `list` and `element` for repeated group
   > +and element of the list respectively are **read incorrectly** by Iceberg upto 0.12.1 Iceberg versions. Parquet files
   > +generated by Hive fall in this category.
   >
   > Can we try to get a version bound here? Ideally the oss hive version where
   > this starts and stops being an issue?
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/iceberg/pull/3723#pullrequestreview-830274905>,
   > or unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/ABFQCZPI7FO53GWJLMU5M23UQX4FZANCNFSM5J43VMOA>
   > .
   > Triage notifications on the go with GitHub Mobile for iOS
   > <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
   > or Android
   > <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
   >
   >
   -- 
   - Ashish
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #3723: Fix Iceberg's parquet writer returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r768064340



##########
File path: site/docs/spark-procedures.md
##########
@@ -334,6 +334,10 @@ CALL catalog_name.system.rewrite_manifests('db.sample', false)
 
 The `snapshot` and `migrate` procedures help test and migrate existing Hive or Spark tables to Iceberg.
 
+**Note** Parquet files written with Parquet writers that use names other than `list` and `element` for repeated group
+and element of the list respectively are **read incorrectly** by Iceberg upto 0.12.1 Iceberg versions. Parquet files
+generated by Hive fall in this category.

Review comment:
       I checked 3.x and it's still called `bag`.
   
   I suggest we consider going with the something similar to the map projection element  rename fix that Ryan applied for maps from Parquet 1.11.0 to 1.11.1, where we don't depend on the name at all.
   
   PR for context: https://github.com/apache/iceberg/pull/3309
   
   It might not work because that was for projection, but if we can get something like that where we don't assume names, that would ideally allow for N levels of nesting etc. cc @SinghAsDev 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on pull request #3723: Fix Iceberg's parquet writer returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#issuecomment-993149367


   @kbendick we were using Hive on MR with Hive 1.2.1 and so still have tons of files written by it. However, as we discussed this is not specific to old versions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#issuecomment-997057690


   Thanks for reviews @RussellSpitzer @jackye1995 @kbendick !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#issuecomment-996064884


   > "bag" is added by hive's parquet serde, ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java
   
   So sounds like the issue will only come when users load table through Hive and import to Iceberg through add-files procedure. Technically users could have a way out by using migrate-table procedure to rewrite the entire table using the Iceberg writer instead.
   
   But this seems to impact quite a lot of users for fast Hive to Iceberg migration, so I will add this in 0.13 milestone for now, please let me know if anyone objects.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r771552852



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetReader.java
##########
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.File;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.iceberg.spark.SparkTestBaseWithCatalog;
+import org.apache.iceberg.spark.actions.SparkActions;
+import org.apache.spark.sql.SparkSession;
+import org.junit.AfterClass;
+import org.junit.Assert;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+public class TestParquetReader extends SparkTestBaseWithCatalog {
+  private static SparkSession spark = null;
+  private static String catalogName = "spark_catalog";
+  private static String implementation = SparkSessionCatalog.class.getName();
+  private static Map<String, String> config = ImmutableMap.of(
+      "type", "hive",
+      "default-namespace", "default",
+      "parquet-enabled", "true",
+      "cache-enabled", "false"
+  );
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  public TestParquetReader() {
+    super(catalogName, implementation, config);
+    spark.sessionState().catalogManager().catalog(catalogName);
+  }
+
+  @BeforeClass
+  public static void startSpark() {
+    TestParquetReader.spark = SparkSession.builder().master("local[2]")
+            .config("spark.sql.parquet.writeLegacyFormat", true)
+            .getOrCreate();
+    spark.sessionState().catalogManager().catalog("spark_catalog");
+  }
+
+  @AfterClass
+  public static void stopSpark() {
+    SparkSession currentSpark = TestParquetReader.spark;
+    TestParquetReader.spark = null;
+    currentSpark.stop();
+  }
+
+  @Test

Review comment:
       I thought a bit about how to trick Spark into using the Old Map Red style serde. This is a little difficult because spark checks to see if "parquet" is in the name of the serde and then just uses it's own serde/format if it sees that.
   
   So step 1 Make a fake Serde that is the mapred parquet serde but isn't named that and doesn't have "parquet" in the name
   ```
   import org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe;
   import org.apache.hadoop.hive.serde.serdeConstants;
   import org.apache.hadoop.hive.serde2.SerDeSpec;
   import org.apache.parquet.hadoop.ParquetOutputFormat;
   
   @SerDeSpec(schemaProps = {serdeConstants.LIST_COLUMNS, serdeConstants.LIST_COLUMN_TYPES,
       ParquetOutputFormat.COMPRESSION})
   public class OldParSerde extends ParquetHiveSerDe {
   
   }
   ```
   
   Step 2 : Create and write to our new table using the old SERDE, *then* switch it back to the normal parquet one. This is important because Iceberg needs "parquet" to be in the format for migrate to work
   ```java
       File location = temp.newFolder();
       // Create the table using 
       sql("CREATE TABLE %s (col1 ARRAY<STRUCT<col2 INT>>)" +
           " ROW FORMAT SERDE '%s'" +
           "  STORED AS INPUTFORMAT" +
           " 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'" +
           "  OUTPUTFORMAT" +
           " 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'" +
           " LOCATION '%s'", tableName, OldParSerde.class.getName(), location);
   
       int testValue = 12345;
       sql(String.format("INSERT INTO %s VALUES (ARRAY(STRUCT(%s)))", tableName, testValue));
       List<Object[]> expected = sql(String.format("SELECT * FROM %s", tableName));
   
       sql("ALTER TABLE %s SET SERDE '%s'", tableName, ParquetHiveSerDe.class.getName());
   ```
   
   Here is the file generated
   ```
   file:            file:/var/folders/yl/6cwgks7919s1td2mfdq86cbm0000gn/T/junit5001523939508400535/junit1467988604906296682/part-00000-24c39fe2-a264-4e77-b23e-6b89a5883986-c000
   creator:         parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94)
   
   file schema:     hive_schema
   --------------------------------------------------------------------------------
   col1:            OPTIONAL F:1
   .bag:            REPEATED F:1
   ..array_element: OPTIONAL F:1
   ...col2:         OPTIONAL INT32 R:1 D:4
   
   row group 1:     RC:1 TS:41 OFFSET:4
   --------------------------------------------------------------------------------
   col1:
   .bag:
   ..array_element:
   ...col2:          INT32 SNAPPY DO:0 FPO:4 SZ:43/41/0.95 VC:1 ENC:PLAIN,RLE ST:[min: 12345, max: 12345, num_nulls: 0]
   ```
   
   Step 3: Migrate the now normal hive table
   ```java
    // migrate table
       SparkActions.get().migrateTable(tableName).execute();
   
       // check migrated table is returning expected result
       List<Object[]> results = sql(String.format("SELECT * FROM %s", tableName));
       Assert.assertTrue(results.size() > 0);
       assertEquals("Output must match", expected, results);
   ```
   
   Output :
   ```
   java.lang.AssertionError: Output must match: row 1 col 1 contents should match expected:<[[12345]]> but was:<null>
   	at org.junit.Assert.fail(Assert.java:89)
   ```
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r771588673



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetReader.java
##########
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.File;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.iceberg.spark.SparkTestBaseWithCatalog;
+import org.apache.iceberg.spark.actions.SparkActions;
+import org.apache.spark.sql.SparkSession;
+import org.junit.AfterClass;
+import org.junit.Assert;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+public class TestParquetReader extends SparkTestBaseWithCatalog {
+  private static SparkSession spark = null;
+  private static String catalogName = "spark_catalog";
+  private static String implementation = SparkSessionCatalog.class.getName();
+  private static Map<String, String> config = ImmutableMap.of(
+      "type", "hive",
+      "default-namespace", "default",
+      "parquet-enabled", "true",
+      "cache-enabled", "false"
+  );
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  public TestParquetReader() {
+    super(catalogName, implementation, config);
+    spark.sessionState().catalogManager().catalog(catalogName);
+  }
+
+  @BeforeClass
+  public static void startSpark() {
+    TestParquetReader.spark = SparkSession.builder().master("local[2]")
+            .config("spark.sql.parquet.writeLegacyFormat", true)
+            .getOrCreate();
+    spark.sessionState().catalogManager().catalog("spark_catalog");
+  }
+
+  @AfterClass
+  public static void stopSpark() {
+    SparkSession currentSpark = TestParquetReader.spark;
+    TestParquetReader.spark = null;
+    currentSpark.stop();
+  }
+
+  @Test

Review comment:
       AHI missed the "legacy mode" flag! Well let's just set that on a per test basis in the TestCreate suit




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r771594390



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetReader.java
##########
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.File;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.iceberg.spark.SparkTestBaseWithCatalog;
+import org.apache.iceberg.spark.actions.SparkActions;
+import org.apache.spark.sql.SparkSession;
+import org.junit.AfterClass;
+import org.junit.Assert;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+public class TestParquetReader extends SparkTestBaseWithCatalog {
+  private static SparkSession spark = null;
+  private static String catalogName = "spark_catalog";
+  private static String implementation = SparkSessionCatalog.class.getName();
+  private static Map<String, String> config = ImmutableMap.of(
+      "type", "hive",
+      "default-namespace", "default",
+      "parquet-enabled", "true",
+      "cache-enabled", "false"
+  );
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  public TestParquetReader() {
+    super(catalogName, implementation, config);
+    spark.sessionState().catalogManager().catalog(catalogName);
+  }
+
+  @BeforeClass
+  public static void startSpark() {
+    TestParquetReader.spark = SparkSession.builder().master("local[2]")
+            .config("spark.sql.parquet.writeLegacyFormat", true)
+            .getOrCreate();
+    spark.sessionState().catalogManager().catalog("spark_catalog");
+  }
+
+  @AfterClass
+  public static void stopSpark() {
+    SparkSession currentSpark = TestParquetReader.spark;
+    TestParquetReader.spark = null;
+    currentSpark.stop();
+  }
+
+  @Test

Review comment:
       Yea, as we discussed with legacy mode these tests do fail without this change. I will move the tests to testCreate as suggested. Thanks




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r771496005



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetReader.java
##########
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.File;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.iceberg.spark.SparkTestBaseWithCatalog;
+import org.apache.iceberg.spark.actions.SparkActions;
+import org.apache.spark.sql.SparkSession;
+import org.junit.AfterClass;
+import org.junit.Assert;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+public class TestParquetReader extends SparkTestBaseWithCatalog {

Review comment:
       Sorry I should have been specific, I believe these should be in TestCreateActions (where the tests for migrate and snapshot live)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer edited a comment on pull request #3723: Demonstrate issue where Iceberg's parquet reader silently returns nulls

Posted by GitBox <gi...@apache.org>.
RussellSpitzer edited a comment on pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#issuecomment-992060629


   I think it's worthwhile to add backwards compatibility. For reference we discussed this on slack and I noted the one case where we hit this internally #2167 but there we ended up just writing our files without the old structure in the source of the files. We hadn't hit it since then even with various other older hadoop distributions lying around.
   
   That said if other folks are hitting this we should fix it.
   
   It would be great to know the exact versions of hive/parquet that are generating this structure.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r771040472



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetReader.java
##########
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.File;
+import java.io.FilenameFilter;
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.Objects;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.iceberg.Files;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.mapping.MappingUtil;
+import org.apache.iceberg.parquet.Parquet;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.util.HadoopInputFile;
+import org.apache.parquet.schema.MessageType;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.SaveMode;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.ArrayType;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.junit.AfterClass;
+import org.junit.Assert;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+public class TestParquetReader {
+  private static SparkSession spark = null;
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  @BeforeClass
+  public static void startSpark() {
+    TestParquetReader.spark = SparkSession.builder().master("local[2]")
+            .config("spark.sql.parquet.writeLegacyFormat", true)
+            .getOrCreate();
+  }
+
+  @AfterClass
+  public static void stopSpark() {
+    SparkSession currentSpark = TestParquetReader.spark;
+    TestParquetReader.spark = null;
+    currentSpark.stop();
+  }
+
+  @Test
+  public void testHiveStyleThreeLevelList() throws IOException {
+    File location = new File(temp.getRoot(), "parquetReaderTest");
+    StructType sparkSchema =
+        new StructType(
+            new StructField[]{
+                new StructField(
+                    "col1", new ArrayType(
+                        new StructType(
+                            new StructField[]{
+                                new StructField(
+                                    "col2",
+                                    DataTypes.IntegerType,
+                                    false,
+                                    Metadata.empty())
+                            }), true), true, Metadata.empty())});
+
+    String expectedParquetSchema =
+        "message spark_schema {\n" +
+            "  optional group col1 (LIST) {\n" +
+            "    repeated group bag {\n" +
+            "      optional group array {\n" +
+            "        required int32 col2;\n" +
+            "      }\n" +
+            "    }\n" +
+            "  }\n" +
+            "}\n";
+
+
+    // generate parquet file with required schema
+    List<String> testData = Collections.singletonList("{\"col1\": [{\"col2\": 1}]}");
+    spark.read().schema(sparkSchema).json(

Review comment:
       Sure

##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetReader.java
##########
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.File;
+import java.io.FilenameFilter;
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.Objects;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.iceberg.Files;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.mapping.MappingUtil;
+import org.apache.iceberg.parquet.Parquet;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.util.HadoopInputFile;
+import org.apache.parquet.schema.MessageType;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.SaveMode;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.ArrayType;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.junit.AfterClass;
+import org.junit.Assert;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+public class TestParquetReader {
+  private static SparkSession spark = null;
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  @BeforeClass
+  public static void startSpark() {
+    TestParquetReader.spark = SparkSession.builder().master("local[2]")
+            .config("spark.sql.parquet.writeLegacyFormat", true)
+            .getOrCreate();
+  }
+
+  @AfterClass
+  public static void stopSpark() {
+    SparkSession currentSpark = TestParquetReader.spark;
+    TestParquetReader.spark = null;
+    currentSpark.stop();
+  }
+
+  @Test
+  public void testHiveStyleThreeLevelList() throws IOException {
+    File location = new File(temp.getRoot(), "parquetReaderTest");
+    StructType sparkSchema =

Review comment:
       I think the reason I had explicitly defined the schemas here is to demonstrate the issue a bit more clearly. However, I think now that we all are on same page in terms of the issue, I can make the tests a bit easy to follow. Will follow up with an update shortly.

##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetReader.java
##########
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.File;
+import java.io.FilenameFilter;
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.Objects;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.iceberg.Files;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.mapping.MappingUtil;
+import org.apache.iceberg.parquet.Parquet;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.util.HadoopInputFile;
+import org.apache.parquet.schema.MessageType;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.SaveMode;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.ArrayType;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.junit.AfterClass;
+import org.junit.Assert;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+public class TestParquetReader {
+  private static SparkSession spark = null;
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  @BeforeClass
+  public static void startSpark() {
+    TestParquetReader.spark = SparkSession.builder().master("local[2]")
+            .config("spark.sql.parquet.writeLegacyFormat", true)
+            .getOrCreate();
+  }
+
+  @AfterClass
+  public static void stopSpark() {
+    SparkSession currentSpark = TestParquetReader.spark;
+    TestParquetReader.spark = null;
+    currentSpark.stop();
+  }
+
+  @Test
+  public void testHiveStyleThreeLevelList() throws IOException {
+    File location = new File(temp.getRoot(), "parquetReaderTest");
+    StructType sparkSchema =
+        new StructType(
+            new StructField[]{
+                new StructField(
+                    "col1", new ArrayType(
+                        new StructType(
+                            new StructField[]{
+                                new StructField(
+                                    "col2",
+                                    DataTypes.IntegerType,
+                                    false,
+                                    Metadata.empty())
+                            }), true), true, Metadata.empty())});
+
+    String expectedParquetSchema =
+        "message spark_schema {\n" +
+            "  optional group col1 (LIST) {\n" +
+            "    repeated group bag {\n" +
+            "      optional group array {\n" +
+            "        required int32 col2;\n" +
+            "      }\n" +
+            "    }\n" +
+            "  }\n" +
+            "}\n";
+
+
+    // generate parquet file with required schema
+    List<String> testData = Collections.singletonList("{\"col1\": [{\"col2\": 1}]}");
+    spark.read().schema(sparkSchema).json(
+            JavaSparkContext.fromSparkContext(spark.sparkContext()).parallelize(testData))
+        .coalesce(1).write().format("parquet").mode(SaveMode.Append).save(location.getPath());
+
+    File parquetFile = Arrays.stream(Objects.requireNonNull(location.listFiles(new FilenameFilter() {
+      @Override
+      public boolean accept(File dir, String name) {
+        return name.endsWith("parquet");
+      }
+    }))).findAny().get();
+
+    // verify generated parquet file has expected schema
+    ParquetFileReader pqReader = ParquetFileReader.open(
+        HadoopInputFile.fromPath(new Path(parquetFile.getPath()), new Configuration()));
+    MessageType schema = pqReader.getFooter().getFileMetaData().getSchema();
+    Assert.assertEquals(expectedParquetSchema, schema.toString());
+
+    // read from Iceberg's parquet reader and ensure data is read correctly into Spark's internal row
+    Schema icebergSchema = SparkSchemaUtil.convert(sparkSchema);

Review comment:
       Good idea, will update shortly.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r770981780



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetReader.java
##########
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.File;
+import java.io.FilenameFilter;
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.Objects;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.iceberg.Files;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.mapping.MappingUtil;
+import org.apache.iceberg.parquet.Parquet;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.util.HadoopInputFile;
+import org.apache.parquet.schema.MessageType;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.SaveMode;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.ArrayType;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.junit.AfterClass;
+import org.junit.Assert;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+public class TestParquetReader {
+  private static SparkSession spark = null;
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  @BeforeClass
+  public static void startSpark() {
+    TestParquetReader.spark = SparkSession.builder().master("local[2]")
+            .config("spark.sql.parquet.writeLegacyFormat", true)
+            .getOrCreate();
+  }
+
+  @AfterClass
+  public static void stopSpark() {
+    SparkSession currentSpark = TestParquetReader.spark;
+    TestParquetReader.spark = null;
+    currentSpark.stop();
+  }
+
+  @Test
+  public void testHiveStyleThreeLevelList() throws IOException {
+    File location = new File(temp.getRoot(), "parquetReaderTest");
+    StructType sparkSchema =
+        new StructType(
+            new StructField[]{
+                new StructField(
+                    "col1", new ArrayType(
+                        new StructType(
+                            new StructField[]{
+                                new StructField(
+                                    "col2",
+                                    DataTypes.IntegerType,
+                                    false,
+                                    Metadata.empty())
+                            }), true), true, Metadata.empty())});
+
+    String expectedParquetSchema =
+        "message spark_schema {\n" +
+            "  optional group col1 (LIST) {\n" +
+            "    repeated group bag {\n" +
+            "      optional group array {\n" +
+            "        required int32 col2;\n" +
+            "      }\n" +
+            "    }\n" +
+            "  }\n" +
+            "}\n";
+
+
+    // generate parquet file with required schema
+    List<String> testData = Collections.singletonList("{\"col1\": [{\"col2\": 1}]}");
+    spark.read().schema(sparkSchema).json(
+            JavaSparkContext.fromSparkContext(spark.sparkContext()).parallelize(testData))
+        .coalesce(1).write().format("parquet").mode(SaveMode.Append).save(location.getPath());
+
+    File parquetFile = Arrays.stream(Objects.requireNonNull(location.listFiles(new FilenameFilter() {
+      @Override
+      public boolean accept(File dir, String name) {
+        return name.endsWith("parquet");
+      }
+    }))).findAny().get();
+
+    // verify generated parquet file has expected schema
+    ParquetFileReader pqReader = ParquetFileReader.open(
+        HadoopInputFile.fromPath(new Path(parquetFile.getPath()), new Configuration()));
+    MessageType schema = pqReader.getFooter().getFileMetaData().getSchema();
+    Assert.assertEquals(expectedParquetSchema, schema.toString());
+
+    // read from Iceberg's parquet reader and ensure data is read correctly into Spark's internal row
+    Schema icebergSchema = SparkSchemaUtil.convert(sparkSchema);

Review comment:
       I think we should do a test to make sure files imported are read correctly, instead of using the internal methods doing an import files into an iceberg table and showing that the rows are correct?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r771503787



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetReader.java
##########
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.File;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.iceberg.spark.SparkTestBaseWithCatalog;
+import org.apache.iceberg.spark.actions.SparkActions;
+import org.apache.spark.sql.SparkSession;
+import org.junit.AfterClass;
+import org.junit.Assert;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+public class TestParquetReader extends SparkTestBaseWithCatalog {
+  private static SparkSession spark = null;
+  private static String catalogName = "spark_catalog";
+  private static String implementation = SparkSessionCatalog.class.getName();
+  private static Map<String, String> config = ImmutableMap.of(
+      "type", "hive",
+      "default-namespace", "default",
+      "parquet-enabled", "true",
+      "cache-enabled", "false"
+  );
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  public TestParquetReader() {
+    super(catalogName, implementation, config);
+    spark.sessionState().catalogManager().catalog(catalogName);
+  }
+
+  @BeforeClass
+  public static void startSpark() {
+    TestParquetReader.spark = SparkSession.builder().master("local[2]")
+            .config("spark.sql.parquet.writeLegacyFormat", true)
+            .getOrCreate();
+    spark.sessionState().catalogManager().catalog("spark_catalog");
+  }
+
+  @AfterClass
+  public static void stopSpark() {
+    SparkSession currentSpark = TestParquetReader.spark;
+    TestParquetReader.spark = null;
+    currentSpark.stop();
+  }
+
+  @Test
+  public void testHiveStyleThreeLevelList1() throws Exception {

Review comment:
       Why List1()?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r771629679



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ApplyNameMapping.java
##########
@@ -88,6 +96,31 @@ public Type primitive(PrimitiveType primitive) {
     return field == null ? primitive : primitive.withId(field.id());
   }
 
+  @Override
+  public void beforeElementField(Type element) {
+    super.beforeElementField(makeElement(element));
+  }
+
+  @Override
+  public void afterElementField(Type element) {
+    super.afterElementField(makeElement(element));
+  }
+
+  private Type makeElement(Type element) {
+    // List's element in 3-level lists can be named differently across different parquet writers.
+    // For example, hive names it "array_element", whereas new parquet writers names it as "element".
+    if (element.getName().equals("element") || element.isPrimitive()) {

Review comment:
       nit: static variable for "element"

##########
File path: site/docs/spark-procedures.md
##########
@@ -334,6 +334,10 @@ CALL catalog_name.system.rewrite_manifests('db.sample', false)
 
 The `snapshot` and `migrate` procedures help test and migrate existing Hive or Spark tables to Iceberg.
 
+**Note** Parquet files written with Parquet writers that use names other than `list` and `element` for repeated group
+and element of the list respectively are **read incorrectly** by Iceberg upto 0.12.1 Iceberg versions. Parquet files
+generated by Hive fall in this category.

Review comment:
       use `!!! Note` for note.
   
   Also "Parquet files generated by Hive" sounds a bit vague. I think we can specify example cases are Parquet files written using (1) `org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe` in Hive, or (2) `spark.sql.parquet.writeLegacyFormat=true` in Spark.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r771684086



##########
File path: site/docs/spark-procedures.md
##########
@@ -334,6 +339,13 @@ CALL catalog_name.system.rewrite_manifests('db.sample', false)
 
 The `snapshot` and `migrate` procedures help test and migrate existing Hive or Spark tables to Iceberg.
 
+!!! Note
+    Parquet files written with Parquet writers that use names other than `list` and `element` for repeated group
+    and element of the list respectively are **read incorrectly as nulls** by Iceberg upto 0.12.1 Iceberg versions.
+    Most commonly such files are written by follow writers. 

Review comment:
       sure, makes sense




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3723: Fix Iceberg's parquet writer returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r768093798



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ApplyNameMapping.java
##########
@@ -88,6 +96,31 @@ public Type primitive(PrimitiveType primitive) {
     return field == null ? primitive : primitive.withId(field.id());
   }
 
+  @Override
+  public void beforeElementField(Type element) {
+    super.beforeElementField(makeElement(element));
+  }
+
+  @Override
+  public void afterElementField(Type element) {
+    super.afterElementField(makeElement(element));
+  }
+
+  private Type makeElement(Type element) {
+    // List's element in 3-level lists can be named differently across different parquet writers.
+    // For example, hive names it "array_element", whereas new parquet writers names it as "element".
+    if (element.getName().equals("element") || element.isPrimitive()) {
+      return element;
+    }

Review comment:
       I think we can make it generic from the perspective that we don't have magic keywords like "element", however I am not sure if the projection change you are referring to would work here, but do let me know if you think otherwise. Here, we are making this change so that constructed parquet schema had correct ids, but still has the same schema as the one in parquet file. I will at least make this part generic, will update the diff shortly.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on pull request #3723: Demonstrate issue where Iceberg's parquet reader silently returns nulls

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#issuecomment-992004729


   @RussellSpitzer @rdblue what do you think about this, while I work on adding backwards compatibility support?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev edited a comment on pull request #3723: Demonstrate issue where Iceberg's parquet reader silently returns nulls

Posted by GitBox <gi...@apache.org>.
SinghAsDev edited a comment on pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#issuecomment-992053535


   @kbendick thanks for quick review.
   
   > Does this also happen with OSS Iceberg?
   
   Yes.
   
   > It would be great if instead of using a checked in file, Spark was used to generate the data instead (although I know that doesn't capture every case).
   
   Yea, the issue is that this issue only happens for parquet files generated from old non iceberg parquet writer. I have also mentioned this as a comment in the test file.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on pull request #3723: Fix Iceberg's parquet writer returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#issuecomment-992304112


   @kbendick @RussellSpitzer I updated the PR with fix for this issue. However, the fix is only for 3-level lists and I will work on compatibility support in a separate diff. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #3723: Fix Iceberg's parquet writer returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r768071019



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ApplyNameMapping.java
##########
@@ -88,6 +96,31 @@ public Type primitive(PrimitiveType primitive) {
     return field == null ? primitive : primitive.withId(field.id());
   }
 
+  @Override
+  public void beforeElementField(Type element) {
+    super.beforeElementField(makeElement(element));
+  }
+
+  @Override
+  public void afterElementField(Type element) {
+    super.afterElementField(makeElement(element));
+  }
+
+  private Type makeElement(Type element) {
+    // List's element in 3-level lists can be named differently across different parquet writers.
+    // For example, hive names it "array_element", whereas new parquet writers names it as "element".
+    if (element.getName().equals("element") || element.isPrimitive()) {
+      return element;
+    }

Review comment:
       If it's possible to make this more generic and remove the dependency on the name entirely, that would be great.
   
   Also, could you try to make a test that uses the logic in that file to design the schema instead of a checked in binary file? I know that might be difficult but if you wouldn't mind giving it a try, that would be preferred vs a binary file checked in.
   
   As mentioned, here's a similar PR where you can find good test logic for making arbitrarily named Parquet message types: https://github.com/apache/iceberg/pull/3309




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3723: Fix Iceberg's parquet writer returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r768113767



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ApplyNameMapping.java
##########
@@ -88,6 +96,31 @@ public Type primitive(PrimitiveType primitive) {
     return field == null ? primitive : primitive.withId(field.id());
   }
 
+  @Override
+  public void beforeElementField(Type element) {
+    super.beforeElementField(makeElement(element));
+  }
+
+  @Override
+  public void afterElementField(Type element) {
+    super.afterElementField(makeElement(element));
+  }
+
+  private Type makeElement(Type element) {
+    // List's element in 3-level lists can be named differently across different parquet writers.
+    // For example, hive names it "array_element", whereas new parquet writers names it as "element".
+    if (element.getName().equals("element") || element.isPrimitive()) {
+      return element;
+    }

Review comment:
       I tried a bit to remove the usage of "element", however there isn't a clean way to do so. The only way I could come up with is build a dummy list and then get the element out of it. I am not sure if that is any cleaner than the current approach. Below is the change I am referring to.
   ```
     private Type makeElement(Type element) {
       // List's element in 3-level lists can be named differently across different parquet writers.
       // For example, hive names it "array_element", whereas new parquet writers names it as "element".
       if (element.isPrimitive()) {
         return element;
       }
   
       Types.BaseListBuilder.GroupElementBuilder<GroupType, Types.ListBuilder<GroupType>> dummyBuilder = Types.list(Type.Repetition.OPTIONAL)
               .groupElement(element.getRepetition())
               .addFields(element.asGroupType().getFields().toArray(new Type[0]));
       if (element.getId() != null) {
         dummyBuilder.id(element.getId().intValue());
       }
       return dummyBuilder.named("dummy").getType(0).asGroupType().getType(0);
     }
   ```
   
   As such, I don't have a strong preference on it. @kbendick let me know if you like this one better and I will make the change.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #3723: Demonstrate issue where Iceberg's parquet reader silently returns nulls

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#issuecomment-992060629


   I think it's worthwhile to add backwards compatibility. For reference we discussed this on slack and I noted the one case where we hit this internally #2167 but there we ended up just writing our files without the old structure in the source of the files. We hadn't hit it since then even with various other older hadoop distributions lying around.
   
   That said if other folks are hitting this we should fix it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev edited a comment on pull request #3723: Demonstrate issue where Iceberg's parquet reader silently returns nulls

Posted by GitBox <gi...@apache.org>.
SinghAsDev edited a comment on pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#issuecomment-992079085


   `bag` is added by hive's [schema converter](https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/HiveSchemaConverter.java#L125) (defined [here](https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java#L53). We use this serde from hive's 1.2.1 version. However, even latest versions still use `bag` [here](https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/HiveSchemaConverter.java#L155).
   
   Another thing to note here is that even if we use `list` as name for repeated group of a list, I think Iceberg returns just empty list. I am working on a fix for this, along with backward compatibility support. Would love to get your reviews on that as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3723: Fix Iceberg's parquet writer returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r767783605



##########
File path: site/docs/spark-procedures.md
##########
@@ -334,6 +334,10 @@ CALL catalog_name.system.rewrite_manifests('db.sample', false)
 
 The `snapshot` and `migrate` procedures help test and migrate existing Hive or Spark tables to Iceberg.
 
+**Note** Parquet files written with Parquet writers that use names other than `list` and `element` for repeated group
+and element of the list respectively are **read incorrectly** by Iceberg upto 0.12.1 Iceberg versions. Parquet files
+generated by Hive fall in this category.

Review comment:
       Can we try to get a version bound here? Ideally the oss hive version where this starts and stops being an issue?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #3723: Fix Iceberg's parquet writer returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r768064340



##########
File path: site/docs/spark-procedures.md
##########
@@ -334,6 +334,10 @@ CALL catalog_name.system.rewrite_manifests('db.sample', false)
 
 The `snapshot` and `migrate` procedures help test and migrate existing Hive or Spark tables to Iceberg.
 
+**Note** Parquet files written with Parquet writers that use names other than `list` and `element` for repeated group
+and element of the list respectively are **read incorrectly** by Iceberg upto 0.12.1 Iceberg versions. Parquet files
+generated by Hive fall in this category.

Review comment:
       I checked 3.x and it's still called `bag`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on pull request #3723: Fix Iceberg's parquet writer returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#issuecomment-993880217


   Hey @kbendick @RussellSpitzer would you be able to kick start the builds again? I had to fix a couple of checkStyle errors.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r771522012



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetReader.java
##########
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.File;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.iceberg.spark.SparkTestBaseWithCatalog;
+import org.apache.iceberg.spark.actions.SparkActions;
+import org.apache.spark.sql.SparkSession;
+import org.junit.AfterClass;
+import org.junit.Assert;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+public class TestParquetReader extends SparkTestBaseWithCatalog {
+  private static SparkSession spark = null;
+  private static String catalogName = "spark_catalog";
+  private static String implementation = SparkSessionCatalog.class.getName();
+  private static Map<String, String> config = ImmutableMap.of(
+      "type", "hive",
+      "default-namespace", "default",
+      "parquet-enabled", "true",
+      "cache-enabled", "false"
+  );
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  public TestParquetReader() {
+    super(catalogName, implementation, config);
+    spark.sessionState().catalogManager().catalog(catalogName);
+  }
+
+  @BeforeClass
+  public static void startSpark() {
+    TestParquetReader.spark = SparkSession.builder().master("local[2]")
+            .config("spark.sql.parquet.writeLegacyFormat", true)
+            .getOrCreate();
+    spark.sessionState().catalogManager().catalog("spark_catalog");
+  }
+
+  @AfterClass
+  public static void stopSpark() {
+    SparkSession currentSpark = TestParquetReader.spark;
+    TestParquetReader.spark = null;
+    currentSpark.stop();
+  }
+
+  @Test
+  public void testHiveStyleThreeLevelList1() throws Exception {
+    String tableName = "default.testHiveStyleThreeLevelList";
+    File location = temp.newFolder();
+    sql(String.format("CREATE TABLE %s (col1 ARRAY<STRUCT<col2 INT>>)" +
+        " STORED AS parquet" +
+        " LOCATION '%s'", tableName, location));
+
+    int testValue = 12345;
+    sql(String.format("INSERT INTO %s VALUES (ARRAY(STRUCT(%s)))", tableName, testValue));

Review comment:
       our "sql" method doesn't require a format, it uses the first arg as a format string
   It's essentially
   
   sql( format, args*) {
      return spark.sql(String.format(format, args))
   }




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r771677391



##########
File path: site/docs/spark-procedures.md
##########
@@ -334,6 +339,13 @@ CALL catalog_name.system.rewrite_manifests('db.sample', false)
 
 The `snapshot` and `migrate` procedures help test and migrate existing Hive or Spark tables to Iceberg.
 
+!!! Note
+    Parquet files written with Parquet writers that use names other than `list` and `element` for repeated group
+    and element of the list respectively are **read incorrectly as nulls** by Iceberg upto 0.12.1 Iceberg versions.
+    Most commonly such files are written by follow writers. 

Review comment:
       Hey @jackye1995 , to ease the cherry-picks we decided to extract out format change to a separate PR. Will follow up on this right after we get this PR in.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] uncleGen removed a comment on pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
uncleGen removed a comment on pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#issuecomment-998427166


   IIUC, this pr dose not resolve #3604, right?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r771566311



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetReader.java
##########
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.File;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.iceberg.spark.SparkTestBaseWithCatalog;
+import org.apache.iceberg.spark.actions.SparkActions;
+import org.apache.spark.sql.SparkSession;
+import org.junit.AfterClass;
+import org.junit.Assert;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+public class TestParquetReader extends SparkTestBaseWithCatalog {
+  private static SparkSession spark = null;
+  private static String catalogName = "spark_catalog";
+  private static String implementation = SparkSessionCatalog.class.getName();
+  private static Map<String, String> config = ImmutableMap.of(
+      "type", "hive",
+      "default-namespace", "default",
+      "parquet-enabled", "true",
+      "cache-enabled", "false"
+  );
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  public TestParquetReader() {
+    super(catalogName, implementation, config);
+    spark.sessionState().catalogManager().catalog(catalogName);
+  }
+
+  @BeforeClass
+  public static void startSpark() {
+    TestParquetReader.spark = SparkSession.builder().master("local[2]")
+            .config("spark.sql.parquet.writeLegacyFormat", true)
+            .getOrCreate();
+    spark.sessionState().catalogManager().catalog("spark_catalog");
+  }
+
+  @AfterClass
+  public static void stopSpark() {
+    SparkSession currentSpark = TestParquetReader.spark;
+    TestParquetReader.spark = null;
+    currentSpark.stop();
+  }
+
+  @Test

Review comment:
       Full test example for TestCreateActions
   
   ```java
   @Test
     public void testHiveStyleThreeLevelList1() throws Exception {
       File location = temp.newFolder();
   
       // Create the table as an old mapred style Hive table so the lists elements have different names
       sql("CREATE TABLE %s (col1 ARRAY<STRUCT<col2 INT>>) ROW FORMAT SERDE '%s' " +
               "STORED AS INPUTFORMAT '%s' OUTPUTFORMAT '%s' LOCATION '%s'",
           tableName, OldParSerde.class.getName(),
           MapredParquetInputFormat.class.getName(), MapredParquetOutputFormat.class.getName(), location);
   
       int testValue = 12345;
       sql(String.format("INSERT INTO %s VALUES (ARRAY(STRUCT(%s)))", tableName, testValue));
       List<Object[]> expected = sql(String.format("SELECT * FROM %s", tableName));
   
       // Change Storage Format so Iceberg Migrate understands how to Import
       sql("ALTER TABLE %s SET SERDE '%s'", tableName, ParquetHiveSerDe.class.getName());
   
       // migrate table
       SparkActions.get().migrateTable(tableName).execute();
   
       // check migrated table is returning expected result
       List<Object[]> results = sql(String.format("SELECT * FROM %s", tableName));
       Assert.assertTrue(results.size() > 0);
       assertEquals("Output must match", expected, results);
     }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3723: Fix Iceberg's parquet reader returning nulls incorrectly for parquet files written by writers that don't use list and element as names.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3723:
URL: https://github.com/apache/iceberg/pull/3723#discussion_r771522012



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetReader.java
##########
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.File;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.iceberg.spark.SparkTestBaseWithCatalog;
+import org.apache.iceberg.spark.actions.SparkActions;
+import org.apache.spark.sql.SparkSession;
+import org.junit.AfterClass;
+import org.junit.Assert;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+public class TestParquetReader extends SparkTestBaseWithCatalog {
+  private static SparkSession spark = null;
+  private static String catalogName = "spark_catalog";
+  private static String implementation = SparkSessionCatalog.class.getName();
+  private static Map<String, String> config = ImmutableMap.of(
+      "type", "hive",
+      "default-namespace", "default",
+      "parquet-enabled", "true",
+      "cache-enabled", "false"
+  );
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  public TestParquetReader() {
+    super(catalogName, implementation, config);
+    spark.sessionState().catalogManager().catalog(catalogName);
+  }
+
+  @BeforeClass
+  public static void startSpark() {
+    TestParquetReader.spark = SparkSession.builder().master("local[2]")
+            .config("spark.sql.parquet.writeLegacyFormat", true)
+            .getOrCreate();
+    spark.sessionState().catalogManager().catalog("spark_catalog");
+  }
+
+  @AfterClass
+  public static void stopSpark() {
+    SparkSession currentSpark = TestParquetReader.spark;
+    TestParquetReader.spark = null;
+    currentSpark.stop();
+  }
+
+  @Test
+  public void testHiveStyleThreeLevelList1() throws Exception {
+    String tableName = "default.testHiveStyleThreeLevelList";
+    File location = temp.newFolder();
+    sql(String.format("CREATE TABLE %s (col1 ARRAY<STRUCT<col2 INT>>)" +
+        " STORED AS parquet" +
+        " LOCATION '%s'", tableName, location));
+
+    int testValue = 12345;
+    sql(String.format("INSERT INTO %s VALUES (ARRAY(STRUCT(%s)))", tableName, testValue));

Review comment:
       our "sql" method doesn't require a format, it uses the first arg as a format string
   It's essentially
   ```
   sql( format, args*) {
      return spark.sql(String.format(format, args))
   }```

##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestParquetReader.java
##########
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.File;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.iceberg.spark.SparkTestBaseWithCatalog;
+import org.apache.iceberg.spark.actions.SparkActions;
+import org.apache.spark.sql.SparkSession;
+import org.junit.AfterClass;
+import org.junit.Assert;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+public class TestParquetReader extends SparkTestBaseWithCatalog {
+  private static SparkSession spark = null;
+  private static String catalogName = "spark_catalog";
+  private static String implementation = SparkSessionCatalog.class.getName();
+  private static Map<String, String> config = ImmutableMap.of(
+      "type", "hive",
+      "default-namespace", "default",
+      "parquet-enabled", "true",
+      "cache-enabled", "false"
+  );
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  public TestParquetReader() {
+    super(catalogName, implementation, config);
+    spark.sessionState().catalogManager().catalog(catalogName);
+  }
+
+  @BeforeClass
+  public static void startSpark() {
+    TestParquetReader.spark = SparkSession.builder().master("local[2]")
+            .config("spark.sql.parquet.writeLegacyFormat", true)
+            .getOrCreate();
+    spark.sessionState().catalogManager().catalog("spark_catalog");
+  }
+
+  @AfterClass
+  public static void stopSpark() {
+    SparkSession currentSpark = TestParquetReader.spark;
+    TestParquetReader.spark = null;
+    currentSpark.stop();
+  }
+
+  @Test
+  public void testHiveStyleThreeLevelList1() throws Exception {
+    String tableName = "default.testHiveStyleThreeLevelList";
+    File location = temp.newFolder();
+    sql(String.format("CREATE TABLE %s (col1 ARRAY<STRUCT<col2 INT>>)" +
+        " STORED AS parquet" +
+        " LOCATION '%s'", tableName, location));
+
+    int testValue = 12345;
+    sql(String.format("INSERT INTO %s VALUES (ARRAY(STRUCT(%s)))", tableName, testValue));

Review comment:
       our "sql" method doesn't require a format, it uses the first arg as a format string
   It's essentially
   ```
   sql( format, args*) {
      return spark.sql(String.format(format, args))
   }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org