You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/01/28 02:12:13 UTC

[GitHub] [iceberg] RussellSpitzer opened a new pull request #2167: Fix for Conversion of Parquet ByteArray to Iceberg Schema

RussellSpitzer opened a new pull request #2167:
URL: https://github.com/apache/iceberg/pull/2167


   Previously the Iceberg conversion functions for Parquet would throw an exception if
   they encountered a Binary type field. This was internally represented as a repeated
   primitive field that is not nested in another group type. This violated some expecations
   within our schema conversion code.
   
   We encountered this with a user who was using Parquet's AvroParquetWriter class to write Parquet files. The files, while readable by hive and spark, were not readable by iceberg.
   
   Investigating this I found the following Avro Schema element caused the problem
   
   ```java
     String schema = "{\n" +
           "   \"type\":\"record\",\n" +
           "   \"name\":\"DbRecord\",\n" +
           "   \"namespace\":\"com.russ\",\n" +
           "   \"fields\":[\n" +
           "      {\n" +
           "         \"name\":\"foo\",\n" +
           "         \"type\":[\n" +
           "            \"null\",\n" +
           "            {\n" +
           "               \"type\":\"array\",\n" +
           "               \"items\":\"bytes\"\n" +
           "            }\n" +
           "         ],\n" +
           "         \"default\":null\n" +
           "      }\n" +
           "   ]\n" +
           "}";
      ```
      
      Parquet would convert this element into
      
      ```
      foo:
    OPTIONAL F:1
      .array: REPEATED BINARY R:1 D:2
      ```
      
      Which violates Iceberg's reader, which assumes the list will be nested.
      
      Doing a quick test with
      ```
      org.apache.avro.Schema.Parser parser = new org.apache.avro.Schema.Parser();
       org.apache.avro.Schema avroSchema = parser.parse(schema);
       AvroSchemaConverter converter = new AvroSchemaConverter();
       MessageType parquetSchema = converter.convert(avroSchema);
   ```
   
   I saw that this was reproducible in the current version of Parquet and not just in our User's code.
   
   To fix this I added some tests for this particular datatype and loosened some of the restrictions
   in our Parquet Schema parsing code.
      


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue edited a comment on pull request #2167: Fix for Conversion of Parquet ByteArray to Iceberg Schema

Posted by GitBox <gi...@apache.org>.
rdblue edited a comment on pull request #2167:
URL: https://github.com/apache/iceberg/pull/2167#issuecomment-769413615


   This doesn't seem _that_ bad. Can you add a test that writes a Parquet file with that schema and validates that Spark can read it? Maybe a `TestMalformedParquet` suite?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2167: Fix for Conversion of Parquet ByteArray to Iceberg Schema

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2167:
URL: https://github.com/apache/iceberg/pull/2167#discussion_r571340787



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/TypeWithSchemaVisitor.java
##########
@@ -63,10 +63,12 @@
             Preconditions.checkArgument(group.getFieldCount() == 1,
                 "Invalid list: does not contain single repeated field: %s", group);
 
-            GroupType repeatedElement = group.getFields().get(0).asGroupType();
+            Type repeatedElement = group.getFields().get(0);
+
             Preconditions.checkArgument(repeatedElement.isRepetition(Type.Repetition.REPEATED),
                 "Invalid list: inner group is not repeated");
-            Preconditions.checkArgument(repeatedElement.getFieldCount() <= 1,
+            Preconditions.checkArgument(repeatedElement.isPrimitive() ||
+                    repeatedElement.asGroupType().getFieldCount() <= 1,

Review comment:
       Nit: it would be nice to keep the expression on one line, wrapping after `checkArgument(`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer closed pull request #2167: Fix for Conversion of Parquet ByteArray to Iceberg Schema

Posted by GitBox <gi...@apache.org>.
RussellSpitzer closed pull request #2167:
URL: https://github.com/apache/iceberg/pull/2167


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #2167: Fix for Conversion of Parquet ByteArray to Iceberg Schema

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2167:
URL: https://github.com/apache/iceberg/pull/2167#issuecomment-769513559


   @rdblue I have an additional issue :/
   
   In SparkParquetReaders we call
   ```java
         ColumnDescriptor desc = type.getColumnDescription(currentPath());
   ```
   
   Which fails when traversing with
   
   ```
   Arrived at primitive node, path invalid
   org.apache.parquet.io.InvalidRecordException: Arrived at primitive node, path invalid
   	at org.apache.parquet.schema.PrimitiveType.getMaxRepetitionLevel(PrimitiveType.java:665)
   	at org.apache.parquet.schema.GroupType.getMaxRepetitionLevel(GroupType.java:294)
   	at org.apache.parquet.schema.GroupType.getMaxRepetitionLevel(GroupType.java:294)
   	at org.apache.parquet.schema.MessageType.getMaxRepetitionLevel(MessageType.java:77)
   	at org.apache.parquet.schema.MessageType.getColumnDescription(MessageType.java:94)
   	at org.apache.iceberg.spark.data.SparkParquetReaders$ReadBuilder.primitive(SparkParquetReaders.java:222)
   ```
   
   I'm not sure what the "Desc" is supposed to be here. There is no issue with writing the file or reading the file using the generic Parquet.read() file. But once I use the SparkParquetReader I have the issue.
   
   Do you have any idea's if this can be fixed?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2167: Fix for Conversion of Parquet ByteArray to Iceberg Schema

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2167:
URL: https://github.com/apache/iceberg/pull/2167#issuecomment-768747140


   @rdblue, could you help on this one?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2167: Fix for Conversion of Parquet ByteArray to Iceberg Schema

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2167:
URL: https://github.com/apache/iceberg/pull/2167#discussion_r571341289



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetTypeVisitor.java
##########
@@ -66,17 +66,19 @@
     Preconditions.checkArgument(list.getFieldCount() == 1,
         "Invalid list: does not contain single repeated field: %s", list);
 
-    GroupType repeatedElement = list.getFields().get(0).asGroupType();
+    Type repeatedElement = list.getFields().get(0);
+
     Preconditions.checkArgument(repeatedElement.isRepetition(Type.Repetition.REPEATED),
         "Invalid list: inner group is not repeated");
-    Preconditions.checkArgument(repeatedElement.getFieldCount() <= 1,
-        "Invalid list: repeated group is not a single field: %s", list);
+
+    Preconditions.checkArgument(repeatedElement.isPrimitive() || repeatedElement.asGroupType().getFieldCount() <= 1,

Review comment:
       It would be good to wrap so the condition starts on the next line, since it is long now.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer edited a comment on pull request #2167: Fix for Conversion of Parquet ByteArray to Iceberg Schema

Posted by GitBox <gi...@apache.org>.
RussellSpitzer edited a comment on pull request #2167:
URL: https://github.com/apache/iceberg/pull/2167#issuecomment-769555106


   [malformed_parquet_not_txt.txt](https://github.com/apache/iceberg/files/5891488/malformed_parquet_not_txt.txt)
   
   This is the Parquet file generated by the test, just in case anyone wants to take a look at what we are dealing with here with another framework.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2167: Fix for Conversion of Parquet ByteArray to Iceberg Schema

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2167:
URL: https://github.com/apache/iceberg/pull/2167#issuecomment-775451501


   I've been thinking about this more and I'm leaning toward trying to work around it. I think the problem is that the Parquet/Avro writer uses the old list format by default to avoid breaking existing pipelines. But there should be an easy way to update the behavior to produce records that Iceberg accepts by setting `parquet.avro.write-old-list-structure=false`. That's true by default.
   
   If we can fix it that way, then I think we should go with that. Otherwise, we're implementing only part of the [backward-compatibility rules from Parquet](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists). I'm not sure what the impact would be on compatibility if we only partially implement the rules, so I think the safer thing is to just implement all of the backward-compatibility rules. But that's a bigger change and more to maintain (which is why we don't support the 2-level lists in the first place). So I think the preferred solution is to avoid this instead.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #2167: Fix for Conversion of Parquet ByteArray to Iceberg Schema

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2167:
URL: https://github.com/apache/iceberg/pull/2167#issuecomment-769414479


   @rdblue Just bad enough I hope, I really have no idea what's going on here at a spec level, i'm just trying to match what I see as valid from the internal apis :) 
   
   I'll add in another test


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2167: Fix for Conversion of Parquet ByteArray to Iceberg Schema

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2167:
URL: https://github.com/apache/iceberg/pull/2167#issuecomment-769413615


   This doesn't seem _that_ bad. Can you add a test that writes a Parquet file with that schema and validates that Spark can read it?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer edited a comment on pull request #2167: Fix for Conversion of Parquet ByteArray to Iceberg Schema

Posted by GitBox <gi...@apache.org>.
RussellSpitzer edited a comment on pull request #2167:
URL: https://github.com/apache/iceberg/pull/2167#issuecomment-769513559


   @rdblue I have an additional issue :/
   
   In SparkParquetReaders we call
   ```java
         ColumnDescriptor desc = type.getColumnDescription(currentPath());
   ```
   
   Which fails when traversing with
   
   ```
   Arrived at primitive node, path invalid
   org.apache.parquet.io.InvalidRecordException: Arrived at primitive node, path invalid
   	at org.apache.parquet.schema.PrimitiveType.getMaxRepetitionLevel(PrimitiveType.java:665)
   	at org.apache.parquet.schema.GroupType.getMaxRepetitionLevel(GroupType.java:294)
   	at org.apache.parquet.schema.GroupType.getMaxRepetitionLevel(GroupType.java:294)
   	at org.apache.parquet.schema.MessageType.getMaxRepetitionLevel(MessageType.java:77)
   	at org.apache.parquet.schema.MessageType.getColumnDescription(MessageType.java:94)
   	at org.apache.iceberg.spark.data.SparkParquetReaders$ReadBuilder.primitive(SparkParquetReaders.java:222)
   ```
   
   I'm not sure what the "Desc" is supposed to be here. There is no issue with writing the file or reading the file using the generic Parquet.read() file. But once I use the SparkParquetReader I have the issue.
   
   Do you have any ideas how this can be fixed?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #2167: Fix for Conversion of Parquet ByteArray to Iceberg Schema

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2167:
URL: https://github.com/apache/iceberg/pull/2167#issuecomment-769550250


   Figured out Array<Binary> I may have to write a test for a tope level repeated binary array as well :/ 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #2167: Fix for Conversion of Parquet ByteArray to Iceberg Schema

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2167:
URL: https://github.com/apache/iceberg/pull/2167#issuecomment-768746220


   @aokolnychyi + @rdblue This is the issue I was talking about before


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #2167: Fix for Conversion of Parquet ByteArray to Iceberg Schema

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2167:
URL: https://github.com/apache/iceberg/pull/2167#issuecomment-775408970


   Formatting fixes in, @rdblue Thanks for looking over this, I really appreciate it


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2167: Fix for Conversion of Parquet ByteArray to Iceberg Schema

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2167:
URL: https://github.com/apache/iceberg/pull/2167#discussion_r571341225



##########
File path: spark/src/test/java/org/apache/iceberg/spark/data/TestMalformedParquetFromAvro.java
##########
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.List;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.GenericRecordBuilder;
+import org.apache.hadoop.fs.Path;
+import org.apache.iceberg.Files;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.parquet.Parquet;
+import org.apache.iceberg.parquet.ParquetSchemaUtil;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.parquet.avro.AvroParquetWriter;
+import org.apache.parquet.avro.AvroSchemaConverter;
+import org.apache.parquet.hadoop.ParquetWriter;
+import org.apache.parquet.schema.MessageType;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.junit.Assert;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+public class TestMalformedParquetFromAvro {
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+
+  @Test
+  public void testWriteReadAvroBinary() throws IOException {
+    String schema = "{" +
+        "\"type\":\"record\"," +
+        "\"name\":\"DbRecord\"," +
+        "\"namespace\":\"com.iceberg\"," +
+        "\"fields\":[" +
+          "{\"name\":\"arraybytes\", " +
+            "\"type\":[ \"null\", { \"type\":\"array\", \"items\":\"bytes\"}], \"default\":null}," +
+          "{\"name\":\"topbytes\", \"type\":\"bytes\"}" +
+        "]" +
+        "}";
+
+    org.apache.avro.Schema.Parser parser = new org.apache.avro.Schema.Parser();
+    org.apache.avro.Schema avroSchema = parser.parse(schema);
+    AvroSchemaConverter converter = new AvroSchemaConverter();
+    MessageType parquetScehma = converter.convert(avroSchema);
+    Schema icebergSchema = ParquetSchemaUtil.convert(parquetScehma);
+
+    File testFile = temp.newFile();
+    Assert.assertTrue(testFile.delete());
+
+    ParquetWriter<GenericRecord> writer = AvroParquetWriter.<GenericRecord>builder(new Path(testFile.toURI()))
+        .withDataModel(GenericData.get())
+        .withSchema(avroSchema)
+        .build();
+
+    GenericRecordBuilder recordBuilder = new GenericRecordBuilder(avroSchema);
+    List<ByteBuffer> expectedByteList = new ArrayList();
+    byte[] expectedByte = {0x00, 0x01};
+    expectedByteList.add(ByteBuffer.wrap(expectedByte));
+
+    recordBuilder.set("arraybytes", expectedByteList);
+    recordBuilder.set("topbytes", ByteBuffer.wrap(expectedByte));
+    GenericData.Record record = recordBuilder.build();
+    writer.write(record);
+    writer.close();
+
+    List<InternalRow> rows;
+    try (CloseableIterable<InternalRow> reader =
+             Parquet.read(Files.localInput(testFile))

Review comment:
       Nit: indentation is off.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #2167: Fix for Conversion of Parquet ByteArray to Iceberg Schema

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2167:
URL: https://github.com/apache/iceberg/pull/2167#issuecomment-769555106


   
   [malformed_parquet_not_txt.txt](https://github.com/apache/iceberg/files/5891488/malformed_parquet_not_txt.txt)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org