Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/08/28 00:38:12 UTC

[GitHub] [iceberg] rdblue commented on a change in pull request #1388: [Parquet Vectorized Reads] Fix reading of files with mix of dictionary and non-dictionary encoded row groups

rdblue commented on a change in pull request #1388:
URL: https://github.com/apache/iceberg/pull/1388#discussion_r478768220



##########
File path: spark/src/test/java/org/apache/iceberg/spark/data/parquet/vectorized/TestParquetDictionaryEncodedVectorizedReads.java
##########
@@ -39,4 +51,48 @@
   public void testVectorizedReadsWithNewContainers() throws IOException {
 
   }
+
+  @Test
+  public void testMixedDictionaryNonDictionaryReads() throws IOException {
+    Schema schema = new Schema(SUPPORTED_PRIMITIVES.fields());
+
+    File dictionaryEncodedFile = temp.newFile();
+    Assert.assertTrue("Delete should succeed", dictionaryEncodedFile.delete());
+    Iterable<GenericData.Record> dictionaryEncodableData = RandomData.generateDictionaryEncodableData(
+        schema,
+        10000,
+        0L,
+        RandomData.DEFAULT_NULL_PERCENTAGE);
+    try (FileAppender<GenericData.Record> writer = getParquetWriter(schema, dictionaryEncodedFile)) {
+      writer.addAll(dictionaryEncodableData);
+    }
+
+    File plainEncodingFile = temp.newFile();
+    Assert.assertTrue("Delete should succeed", plainEncodingFile.delete());
+    Iterable<GenericData.Record> nonDictionaryData = RandomData.generate(schema, 10000, 0L,
+        RandomData.DEFAULT_NULL_PERCENTAGE);
+    try (FileAppender<GenericData.Record> writer = getParquetWriter(schema, plainEncodingFile)) {
+      writer.addAll(nonDictionaryData);
+    }
+
+    File mixedFile = temp.newFile();
+    Assert.assertTrue("Delete should succeed", mixedFile.delete());
+    OutputFile outputFile = Files.localOutput(mixedFile);
+    int rowGroupSize = Integer.parseInt(PARQUET_ROW_GROUP_SIZE_BYTES_DEFAULT);
+    ParquetFileWriter writer = new ParquetFileWriter(

Review comment:
       What about adding a `Parquet.concat` util method? I don't think it is a good idea to make `ParquetIO` public just for this test case. But it would be nice to have a `concat` method somewhere that could concatenate Parquet files.
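A `Parquet.concat` utility along the lines suggested above might look like the following. This is only an illustrative sketch, not Iceberg's actual API: the class name `ParquetConcat`, the method signature, and the row-group/padding sizes are all hypothetical, and it assumes parquet-hadoop's `ParquetFileWriter.appendFile(InputFile)` is available to copy row groups between files without re-encoding.

```java
import java.io.IOException;
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.hadoop.util.HadoopOutputFile;
import org.apache.parquet.schema.MessageType;

// Hypothetical concat utility; names and defaults are illustrative only.
public class ParquetConcat {
  private static final long ROW_GROUP_SIZE = 128 * 1024 * 1024;  // assumed default
  private static final int MAX_PADDING = 8 * 1024 * 1024;        // assumed default

  /**
   * Concatenates the row groups of the source files into a single target file.
   * All sources are assumed to share the given schema; row groups are copied
   * as-is, so dictionary- and plain-encoded groups can be mixed in the result.
   */
  public static void concat(MessageType schema, Path target, Configuration conf, Path... sources)
      throws IOException {
    ParquetFileWriter writer = new ParquetFileWriter(
        HadoopOutputFile.fromPath(target, conf), schema,
        ParquetFileWriter.Mode.CREATE, ROW_GROUP_SIZE, MAX_PADDING);
    writer.start();
    for (Path source : sources) {
      // appendFile copies each row group's pages verbatim, preserving
      // whatever encoding (dictionary or plain) each group already uses.
      writer.appendFile(HadoopInputFile.fromPath(source, conf));
    }
    writer.end(Collections.emptyMap());
  }
}
```

With a helper like this, the test above could build its mixed file by concatenating the dictionary-encoded and plain-encoded files, without making `ParquetIO` public.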




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org