Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/05/03 14:56:14 UTC

[GitHub] [iceberg] gszadovszky opened a new pull request #2551: Filtering dictionary encoded FIX_LEN_BYTE_ARRAY

gszadovszky opened a new pull request #2551:
URL: https://github.com/apache/iceberg/pull/2551


   Fixes #2546


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2551: Filtering dictionary encoded FIX_LEN_BYTE_ARRAY

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2551:
URL: https://github.com/apache/iceberg/pull/2551#discussion_r635534280



##########
File path: parquet/src/test/java/org/apache/iceberg/parquet/TestDictionaryRowGroupFilter.java
##########
@@ -911,20 +928,31 @@ public void testTypePromotion() {
 
   @Test
   public void testFixedLenByteArray() {
-    // This test is to validate the handling of FIXED_LEN_BYTE_ARRAY Parquet type being dictionary encoded.
+    // This test is to validate the handling of FIXED_LEN_BYTE_ARRAY Parquet type
+    // being dictionary encoded.
     // (No need to validate all the possible predicates)
 
-    Assert.assertTrue("decimal_fixed should be dictionary encoded",
-        getColumnForName(rowGroupMetadata, "_decimal_fixed").getEncodings().contains(Encoding.RLE_DICTIONARY));
-
-    boolean shouldRead = new ParquetDictionaryRowGroupFilter(SCHEMA,
-        greaterThanOrEqual("decimal_fixed", BigDecimal.ZERO)).shouldRead(parquetSchema, rowGroupMetadata,
-            dictionaryStore);
-    Assert.assertTrue("Should read: Half of the decimal_fixed values are greater than 0", shouldRead);
-
-    shouldRead = new ParquetDictionaryRowGroupFilter(SCHEMA, lessThan("decimal_fixed", DECIMAL_MIN_VALUE))
-        .shouldRead(parquetSchema, rowGroupMetadata, dictionaryStore);
-    Assert.assertFalse("Should not read: No decimal_fixed values less than -1234567890.0987654321", shouldRead);
+    switch (writerVersion) {
+      case PARQUET_1_0:

Review comment:
       Normally, we would use `Assume` to simply skip this test for Parquet v1. Would you mind doing that here? I don't think we need to validate that Parquet doesn't dictionary encode.
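       A minimal sketch of what that could look like with JUnit 4's `Assume`, using the `writerVersion` parameter visible in the diff above:
   ```java
   // Requires org.junit.Assume and org.apache.parquet.column.ParquetProperties.WriterVersion.
   @Test
   public void testFixedLenByteArray() {
     // parquet-mr only dictionary-encodes FIXED_LEN_BYTE_ARRAY with the v2 writer,
     // so skip this test for the Parquet v1 writer instead of asserting on it.
     Assume.assumeTrue("Requires the Parquet v2 writer for dictionary-encoded FIXED",
         writerVersion == WriterVersion.PARQUET_2_0);
     // ... dictionary filter assertions on decimal_fixed ...
   }
   ```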






[GitHub] [iceberg] rdblue commented on a change in pull request #2551: Filtering dictionary encoded FIX_LEN_BYTE_ARRAY

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2551:
URL: https://github.com/apache/iceberg/pull/2551#discussion_r627751459



##########
File path: parquet/src/test/java/org/apache/iceberg/parquet/TestDictionaryRowGroupFilter.java
##########
@@ -140,6 +152,7 @@ public void createInputFile() throws IOException {
     OutputFile outFile = Files.localOutput(parquetFile);
     try (FileAppender<Record> appender = Parquet.write(outFile)
         .schema(FILE_SCHEMA)
+        .withWriterVersion(WriterVersion.PARQUET_2_0) // V2 is needed to force dictionary encoding for FIXED type

Review comment:
       Why is this?






[GitHub] [iceberg] gszadovszky commented on pull request #2551: Filtering dictionary encoded FIX_LEN_BYTE_ARRAY

Posted by GitBox <gi...@apache.org>.
gszadovszky commented on pull request #2551:
URL: https://github.com/apache/iceberg/pull/2551#issuecomment-842367150


   I think it is fine to extend the API with package-private methods for testing purposes. It makes the tests simpler and easier to implement. That said, I am open to investing in writing the test without this extension if you think it is required.
   @rdblue, what do you think?




[GitHub] [iceberg] gszadovszky commented on a change in pull request #2551: Filtering dictionary encoded FIX_LEN_BYTE_ARRAY

Posted by GitBox <gi...@apache.org>.
gszadovszky commented on a change in pull request #2551:
URL: https://github.com/apache/iceberg/pull/2551#discussion_r629267854



##########
File path: parquet/src/test/java/org/apache/iceberg/parquet/TestDictionaryRowGroupFilter.java
##########
@@ -140,6 +152,7 @@ public void createInputFile() throws IOException {
     OutputFile outFile = Files.localOutput(parquetFile);
     try (FileAppender<Record> appender = Parquet.write(outFile)
         .schema(FILE_SCHEMA)
+        .withWriterVersion(WriterVersion.PARQUET_2_0) // V2 is needed to force dictionary encoding for FIXED type

Review comment:
       I don't know why we don't use dictionary encoding for FIXED_LEN_BYTE_ARRAY in V1, but that is the current behavior. See [DefaultV1ValuesWriterFactory](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV1ValuesWriterFactory.java#L78-L80). Meanwhile, this behavior is not specified anywhere, so other implementations may write dictionary-encoded FIXED columns, and Impala certainly does.
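       To see the difference in practice, a hedged sketch (the file path is a placeholder) that inspects which encodings a written file actually ended up with; with the parquet-mr v1 writer the fixed-length decimal column shows no dictionary encoding, while the v2 writer reports RLE_DICTIONARY:
   ```java
   import java.util.Set;
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.Path;
   import org.apache.parquet.column.Encoding;
   import org.apache.parquet.hadoop.ParquetFileReader;
   import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
   import org.apache.parquet.hadoop.util.HadoopInputFile;

   public class PrintColumnEncodings {
     public static void main(String[] args) throws Exception {
       // Placeholder path to a file written by either the v1 or the v2 writer.
       Path file = new Path("/tmp/dictionary-test.parquet");
       try (ParquetFileReader reader =
           ParquetFileReader.open(HadoopInputFile.fromPath(file, new Configuration()))) {
         for (ColumnChunkMetaData col : reader.getFooter().getBlocks().get(0).getColumns()) {
           Set<Encoding> encodings = col.getEncodings();
           boolean dictionaryEncoded = encodings.contains(Encoding.RLE_DICTIONARY)
               || encodings.contains(Encoding.PLAIN_DICTIONARY);
           System.out.println(col.getPath() + " -> " + encodings
               + " (dictionary: " + dictionaryEncoded + ")");
         }
       }
     }
   }
   ```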






[GitHub] [iceberg] rdblue commented on pull request #2551: Filtering dictionary encoded FIX_LEN_BYTE_ARRAY

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2551:
URL: https://github.com/apache/iceberg/pull/2551#issuecomment-842503304


   @gszadovszky, I think the package-private method is fine. But we need to make sure that this doesn't stop testing v1 since that's what people primarily use. Could you add v1/v2 as a test parameter so all the tests run on both?
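   A rough sketch, assuming JUnit 4's `Parameterized` runner (the exact constructor and field layout in `TestDictionaryRowGroupFilter` may differ), of how the writer version could become a test parameter:
   ```java
   import java.util.Arrays;
   import java.util.Collection;
   import org.apache.parquet.column.ParquetProperties.WriterVersion;
   import org.junit.runner.RunWith;
   import org.junit.runners.Parameterized;
   import org.junit.runners.Parameterized.Parameters;

   @RunWith(Parameterized.class)
   public class TestDictionaryRowGroupFilter {

     @Parameters(name = "writerVersion = {0}")
     public static Collection<Object[]> parameters() {
       return Arrays.asList(new Object[][] {
           {WriterVersion.PARQUET_1_0},
           {WriterVersion.PARQUET_2_0}
       });
     }

     private final WriterVersion writerVersion;

     public TestDictionaryRowGroupFilter(WriterVersion writerVersion) {
       this.writerVersion = writerVersion;
     }

     // createInputFile() would then pass the parameter through, e.g.
     //   Parquet.write(outFile).schema(FILE_SCHEMA).withWriterVersion(writerVersion)...
   }
   ```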




[GitHub] [iceberg] pvary commented on a change in pull request #2551: Filtering dictionary encoded FIX_LEN_BYTE_ARRAY

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2551:
URL: https://github.com/apache/iceberg/pull/2551#discussion_r633422087



##########
File path: parquet/src/test/java/org/apache/iceberg/parquet/TestDictionaryRowGroupFilter.java
##########
@@ -140,6 +152,7 @@ public void createInputFile() throws IOException {
     OutputFile outFile = Files.localOutput(parquetFile);
     try (FileAppender<Record> appender = Parquet.write(outFile)
         .schema(FILE_SCHEMA)
+        .withWriterVersion(WriterVersion.PARQUET_2_0) // V2 is needed to force dictionary encoding for FIXED type

Review comment:
       @rdblue: Does the Iceberg specification define which writer version we should use?
   
   @gszadovszky: What happens when a V2 writer is used and an old Parquet client tries to read the file?
   
   Thanks,
   Peter






[GitHub] [iceberg] gszadovszky commented on a change in pull request #2551: Filtering dictionary encoded FIX_LEN_BYTE_ARRAY

Posted by GitBox <gi...@apache.org>.
gszadovszky commented on a change in pull request #2551:
URL: https://github.com/apache/iceberg/pull/2551#discussion_r633449947



##########
File path: parquet/src/test/java/org/apache/iceberg/parquet/TestDictionaryRowGroupFilter.java
##########
@@ -140,6 +152,7 @@ public void createInputFile() throws IOException {
     OutputFile outFile = Files.localOutput(parquetFile);
     try (FileAppender<Record> appender = Parquet.write(outFile)
         .schema(FILE_SCHEMA)
+        .withWriterVersion(WriterVersion.PARQUET_2_0) // V2 is needed to force dictionary encoding for FIXED type

Review comment:
       @pvary: V1 and V2 are tricky references because it is not always clear what they cover. From the parquet-mr point of view, the most basic differences are the page headers (v1 or v2) and the encodings used. Since parquet-mr has supported v2 for a while, the concern is mostly not old parquet-mr readers but other Parquet reader implementations. For example, Impala supports neither v2 page headers nor the encodings parquet-mr uses when v2 is selected.
   BUT, using dictionary encoding for fixed data types is not against the Parquet specification for v1, and Impala does it. (It is only parquet-mr that does not write dictionary encoding for fixed types with v1; it can read it, and the filtering also works just fine.)
   
   I hesitated between switching the whole unit test to V2 and implementing a separate test for fixed+dictionary, but this seemed easier and simpler. Also, V1 vs. V2 seems irrelevant to this test (apart from the fact that we cannot test fixed+dictionary with V1).






[GitHub] [iceberg] rdblue merged pull request #2551: Filtering dictionary encoded FIX_LEN_BYTE_ARRAY

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #2551:
URL: https://github.com/apache/iceberg/pull/2551


   






[GitHub] [iceberg] rdblue commented on a change in pull request #2551: Filtering dictionary encoded FIX_LEN_BYTE_ARRAY

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2551:
URL: https://github.com/apache/iceberg/pull/2551#discussion_r633721650



##########
File path: parquet/src/test/java/org/apache/iceberg/parquet/TestDictionaryRowGroupFilter.java
##########
@@ -140,6 +152,7 @@ public void createInputFile() throws IOException {
     OutputFile outFile = Files.localOutput(parquetFile);
     try (FileAppender<Record> appender = Parquet.write(outFile)
         .schema(FILE_SCHEMA)
+        .withWriterVersion(WriterVersion.PARQUET_2_0) // V2 is needed to force dictionary encoding for FIXED type

Review comment:
       @pvary, I'm assuming that people mostly don't use Parquet v2 because it was never finalized (to my knowledge). I think there's an ongoing effort in the Parquet community to clean up some of the docs, but I've not seen a vote to adopt any of the v2 encodings as official.








[GitHub] [iceberg] gszadovszky commented on pull request #2551: Filtering dictionary encoded FIX_LEN_BYTE_ARRAY

Posted by GitBox <gi...@apache.org>.
gszadovszky commented on pull request #2551:
URL: https://github.com/apache/iceberg/pull/2551#issuecomment-842960211


   Thanks @rdblue. I thought v1/v2 was irrelevant for this test, but I agree it is better to be on the safe side, and adding the writer version as a test parameter is a good idea. I'll make this change soon.




[GitHub] [iceberg] gszadovszky commented on a change in pull request #2551: Filtering dictionary encoded FIX_LEN_BYTE_ARRAY

Posted by GitBox <gi...@apache.org>.
gszadovszky commented on a change in pull request #2551:
URL: https://github.com/apache/iceberg/pull/2551#discussion_r625148745



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetDictionaryRowGroupFilter.java
##########
@@ -406,6 +406,8 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
 
       for (int i = 0; i <= dict.getMaxId(); i++) {
         switch (col.getPrimitiveType().getPrimitiveTypeName()) {
+          case FIXED_LEN_BYTE_ARRAY: dictSet.add((T) conversion.apply(dict.decodeToBinary(i)));
+            break;

Review comment:
       I wanted to write the following, but Checkstyle failed on it because of indentation. I guess it is/was a [Checkstyle bug](https://github.com/checkstyle/checkstyle/issues/341). I don't have experience with Gradle, so I am not sure how to fix this in the Iceberg configuration; probably we need to upgrade Checkstyle somehow.
   
   Anyway, both the current form and this alternative are correct, but I would prefer the latter.
   ```suggestion
             case FIXED_LEN_BYTE_ARRAY:  // Same as BINARY
   ```
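   For context, a sketch of the two equivalent forms being discussed, using the identifiers from the diff above (the surrounding loop and the other cases are abridged): the dedicated case with its own statement and `break`, versus the preferred fall-through into the `BINARY` branch:
   ```java
   // Form currently in the PR: a dedicated case with its own statement and break.
   case FIXED_LEN_BYTE_ARRAY: dictSet.add((T) conversion.apply(dict.decodeToBinary(i)));
     break;
   case BINARY:
     dictSet.add((T) conversion.apply(dict.decodeToBinary(i)));
     break;

   // Preferred form: fall through so FIXED_LEN_BYTE_ARRAY shares the BINARY branch.
   case FIXED_LEN_BYTE_ARRAY:  // Same as BINARY
   case BINARY:
     dictSet.add((T) conversion.apply(dict.decodeToBinary(i)));
     break;
   ```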






[GitHub] [iceberg] rdblue commented on pull request #2551: Filtering dictionary encoded FIX_LEN_BYTE_ARRAY

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2551:
URL: https://github.com/apache/iceberg/pull/2551#issuecomment-845417000


   Thanks, @gszadovszky!



