You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2021/05/23 00:51:35 UTC

[GitHub] [drill] vvysotskyi opened a new pull request #2232: DRILL-7919: Fix reading parquet with decimal dictionary encoding

vvysotskyi opened a new pull request #2232:
URL: https://github.com/apache/drill/pull/2232


   # [DRILL-7919](https://issues.apache.org/jira/browse/DRILL-7919): Fix reading parquet with decimal dictionary encoding
   
   ## Description
   Cleaned up some code related to choosing readers and combined logic related to nullable readers.
   Implemented logic for reading decimals with dictionary encoding stored as fixed_len_byte_array or binary.
   
   ## Documentation
   NA
   
   ## Testing
   Added UT.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [drill] cgivre merged pull request #2232: DRILL-7919: Fix reading parquet with decimal dictionary encoding

Posted by GitBox <gi...@apache.org>.

cgivre merged pull request #2232:
URL: https://github.com/apache/drill/pull/2232


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [drill] vvysotskyi commented on a change in pull request #2232: DRILL-7919: Fix reading parquet with decimal dictionary encoding

Posted by GitBox <gi...@apache.org>.

vvysotskyi commented on a change in pull request #2232:
URL: https://github.com/apache/drill/pull/2232#discussion_r640144770



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/NullableFixedByteAlignedReaders.java
##########
@@ -318,6 +318,27 @@ protected void readField(long recordsToReadInThisPass) {
             }
           }
           break;
+        case FIXED_LEN_BYTE_ARRAY:
+        case BINARY:
+          if (usingDictionary) {

Review comment:
       Not only this change but also some changes in `ColumnReaderFactory`. Initially one of those methods returned the wrong reader for `FIXED_LEN_BYTE_ARRAY` decimals with a dictionary encoding. But I have made additional cleanup to combine the logic of choosing the reader a little bit.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [drill] vdiravka commented on a change in pull request #2232: DRILL-7919: Fix reading parquet with decimal dictionary encoding

Posted by GitBox <gi...@apache.org>.

vdiravka commented on a change in pull request #2232:
URL: https://github.com/apache/drill/pull/2232#discussion_r640131705



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/NullableFixedByteAlignedReaders.java
##########
@@ -318,6 +318,27 @@ protected void readField(long recordsToReadInThisPass) {
             }
           }
           break;
+        case FIXED_LEN_BYTE_ARRAY:
+        case BINARY:
+          if (usingDictionary) {
+            NullableVarDecimalVector.Mutator mutator = valueVec.getMutator();
+            for (int i = 0; i < recordsReadInThisIteration; i++) {
+              Binary currDictValToWrite = pageReader.dictionaryValueReader.readBytes();
+              mutator.setSafe(valuesReadInCurrentPass + i, currDictValToWrite.toByteBuffer().slice(), 0,
+                  currDictValToWrite.length());
+            }
+            // Set the write Index. The next page that gets read might be a page that does not use dictionary encoding
+            // and we will go into the else condition below. The readField method of the parent class requires the
+            // writer index to be set correctly.
+            int writerIndex = valueVec.getBuffer().writerIndex();
+            valueVec.getBuffer().setIndex(0, writerIndex + (int) readLength);
+          } else {
+            for (int i = 0; i < recordsToReadInThisPass; i++) {
+              Binary valueToWrite = pageReader.valueReader.readBytes();
+              valueVec.getMutator().setSafe(valuesReadInCurrentPass + i, valueToWrite.toByteBuffer().slice(), 0,

Review comment:
       Originally it was `valueVec.getMutator().setSafe(index, 1, start, start + dataTypeLengthInBytes, bytebuf);`. Why it is different now?

##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/NullableFixedByteAlignedReaders.java
##########
@@ -318,6 +318,27 @@ protected void readField(long recordsToReadInThisPass) {
             }
           }
           break;
+        case FIXED_LEN_BYTE_ARRAY:
+        case BINARY:
+          if (usingDictionary) {

Review comment:
       Am I right, this is the main code to solve the ticket issue and all other changes from this PR are just refactoring?

##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ColumnReaderFactory.java
##########
@@ -252,91 +222,110 @@
     }
   }
 
-  public static NullableColumnReader<?> getNullableColumnReader(ParquetRecordReader parentReader,
+  public static ColumnReader<?> getNullableColumnReader(ParquetRecordReader parentReader,

Review comment:
       The method is names as `getNullableColumnReader`, so why the returned type is not `NullableColumnReader` anymore?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [drill] vvysotskyi commented on a change in pull request #2232: DRILL-7919: Fix reading parquet with decimal dictionary encoding

Posted by GitBox <gi...@apache.org>.

vvysotskyi commented on a change in pull request #2232:
URL: https://github.com/apache/drill/pull/2232#discussion_r640146168



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/NullableFixedByteAlignedReaders.java
##########
@@ -318,6 +318,27 @@ protected void readField(long recordsToReadInThisPass) {
             }
           }
           break;
+        case FIXED_LEN_BYTE_ARRAY:
+        case BINARY:
+          if (usingDictionary) {
+            NullableVarDecimalVector.Mutator mutator = valueVec.getMutator();
+            for (int i = 0; i < recordsReadInThisIteration; i++) {
+              Binary currDictValToWrite = pageReader.dictionaryValueReader.readBytes();
+              mutator.setSafe(valuesReadInCurrentPass + i, currDictValToWrite.toByteBuffer().slice(), 0,
+                  currDictValToWrite.length());
+            }
+            // Set the write Index. The next page that gets read might be a page that does not use dictionary encoding
+            // and we will go into the else condition below. The readField method of the parent class requires the
+            // writer index to be set correctly.
+            int writerIndex = valueVec.getBuffer().writerIndex();
+            valueVec.getBuffer().setIndex(0, writerIndex + (int) readLength);
+          } else {
+            for (int i = 0; i < recordsToReadInThisPass; i++) {
+              Binary valueToWrite = pageReader.valueReader.readBytes();
+              valueVec.getMutator().setSafe(valuesReadInCurrentPass + i, valueToWrite.toByteBuffer().slice(), 0,

Review comment:
       Because it was a different reader. This class extends `NullableColumnReader`, but that one was `NullableConvertedReader`, so it uses a slightly different approach for calculating required indexes.

##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ColumnReaderFactory.java
##########
@@ -252,91 +222,110 @@
     }
   }
 
-  public static NullableColumnReader<?> getNullableColumnReader(ParquetRecordReader parentReader,
+  public static ColumnReader<?> getNullableColumnReader(ParquetRecordReader parentReader,

Review comment:
       It is because of the workaround for `NullableBitReader`, that actually extends `ColumnReader` and doesn't work properly with `NullableColumnReader` because of its internal representation. Initially, I also tried to extend `NullableBitReader` from `NullableColumnReader`, but it creates a lot of issues and is actually out of the scope for this PR.

##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/NullableFixedByteAlignedReaders.java
##########
@@ -318,6 +318,27 @@ protected void readField(long recordsToReadInThisPass) {
             }
           }
           break;
+        case FIXED_LEN_BYTE_ARRAY:
+        case BINARY:
+          if (usingDictionary) {

Review comment:
       Not only this change but also some changes in `ColumnReaderFactory`. Initially one of those methods returned the wrong reader for decimals with a dictionary encoding. But I have made additional cleanup to combine the logic of choosing the reader a little bit.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org