You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2022/02/15 13:50:59 UTC

[GitHub] [drill] jnturton opened a new pull request #2460: DRILL-8134: Cannot query Parquet INT96 columns as timestamps

jnturton opened a new pull request #2460:
URL: https://github.com/apache/drill/pull/2460


   # [DRILL-8134](https://issues.apache.org/jira/browse/DRILL-8134): Cannot query Parquet INT96 columns as timestamps
   
   ## Description
   
   As of Drill 1.19 some Parquet readers column contained code to position the value vector write buffer index after a read pass, while some did not.  The Parquet v2 PR added write buffer positioning to the cases that were missing it, but failed to cater for the fact that INT96 timestamps are downcast to 64 bit timestamps.  This PR removes all of this write buffer positioning (and mispositioning) since testing indicates that Drill's value vector write paths advance the write buffer index to correct place already.
    
   ## Documentation
   N/A
   
   ## Testing
   ParquetTestWriter#testSparkParquetBinaryAsTimeStamp_DictChange
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] jnturton commented on pull request #2460: DRILL-8134: Cannot query Parquet INT96 columns as timestamps

Posted by GitBox <gi...@apache.org>.
jnturton commented on pull request #2460:
URL: https://github.com/apache/drill/pull/2460#issuecomment-1042669734


   @vdiravka, I've moved the DRILL-8139 fix out to a separate PR.  I've also changed the unit test that exposes DRILL-8138 and left a TODO there for since it is not a release blocker and can be solved after 1.20.  Please review again...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] vvysotskyi commented on a change in pull request #2460: DRILL-8134: Cannot query Parquet INT96 columns as timestamps

Posted by GitBox <gi...@apache.org>.
vvysotskyi commented on a change in pull request #2460:
URL: https://github.com/apache/drill/pull/2460#discussion_r808223918



##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/compression/DrillCompressionCodecFactory.java
##########
@@ -86,6 +99,9 @@ public BytesInputCompressor getCompressor(CompressionCodecName codecName) {
           codecName,
           c -> new AirliftBytesInputCompressor(codecName, allocator)
       );
+    } else if (codecName == CompressionCodecName.GZIP) {
+      // hack for gzip: construct a new codec factory every time to avoid a concurrrency bug c.f. DRILL-8139
+      return CodecFactory.createDirectCodecFactory(config, allocator, pageSize).getCompressor(codecName);

Review comment:
       If I recall correctly, the codec factory should manage the full lifecycle of compressors / decompressors, release created codecs when calling its `CompressionCodecFactory.release()` method, and so on. But with this change, it will be impossible to do that.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] vdiravka commented on pull request #2460: DRILL-8134: Cannot query Parquet INT96 columns as timestamps

Posted by GitBox <gi...@apache.org>.
vdiravka commented on pull request #2460:
URL: https://github.com/apache/drill/pull/2460#issuecomment-1040442325


   I like PRs with code removal :) They certainly mean the code starts to be simpler and cleaner.
   The changes looks good to me.
   
   Just please fix setting `system` options in `TestParquetWriter`#`compareParquetReadersColumnar` method. It should be `session` level. Then `testSparkParquetBinaryAsTimeStamp_DictChange` and `testTPCHReadWriteLz4` will show DRILL-8134 bug


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] vdiravka commented on pull request #2460: DRILL-8134: Cannot query Parquet INT96 columns as timestamps

Posted by GitBox <gi...@apache.org>.
vdiravka commented on pull request #2460:
URL: https://github.com/apache/drill/pull/2460#issuecomment-1040442325


   I like PRs with code removal :) They certainly mean the code starts to be simpler and cleaner.
   The changes looks good to me.
   
   Just please fix setting `system` options in `TestParquetWriter`#`compareParquetReadersColumnar` method. It should be `session` level. Then `testSparkParquetBinaryAsTimeStamp_DictChange` and `testTPCHReadWriteLz4` will show DRILL-8134 bug


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] jnturton merged pull request #2460: DRILL-8134: Cannot query Parquet INT96 columns as timestamps

Posted by GitBox <gi...@apache.org>.
jnturton merged pull request #2460:
URL: https://github.com/apache/drill/pull/2460


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org