You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@orc.apache.org by GitBox <gi...@apache.org> on 2020/12/22 04:06:17 UTC

[GitHub] [orc] chaoyli opened a new pull request #586: Fix RLE encoding bug on large negative integer.

chaoyli opened a new pull request #586:
URL: https://github.com/apache/orc/pull/586


   ORC has use RLE to encoding/decoding integer.
   Four types are comprised of the RLE encoding/decoding algorithm.
   Short Repeat : used for short repeating integer sequences.
   Direct : used for integer sequences whose values have a relatively constant bit width.
   Patched Base : used for integer sequences whose bit widths varies a lot.
   Delta : used for monotonically increasing or decreasing sequences.
   
   This bug occurs in Patched Base Type for large negative number.
   In patched base, base value is stored 1 to 8 bytes and encoding to 0 ~ 7.
   If the base value is 8 byte, the encoding value for base width should be 7.
   But now will encoding to 8, this is problem.
   It will result in inconsistent data with loaded data because wrong encoding procedure.
   In extreme case, the process will be cored dump because illegal address.
   
   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. File a JIRA issue first and use it as a prefix of your PR title, e.g., `ORC-001: Fix ABC`.
     2. Use your PR title to summarize what this PR proposes instead of describing the problem.
     3. Make PR title and description complete because these will be the permanent commit log.
     4. If possible, provide a concise and reproducible example to reproduce the issue for a faster review.
     5. If the PR is unfinished, use GitHub PR Draft feature.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] chaoyli edited a comment on pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
chaoyli edited a comment on pull request #586:
URL: https://github.com/apache/orc/pull/586#issuecomment-753990154


   @pgaref OK, I will check it. I have a question to consult. Although I have not construct a case about large positive integer, you can ensure that large positive integer will be not encoded as PatchedBase?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] chaoyli commented on a change in pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
chaoyli commented on a change in pull request #586:
URL: https://github.com/apache/orc/pull/586#discussion_r547612574



##########
File path: c++/src/RleEncoderV2.cc
##########
@@ -577,7 +577,10 @@ void RleEncoderV2::writePatchedBasedValues(EncodingOption& option) {
     // find the number of bytes required for base and shift it by 5 bits
     // to accommodate patch width. The additional bit is used to store the sign
     // of the base value.
-    const uint32_t baseWidth = findClosestNumBits(option.min) + 1;
+    uint32_t baseWidth = findClosestNumBits(option.min) + 1;

Review comment:
       I will change it to skip the sign bit addition when the width is already 64 bits




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dirtysalt commented on a change in pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
dirtysalt commented on a change in pull request #586:
URL: https://github.com/apache/orc/pull/586#discussion_r608296123



##########
File path: java/core/src/java/org/apache/orc/impl/RunLengthIntegerWriterV2.java
##########
@@ -530,7 +531,7 @@ private void determineEncoding() {
       // fallback to DIRECT encoding.
       // The decision to use patched base was based on zigzag values, but the
       // actual patching is done on base reduced literals.
-      if ((brBits100p - brBits95p) != 0 && Math.abs(min) < BASE_VALUE_LIMIT) {
+      if ((brBits100p - brBits95p) != 0) {

Review comment:
       Thanks. Look like this is the correct way to resolve the problem instead of this workaround. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] chaoyli commented on pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
chaoyli commented on pull request #586:
URL: https://github.com/apache/orc/pull/586#issuecomment-792411118


   @pgaref I have test RleV2_Patched_max1 many times in linux, It's all OK. But in MSVC, it'a failed. I copy the data from MVSC to  reproduce the bug, it's also OK. I have a question, in MSVC, the alignBitpacking in TestRleEncoder.cc will be changed and make the case failed. It's may be late,  sorry, I have many jobs in the previous two months.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] chaoyli commented on pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
chaoyli commented on pull request #586:
URL: https://github.com/apache/orc/pull/586#issuecomment-753717390


   @pgaref Thank you to check the RLE path. I have read the patch code for [ORC-616](https://issues.apache.org/jira/browse/ORC-616)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dirtysalt commented on a change in pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
dirtysalt commented on a change in pull request #586:
URL: https://github.com/apache/orc/pull/586#discussion_r608296123



##########
File path: java/core/src/java/org/apache/orc/impl/RunLengthIntegerWriterV2.java
##########
@@ -530,7 +531,7 @@ private void determineEncoding() {
       // fallback to DIRECT encoding.
       // The decision to use patched base was based on zigzag values, but the
       // actual patching is done on base reduced literals.
-      if ((brBits100p - brBits95p) != 0 && Math.abs(min) < BASE_VALUE_LIMIT) {
+      if ((brBits100p - brBits95p) != 0) {

Review comment:
       Thanks. Look like this is the correct way to resolve the problem instead of this workaround. w




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] chaoyli edited a comment on pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
chaoyli edited a comment on pull request #586:
URL: https://github.com/apache/orc/pull/586#issuecomment-753717390


   @pgaref Thank you to check the RLE path. I have read the patch code for [ORC-616](https://issues.apache.org/jira/browse/ORC-616). I found it use DIRECT code instead. The solution in #586 may be better for encoding. Can I revert the [ORC-616](https://issues.apache.org/jira/browse/ORC-616) and change it to my solution. Could you agree it? If you agree it, I will write the code to update the pull request.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pgaref edited a comment on pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
pgaref edited a comment on pull request #586:
URL: https://github.com/apache/orc/pull/586#issuecomment-752990616


   Apologies for the slight delay but this got more complicated than I expected.
   While testing the newly added [Java test](https://github.com/apache/orc/pull/586/files#diff-8f851a9e2ba9c8fd9f357a9df6482df5c264fdfdbb0e7625e9b2fe71ca9366a7R1226) I realised that the values were encoded using **DIRECT** encoding instead of the expected **PATCHED_BASE**.
   
   The reason for picking **DIRECT** encoding was that some of the values exceed the [**BASE_VALUE_LIMIT**](https://github.com/apache/orc/blob/f2201f396cdd0c4e5b15589ec499e477235a553c/java/core/src/java/org/apache/orc/impl/RunLengthIntegerWriterV2.java#L537) that was introduced as part of ORC-616 (setting an upper limit to 56 bits numbers).
   
   More precisely, **RunLengthIntegerWriterV2** checks if Math.abs(min) -- where min is the min input number of the sequence -- is greater than or equal to (1 << 56) bits -- to make sure the value of baseBytes is never above 8.  Seems that this check was never introduced to the Cpp side of things.
   
   That said, I would recommend to port the above logic to the Cpp library to keep things consistent.
   @chaoyli  please let me know if you want to make the change and again thanks for the patience :) 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] chaoyli commented on a change in pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
chaoyli commented on a change in pull request #586:
URL: https://github.com/apache/orc/pull/586#discussion_r547618763



##########
File path: c++/test/TestRleEncoder.cc
##########
@@ -280,6 +280,33 @@ namespace orc {
     decodeAndVerify(RleVersion_2, memStream, data, numValues, nullptr, isSigned);
   }
 
+  // This case is used to testing encoding large negative integer.

Review comment:
       I have test large positive number, but found it will not use PatchedBase encoding, so it's OK.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] chaoyli edited a comment on pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
chaoyli edited a comment on pull request #586:
URL: https://github.com/apache/orc/pull/586#issuecomment-753717390


   @pgaref Thank you to check the RLE path. I have read the patch code for [ORC-616](https://issues.apache.org/jira/browse/ORC-616). I found it use DIRECT code instead. The solution in #586 may be better for encoding. Do I can revert the [ORC-616](https://issues.apache.org/jira/browse/ORC-616) and change it to my solution. Could you agree it? If you agree it, I will write the code to update the pull request.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pgaref commented on pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #586:
URL: https://github.com/apache/orc/pull/586#issuecomment-754073505


   > @pgaref OK, I will check it. I have a question to consult. Although I have not construct a case about large positive integer, you can ensure that large position integer will be encoded as PatchedBase?
   
   [testPatchedBaseMax4](https://github.com/apache/orc/blob/60c9e9757ffb766edaaaeb7bed9c565a0316612d/java/core/src/test/org/apache/orc/TestNewIntegerEncoding.java#L1168) should already take care of that but sure we can add a more specific test too -- lets make sure these tests are replicated on the Cpp side as well 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] chaoyli commented on pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
chaoyli commented on pull request #586:
URL: https://github.com/apache/orc/pull/586#issuecomment-753990154


   @pgaref OK, I will check it. I have a question to consult. Although I have not construct a case about large positive integer, you can ensure that large position integer will be encoded as PatchedBase?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] chaoyli commented on pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
chaoyli commented on pull request #586:
URL: https://github.com/apache/orc/pull/586#issuecomment-754333405


   @pgaref OK, I will check the code and copy the testPatchedBaseMax4 to Cpp side. I will submit the following pull request in the next few days. It will be trouble you to review at that time.
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pgaref commented on pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #586:
URL: https://github.com/apache/orc/pull/586#issuecomment-752990616


   Apologies for the slight delay but this got more complicated than I expected.
   While testing the newly added [Java test](https://github.com/apache/orc/pull/586/files#diff-8f851a9e2ba9c8fd9f357a9df6482df5c264fdfdbb0e7625e9b2fe71ca9366a7R1226) I realised that the values were encoded using DIRECT encoding instead of the expected PATCHED_BASE.
   
   The reason for picking DIRECT encoding was that some of the values exceed the [BASE_VALUE_LIMIT](https://github.com/apache/orc/blob/f2201f396cdd0c4e5b15589ec499e477235a553c/java/core/src/java/org/apache/orc/impl/RunLengthIntegerWriterV2.java#L537) that was introduced as part of ORC-616 (setting an upper limit to 56 bits numbers).
   
   More precisely, **RunLengthIntegerWriterV2** checks if Math.abs(min) -- where min is the min input number of the sequence -- is greater than or equal to (1 << 56) bits -- to make sure the value of baseBytes is never above 8.  Seems that this check was never introduced to the Cpp side of things.
   
   That said, I would recommend to port the above logic to the Cpp library to keep things consistent.
   @chaoyli  please let me know if you want to make the change and again thanks for the patience :) 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] chaoyli edited a comment on pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
chaoyli edited a comment on pull request #586:
URL: https://github.com/apache/orc/pull/586#issuecomment-753990154


   @pgaref OK, I will check it. I have a question to consult. Although I have not construct a case about large positive integer, you can ensure that large positive integer will be encoded as PatchedBase?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pgaref commented on pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #586:
URL: https://github.com/apache/orc/pull/586#issuecomment-753857936


   > @pgaref Thank you to check the RLE path. I have read the patch code for [ORC-616](https://issues.apache.org/jira/browse/ORC-616). I found it use DIRECT code instead. The solution in #586 may be better for encoding. Can I revert the [ORC-616](https://issues.apache.org/jira/browse/ORC-616) and change it to my solution? Could you agree it? If you agree it, I will write the code to update the pull request.
   
   Hey @chaoyli -- definitely agree that ORC-616 is not ideal in terms of encoding. 
   Can you also check #601 and my [comment above?](https://github.com/apache/orc/pull/586#issuecomment-753592692) before updating the PR -- I believe that checking for negative numbers is sufficient here.
   
   Thanks for working on this :) 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] chaoyli commented on a change in pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
chaoyli commented on a change in pull request #586:
URL: https://github.com/apache/orc/pull/586#discussion_r547611235



##########
File path: c++/src/RleEncoderV2.cc
##########
@@ -577,7 +577,10 @@ void RleEncoderV2::writePatchedBasedValues(EncodingOption& option) {
     // find the number of bytes required for base and shift it by 5 bits
     // to accommodate patch width. The additional bit is used to store the sign
     // of the base value.
-    const uint32_t baseWidth = findClosestNumBits(option.min) + 1;
+    uint32_t baseWidth = findClosestNumBits(option.min) + 1;

Review comment:
       Change getClosestFixedBits() logic will affect another code paths.
   Because the sign is only needed in PatchBase encoding.
   So change the PatchBase encoding will be better.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pgaref commented on pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #586:
URL: https://github.com/apache/orc/pull/586#issuecomment-792677721


   > @pgaref I have test RleV2_Patched_max1 many times in linux, It's all OK. But in MSVC, it'a failed. I copy the data from MVSC to reproduce the bug, it's also OK. I have a question, in MSVC, the alignBitpacking in TestRleEncoder.cc will be changed and make the case failed. It's may be late, sorry, I have many jobs in the previous two months.
   
   Hey @chaoyli  np at all -- I know @omalley had some comments about this (as part of #601) maybe he can share more here.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dirtysalt commented on a change in pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
dirtysalt commented on a change in pull request #586:
URL: https://github.com/apache/orc/pull/586#discussion_r608297071



##########
File path: java/core/src/java/org/apache/orc/impl/RunLengthIntegerWriterV2.java
##########
@@ -286,7 +285,9 @@ private void writePatchedBaseValues() throws IOException {
     // find the number of bytes required for base and shift it by 5 bits
     // to accommodate patch width. The additional bit is used to store the sign
     // of the base value.
-    final int baseWidth = utils.findClosestNumBits(min) + 1;
+    int baseBitsNum = utils.findClosestNumBits(min);

Review comment:
       The root cause here is because we add 1 here, so `baseWidth` ranges from 2 to 65,  and `baseBytes` ranges from 0 to 8 which does not fit into 3 bits. 

##########
File path: java/core/src/java/org/apache/orc/impl/RunLengthIntegerWriterV2.java
##########
@@ -286,7 +285,9 @@ private void writePatchedBaseValues() throws IOException {
     // find the number of bytes required for base and shift it by 5 bits
     // to accommodate patch width. The additional bit is used to store the sign
     // of the base value.
-    final int baseWidth = utils.findClosestNumBits(min) + 1;
+    int baseBitsNum = utils.findClosestNumBits(min);

Review comment:
       The root cause here is because we add 1 here, so `baseWidth` ranges from 2 to 65,  and `baseBytes` ranges from 0 to 8 which does not fit into 3 bits.  Thanks for your fixing.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] chaoyli commented on pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
chaoyli commented on pull request #586:
URL: https://github.com/apache/orc/pull/586#issuecomment-759145613


   > np at all, we all have our own distractions.
   
   OK


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #586:
URL: https://github.com/apache/orc/pull/586#issuecomment-749728496


   Thank you, @chaoyli !


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pgaref edited a comment on pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
pgaref edited a comment on pull request #586:
URL: https://github.com/apache/orc/pull/586#issuecomment-752990616


   Apologies for the slight delay but this got more complicated than I expected.
   While testing the newly added [Java test](https://github.com/apache/orc/pull/586/files#diff-8f851a9e2ba9c8fd9f357a9df6482df5c264fdfdbb0e7625e9b2fe71ca9366a7R1226) I realised that the values were encoded using **DIRECT** encoding instead of the expected **PATCHED_BASE**.
   
   The reason for picking **DIRECT** encoding was that some of the values exceed the [**BASE_VALUE_LIMIT**](https://github.com/apache/orc/blob/f2201f396cdd0c4e5b15589ec499e477235a553c/java/core/src/java/org/apache/orc/impl/RunLengthIntegerWriterV2.java#L537) that was introduced as part of ORC-616 (setting an upper limit to 56 bits numbers).
   
   More precisely, **RunLengthIntegerWriterV2** checks if Math.abs(min) -- where min is the min input number of the sequence -- is greater than or equal to (1 << 56) bits -- to make sure the value of baseBytes is never above 8.  Seems that this check was never introduced to the [Cpp side of things](https://github.com/apache/orc/blob/f2201f396cdd0c4e5b15589ec499e477235a553c/c%2B%2B/src/RleEncoderV2.cc#L419).
   
   That said, I would recommend to port the above logic to the Cpp library to keep things consistent.
   @chaoyli  please let me know if you want to make the change and again thanks for the patience :) 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] chaoyli commented on a change in pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
chaoyli commented on a change in pull request #586:
URL: https://github.com/apache/orc/pull/586#discussion_r547616091



##########
File path: java/core/src/java/org/apache/orc/impl/RunLengthIntegerWriterV2.java
##########
@@ -286,7 +286,10 @@ private void writePatchedBaseValues() throws IOException {
     // find the number of bytes required for base and shift it by 5 bits
     // to accommodate patch width. The additional bit is used to store the sign
     // of the base value.
-    final int baseWidth = utils.findClosestNumBits(min) + 1;
+    int baseWidth = utils.findClosestNumBits(min) + 1;

Review comment:
       OK, I add this java tests




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] chaoyli commented on pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
chaoyli commented on pull request #586:
URL: https://github.com/apache/orc/pull/586#issuecomment-757721990


   @pgaref Sorry, I have delayed this patch for days. I introduced the PatchedBase unit test to Cpp side. Thank you to review it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] chaoyli edited a comment on pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
chaoyli edited a comment on pull request #586:
URL: https://github.com/apache/orc/pull/586#issuecomment-753717390


   @pgaref  Thank you to check the RLE path. I have read the patch code for [ORC-616](https://issues.apache.org/jira/browse/ORC-616). I found it use DIRECT code instead. The solution in #586 may be better for encoding. Can I revert the [ORC-616](https://issues.apache.org/jira/browse/ORC-616) and change it to my solution? Could you agree it? If you agree it, I will write the code to update the pull request.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] chaoyli commented on a change in pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
chaoyli commented on a change in pull request #586:
URL: https://github.com/apache/orc/pull/586#discussion_r547618824



##########
File path: java/core/src/java/org/apache/orc/impl/RunLengthIntegerWriterV2.java
##########
@@ -286,7 +286,10 @@ private void writePatchedBaseValues() throws IOException {
     // find the number of bytes required for base and shift it by 5 bits
     // to accommodate patch width. The additional bit is used to store the sign
     // of the base value.
-    final int baseWidth = utils.findClosestNumBits(min) + 1;
+    int baseWidth = utils.findClosestNumBits(min) + 1;

Review comment:
       OK, I add java test into the pull request.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pgaref commented on pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #586:
URL: https://github.com/apache/orc/pull/586#issuecomment-753592692


   Btw, seems like the ORC-616 FIX is actually missing PATCH encoding opportunities with the LIMIT check (and using the most wasteful DIRECT instead) https://github.com/apache/orc/pull/601/files#diff-c76000edca5c4182015a83c604c01b032d4a97e31c0050c0e3ce0ccd195f284fR536
   
   In reality, the only thing we need to check is if the number to be encoded is negative (already including the sign bit) and avoid accounting for the extra sign bit. https://github.com/apache/orc/pull/601/files#diff-c76000edca5c4182015a83c604c01b032d4a97e31c0050c0e3ce0ccd195f284fR291
   
   Created a proof of concept for the java side of things as part of #601  -- please let me know what you think


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pgaref commented on a change in pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #586:
URL: https://github.com/apache/orc/pull/586#discussion_r547507745



##########
File path: c++/test/TestRleEncoder.cc
##########
@@ -280,6 +280,33 @@ namespace orc {
     decodeAndVerify(RleVersion_2, memStream, data, numValues, nullptr, isSigned);
   }
 
+  // This case is used to testing encoding large negative integer.

Review comment:
       Is there a case that this would be an issue for positive numbers as well?
   
   Can we also add some checks for the expected encoded values?

##########
File path: c++/src/RleEncoderV2.cc
##########
@@ -577,7 +577,10 @@ void RleEncoderV2::writePatchedBasedValues(EncodingOption& option) {
     // find the number of bytes required for base and shift it by 5 bits
     // to accommodate patch width. The additional bit is used to store the sign
     // of the base value.
-    const uint32_t baseWidth = findClosestNumBits(option.min) + 1;
+    uint32_t baseWidth = findClosestNumBits(option.min) + 1;

Review comment:
       Seems like getClosestFixedBits method can indeed return up to 64 bits and adding thee sign bit here would exceed 8 bytes -- would it make sense to add the logic as part of the underlying method?
   
   An alternative would be to skip the sign bit addition when the width is already 64 bits

##########
File path: java/core/src/java/org/apache/orc/impl/RunLengthIntegerWriterV2.java
##########
@@ -286,7 +286,10 @@ private void writePatchedBaseValues() throws IOException {
     // find the number of bytes required for base and shift it by 5 bits
     // to accommodate patch width. The additional bit is used to store the sign
     // of the base value.
-    final int baseWidth = utils.findClosestNumBits(min) + 1;
+    int baseWidth = utils.findClosestNumBits(min) + 1;

Review comment:
       Same comment applies here (as the C++ version) -- I would try to avoid the extra if statement.
   
   Can we please add some java tests as well?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] chaoyli commented on pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
chaoyli commented on pull request #586:
URL: https://github.com/apache/orc/pull/586#issuecomment-794719033


   @omalley can you share some message here?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #586:
URL: https://github.com/apache/orc/pull/586#issuecomment-749815081


   Also, cc @omalley 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] chaoyli commented on pull request #586: Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
chaoyli commented on pull request #586:
URL: https://github.com/apache/orc/pull/586#issuecomment-749326736


   The following data can be used to reproduced the problem.
   ```
   -17887939293638656
   -15605417571528704
   -15605417571528704
   -13322895849418752
   -13322895849418752
   -84742859065569280
   -15605417571528704
   -13322895849418752
   -13322895849418752
   -15605417571528704
   -13322895849418752
   -13322895849418752
   -15605417571528704
   -15605417571528704
   -13322895849418752
   -13322895849418752
   -15605417571528704
   -15605417571528704
   -13322895849418752
   -13322895849418752
   -11040374127308800
   -15605417571528704
   -13322895849418752
   -13322895849418752
   -15605417571528704
   -15605417571528704
   -13322895849418752
   -13322895849418752
   -15605417571528704
   -13322895849418752
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pgaref commented on pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #586:
URL: https://github.com/apache/orc/pull/586#issuecomment-751976940


   > Could you review this once more, @pgaref ?
   
   Sure, wanted to check the RLE path more thoroughly before approving the fix -- will do that within the day.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] chaoyli commented on a change in pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
chaoyli commented on a change in pull request #586:
URL: https://github.com/apache/orc/pull/586#discussion_r547616091



##########
File path: java/core/src/java/org/apache/orc/impl/RunLengthIntegerWriterV2.java
##########
@@ -286,7 +286,10 @@ private void writePatchedBaseValues() throws IOException {
     // find the number of bytes required for base and shift it by 5 bits
     // to accommodate patch width. The additional bit is used to store the sign
     // of the base value.
-    final int baseWidth = utils.findClosestNumBits(min) + 1;
+    int baseWidth = utils.findClosestNumBits(min) + 1;

Review comment:
       OK, I add this java tests




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] pgaref commented on pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
pgaref commented on pull request #586:
URL: https://github.com/apache/orc/pull/586#issuecomment-758574909


   > @pgaref Sorry, I have delayed this patch for days. I introduced the PatchedBase unit test to Cpp side. Thank you to review it.
   
   Hey @chaoyli  -- np at all, we all have our own distractions. 
   PR is on the right path, can you please check the failing tests on the Cpp side?
   I would also keep the **testBaseValueLimit** on the Java side even thought you have a similar test case -- I would just rename to something like testMaxPatchBaseValue.
   
   Thanks


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [orc] dongjoon-hyun commented on pull request #586: ORC-703 : Fix RLE encoding bug on large negative integer.

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #586:
URL: https://github.com/apache/orc/pull/586#issuecomment-751915520


   Could you review this once more, @pgaref ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org