You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/13 17:31:00 UTC

[GitHub] [arrow] zeroshade opened a new pull request, #14407: ARROW-18031: [C++][Parquet] Undefined behavior in bool RLE decoder

zeroshade opened a new pull request, #14407:
URL: https://github.com/apache/arrow/pull/14407

   For `BitReader::GetAligned` we need to specialize the behavior for `bool` so that we don't get undefined behavior on potential bad input. Found by the fuzzer.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #14407: ARROW-18031: [C++][Parquet] Undefined behavior in bool RLE decoder

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #14407:
URL: https://github.com/apache/arrow/pull/14407#issuecomment-1277953917

   The problem here is: how is a RLE boolean literal supposed to be represented? What do other implementations do? cc @sfc-gh-nthimmegowda 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on pull request #14407: ARROW-18031: [C++][Parquet] Undefined behavior in bool RLE decoder

Posted by GitBox <gi...@apache.org>.
zeroshade commented on PR #14407:
URL: https://github.com/apache/arrow/pull/14407#issuecomment-1277957810

   Given the way it's described in the Parquet spec, it looks like it should be a single bit 0 or 1, padded to a single byte, hence just checking that LSB and setting the bool value to that. Of course if others do it differently, we should conform.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #14407: ARROW-18031: [C++][Parquet] Undefined behavior in bool RLE decoder

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #14407:
URL: https://github.com/apache/arrow/pull/14407#issuecomment-1279681724

   ['Python', 'R'] benchmarks have high level of regressions.
   [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/16b380a7fb484dd592ef7a7e7eba8216...abcca0b9268648c180953aafe755c9cd/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #14407: ARROW-18031: [C++][Parquet] Undefined behavior in bool RLE decoder

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #14407:
URL: https://github.com/apache/arrow/pull/14407#issuecomment-1277961528

   Also, should first open a PR to arrow-testing to add the fuzz regression file.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #14407: ARROW-18031: [C++][Parquet] Undefined behavior in bool RLE decoder

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #14407:
URL: https://github.com/apache/arrow/pull/14407#issuecomment-1278041474

   You now need to update the testing submodule here :-)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou merged pull request #14407: ARROW-18031: [C++][Parquet] Undefined behavior in bool RLE decoder

Posted by GitBox <gi...@apache.org>.
kou merged PR #14407:
URL: https://github.com/apache/arrow/pull/14407


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] sfc-gh-nthimmegowda commented on pull request #14407: ARROW-18031: [C++][Parquet] Undefined behavior in bool RLE decoder

Posted by GitBox <gi...@apache.org>.
sfc-gh-nthimmegowda commented on PR #14407:
URL: https://github.com/apache/arrow/pull/14407#issuecomment-1278094061

   Thank you @zeroshade for the fix. I agree with your explanation. 
   
   @pitrou 
   For reference this is how[ java implementation decodes](https://github.com/apache/parquet-mr/blob/5608695f5777de1eb0899d9075ec9411cfdf31d3/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java#L87). 
   
   At the end the type does not matter because everything is casted to int 64 in [Link](https://github.com/apache/arrow/blob/38b956f4da174c1e72607159c9a98f9dbacf6f75/cpp/src/arrow/util/rle_encoding.h#L667) . This is an unfortunate side effect of reusing functions, but I do not see a better way
   
   I disagree with how we are relying on type to calculate the byte size, but unfortunatel I do not see a better way for this. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on pull request #14407: ARROW-18031: [C++][Parquet] Undefined behavior in bool RLE decoder

Posted by GitBox <gi...@apache.org>.
zeroshade commented on PR #14407:
URL: https://github.com/apache/arrow/pull/14407#issuecomment-1277965425

   The error was in handling of a repeated value, which in the spec (that you linked) shows: 
   
   `repeated-value := value that is repeated, using a fixed-width of round-up-to-next-byte(bit-width)`
   
   Hence take the bit-width value (0 or 1 for bit-width == 1) and then pad it to round-up-to-next-byte(1) == 1 byte.
   
   At least that's my read on it. Since you provided the Fuzz file, do you want to add it or should I open the PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #14407: ARROW-18031: [C++][Parquet] Undefined behavior in bool RLE decoder

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #14407:
URL: https://github.com/apache/arrow/pull/14407#issuecomment-1277960873

   Where did you find that? Judging by https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3 , it doesn't seem very obvious to me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #14407: ARROW-18031: [C++][Parquet] Undefined behavior in bool RLE decoder

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #14407:
URL: https://github.com/apache/arrow/pull/14407#issuecomment-1279681650

   Benchmark runs are scheduled for baseline = 20806db3da3fbf52a3b79f641c13b4211893f60e and contender = 49e6d2773f0c97e90f30a696eb665f15acca20f4. 49e6d2773f0c97e90f30a696eb665f15acca20f4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/eaab856b8055422788d138b07451d4cb...390ba2f8b9434c8e9861f8731047865c/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/9a024cac67024d4b987ec1ef4aa25534...842dc6d4f9fc4e05a58bbf796e23de31/)
   [Failed :arrow_down:3.29% :arrow_up:3.56%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/16b380a7fb484dd592ef7a7e7eba8216...abcca0b9268648c180953aafe755c9cd/)
   [Finished :arrow_down:0.07% :arrow_up:0.07%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/5ebcbcfae7a147f4bfc992c67ae35dfb...ac74188bfee4450d946a678dfd75c71a/)
   Buildkite builds:
   [Finished] [`49e6d277` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1695)
   [Failed] [`49e6d277` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1714)
   [Failed] [`49e6d277` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1697)
   [Finished] [`49e6d277` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1708)
   [Finished] [`20806db3` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1694)
   [Failed] [`20806db3` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1713)
   [Failed] [`20806db3` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1696)
   [Finished] [`20806db3` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1707)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org