You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/13 17:31:00 UTC
[GitHub] [arrow] zeroshade opened a new pull request, #14407: ARROW-18031: [C++][Parquet] Undefined behavior in bool RLE decoder
zeroshade opened a new pull request, #14407:
URL: https://github.com/apache/arrow/pull/14407
For `BitReader::GetAligned` we need to specialize the behavior for `bool` so that we don't get undefined behavior on potential bad input. Found by the fuzzer.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #14407: ARROW-18031: [C++][Parquet] Undefined behavior in bool RLE decoder
Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #14407:
URL: https://github.com/apache/arrow/pull/14407#issuecomment-1277953917
The problem here is: how is a RLE boolean literal supposed to be represented? What do other implementations do? cc @sfc-gh-nthimmegowda
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] zeroshade commented on pull request #14407: ARROW-18031: [C++][Parquet] Undefined behavior in bool RLE decoder
Posted by GitBox <gi...@apache.org>.
zeroshade commented on PR #14407:
URL: https://github.com/apache/arrow/pull/14407#issuecomment-1277957810
Given the way it's described in the Parquet spec, it looks like it should be a single bit 0 or 1, padded to a single byte, hence just checking that LSB and setting the bool value to that. Of course if others do it differently, we should conform.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] ursabot commented on pull request #14407: ARROW-18031: [C++][Parquet] Undefined behavior in bool RLE decoder
Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #14407:
URL: https://github.com/apache/arrow/pull/14407#issuecomment-1279681724
['Python', 'R'] benchmarks have high level of regressions.
[ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/16b380a7fb484dd592ef7a7e7eba8216...abcca0b9268648c180953aafe755c9cd/)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #14407: ARROW-18031: [C++][Parquet] Undefined behavior in bool RLE decoder
Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #14407:
URL: https://github.com/apache/arrow/pull/14407#issuecomment-1277961528
Also, should first open a PR to arrow-testing to add the fuzz regression file.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #14407: ARROW-18031: [C++][Parquet] Undefined behavior in bool RLE decoder
Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #14407:
URL: https://github.com/apache/arrow/pull/14407#issuecomment-1278041474
You now need to update the testing submodule here :-)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] kou merged pull request #14407: ARROW-18031: [C++][Parquet] Undefined behavior in bool RLE decoder
Posted by GitBox <gi...@apache.org>.
kou merged PR #14407:
URL: https://github.com/apache/arrow/pull/14407
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] sfc-gh-nthimmegowda commented on pull request #14407: ARROW-18031: [C++][Parquet] Undefined behavior in bool RLE decoder
Posted by GitBox <gi...@apache.org>.
sfc-gh-nthimmegowda commented on PR #14407:
URL: https://github.com/apache/arrow/pull/14407#issuecomment-1278094061
Thank you @zeroshade for the fix. I agree with your explanation.
@pitrou
For reference this is how[ java implementation decodes](https://github.com/apache/parquet-mr/blob/5608695f5777de1eb0899d9075ec9411cfdf31d3/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java#L87).
At the end the type does not matter because everything is casted to int 64 in [Link](https://github.com/apache/arrow/blob/38b956f4da174c1e72607159c9a98f9dbacf6f75/cpp/src/arrow/util/rle_encoding.h#L667) . This is an unfortunate side effect of reusing functions, but I do not see a better way
I disagree with how we are relying on type to calculate the byte size, but unfortunatel I do not see a better way for this.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] zeroshade commented on pull request #14407: ARROW-18031: [C++][Parquet] Undefined behavior in bool RLE decoder
Posted by GitBox <gi...@apache.org>.
zeroshade commented on PR #14407:
URL: https://github.com/apache/arrow/pull/14407#issuecomment-1277965425
The error was in handling of a repeated value, which in the spec (that you linked) shows:
`repeated-value := value that is repeated, using a fixed-width of round-up-to-next-byte(bit-width)`
Hence take the bit-width value (0 or 1 for bit-width == 1) and then pad it to round-up-to-next-byte(1) == 1 byte.
At least that's my read on it. Since you provided the Fuzz file, do you want to add it or should I open the PR?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #14407: ARROW-18031: [C++][Parquet] Undefined behavior in bool RLE decoder
Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #14407:
URL: https://github.com/apache/arrow/pull/14407#issuecomment-1277960873
Where did you find that? Judging by https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3 , it doesn't seem very obvious to me.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] ursabot commented on pull request #14407: ARROW-18031: [C++][Parquet] Undefined behavior in bool RLE decoder
Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #14407:
URL: https://github.com/apache/arrow/pull/14407#issuecomment-1279681650
Benchmark runs are scheduled for baseline = 20806db3da3fbf52a3b79f641c13b4211893f60e and contender = 49e6d2773f0c97e90f30a696eb665f15acca20f4. 49e6d2773f0c97e90f30a696eb665f15acca20f4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/eaab856b8055422788d138b07451d4cb...390ba2f8b9434c8e9861f8731047865c/)
[Failed :arrow_down:0.0% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/9a024cac67024d4b987ec1ef4aa25534...842dc6d4f9fc4e05a58bbf796e23de31/)
[Failed :arrow_down:3.29% :arrow_up:3.56%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/16b380a7fb484dd592ef7a7e7eba8216...abcca0b9268648c180953aafe755c9cd/)
[Finished :arrow_down:0.07% :arrow_up:0.07%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/5ebcbcfae7a147f4bfc992c67ae35dfb...ac74188bfee4450d946a678dfd75c71a/)
Buildkite builds:
[Finished] [`49e6d277` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1695)
[Failed] [`49e6d277` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1714)
[Failed] [`49e6d277` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1697)
[Finished] [`49e6d277` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1708)
[Finished] [`20806db3` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1694)
[Failed] [`20806db3` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1713)
[Failed] [`20806db3` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1696)
[Finished] [`20806db3` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1707)
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org