You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/07 05:01:28 UTC

[GitHub] [arrow-rs] tustvold opened a new pull request, #3036: Fix decoding long and/or padded RLE data (#3029) (#3035)

tustvold opened a new pull request, #3036:
URL: https://github.com/apache/arrow-rs/pull/3036

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #3029 
   Closes #3035
   
   # Rationale for this change
    
   <!--
   Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
   Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
   -->
   
   See tickets
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   Fixes two bugs around decoding RLE data that prevent handling `https://s3.amazonaws.com/anaconda-package-data/conda/monthly/2022/2022-05.parquet`
   
   This appears to be a file written by fastparquet, which might explain why it has a somewhat different interpretation of the parquet specification as some other writers.
   
   # Are there any user-facing changes?
   
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] ursabot commented on pull request #3036: Fix decoding long and/or padded RLE data (#3029) (#3035)

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #3036:
URL: https://github.com/apache/arrow-rs/pull/3036#issuecomment-1306333392

   Benchmark runs are scheduled for baseline = b7bc79bf2cbf593fafa0dc552cc2bb16b084d132 and contender = 879b461af8c1259d48fbb1bc67d50fa2a38bea68. 879b461af8c1259d48fbb1bc67d50fa2a38bea68 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/797e584d96f54b058589c68fcc520344...06777e24daec44759b014f8c79db98e5/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/397fc0cca59b478bb300695dcd9b21a8...ffdcfe01cf9c4ff4b28d63f8657c5895/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/03e29050372543898443f9f238bc27a9...7fb62b70bb5045c28c27742ca7751c20/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/6e4e5698d7b446dd94212e52935fa1ab...8d6aa5bcd9bf47728637cf9c5b90e4b2/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3036: Fix decoding long and/or padded RLE data (#3029) (#3035)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #3036:
URL: https://github.com/apache/arrow-rs/pull/3036#discussion_r1015006451


##########
parquet/src/encodings/rle.rs:
##########
@@ -473,13 +473,18 @@ impl RleDecoder {
                 let bit_reader =
                     self.bit_reader.as_mut().expect("bit_reader should be set");
 
-                let mut num_values =
-                    cmp::min(max_values - values_read, self.bit_packed_left as usize);
-
-                num_values = cmp::min(num_values, index_buf.len());
                 loop {
-                    num_values = bit_reader.get_batch::<i32>(
-                        &mut index_buf[..num_values],
+                    let to_read = index_buf

Review Comment:
   This is the fix for #3029 



##########
parquet/src/encodings/rle.rs:
##########
@@ -509,6 +514,12 @@ impl RleDecoder {
         let bit_reader = self.bit_reader.as_mut().expect("bit_reader should be set");
 
         if let Some(indicator_value) = bit_reader.get_vlq_int() {
+            // fastparquet adds padding to the end of pages. This is not spec-compliant

Review Comment:
   This is the fix for #3035 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3036: Fix decoding long and/or padded RLE data (#3029) (#3035)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #3036:
URL: https://github.com/apache/arrow-rs/pull/3036#discussion_r1015006451


##########
parquet/src/encodings/rle.rs:
##########
@@ -473,13 +473,18 @@ impl RleDecoder {
                 let bit_reader =
                     self.bit_reader.as_mut().expect("bit_reader should be set");
 
-                let mut num_values =
-                    cmp::min(max_values - values_read, self.bit_packed_left as usize);
-
-                num_values = cmp::min(num_values, index_buf.len());
                 loop {
-                    num_values = bit_reader.get_batch::<i32>(
-                        &mut index_buf[..num_values],
+                    let to_read = index_buf

Review Comment:
   This is the fix for #3029 
   
   Effectively for runs longer than 1024 it would loop round again and attempt to read a further 1024 values. This could lead to incorrect data and/or panics



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] wolfv commented on pull request #3036: Fix decoding long and/or padded RLE data (#3029) (#3035)

Posted by GitBox <gi...@apache.org>.
wolfv commented on PR #3036:
URL: https://github.com/apache/arrow-rs/pull/3036#issuecomment-1305192116

   Awesome, thanks for tackling this!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3036: Fix decoding long and/or padded RLE data (#3029) (#3035)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #3036:
URL: https://github.com/apache/arrow-rs/pull/3036#discussion_r1015009172


##########
parquet/src/encodings/rle.rs:
##########
@@ -509,6 +514,12 @@ impl RleDecoder {
         let bit_reader = self.bit_reader.as_mut().expect("bit_reader should be set");
 
         if let Some(indicator_value) = bit_reader.get_vlq_int() {
+            // fastparquet adds padding to the end of pages. This is not spec-compliant

Review Comment:
   Looking at the fastparquet code, the bitpacking logic appears like it would pad to 4-byte boundaries as a consequence of it using a 32-bit accumulator
   
   https://github.com/dask/fastparquet/blob/main/fastparquet/cencoding.pyx#L277



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold merged pull request #3036: Fix decoding long and/or padded RLE data (#3029) (#3035)

Posted by GitBox <gi...@apache.org>.
tustvold merged PR #3036:
URL: https://github.com/apache/arrow-rs/pull/3036


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] wolfv commented on pull request #3036: Fix decoding long and/or padded RLE data (#3029) (#3035)

Posted by GitBox <gi...@apache.org>.
wolfv commented on PR #3036:
URL: https://github.com/apache/arrow-rs/pull/3036#issuecomment-1305236649

   I tested this locally for our project and it appears to work fine :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3036: Fix decoding long and/or padded RLE data (#3029) (#3035)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #3036:
URL: https://github.com/apache/arrow-rs/pull/3036#discussion_r1015006451


##########
parquet/src/encodings/rle.rs:
##########
@@ -473,13 +473,18 @@ impl RleDecoder {
                 let bit_reader =
                     self.bit_reader.as_mut().expect("bit_reader should be set");
 
-                let mut num_values =
-                    cmp::min(max_values - values_read, self.bit_packed_left as usize);
-
-                num_values = cmp::min(num_values, index_buf.len());
                 loop {
-                    num_values = bit_reader.get_batch::<i32>(
-                        &mut index_buf[..num_values],
+                    let to_read = index_buf

Review Comment:
   This is the fix for #3029 
   
   Effectively for runs longer than 1024 it would loop round again and attempt to read a further 1024 values. This could lead to panics



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org