You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "tustvold (via GitHub)" <gi...@apache.org> on 2023/05/31 12:48:14 UTC

[GitHub] [arrow-rs] tustvold opened a new pull request, #4319: Handle trailing padding when skipping repetition levels (#3911)

tustvold opened a new pull request, #4319:
URL: https://github.com/apache/arrow-rs/pull/4319

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #3911
   
   # Rationale for this change
    
   <!--
   Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
   Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
   -->
   
   The length of [RLE encoded](https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3) data is ambiguous if the final block is bit packed. As such the decoder might think it has more level data than it actually has when skipping repetition levels, which conceivably could cause the decoder to get out of sync 
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   
   No, these APIs are not public
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #4319: Handle trailing padding when skipping repetition levels (#3911)

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #4319:
URL: https://github.com/apache/arrow-rs/pull/4319#discussion_r1213042654


##########
parquet/src/column/reader/decoder.rs:
##########
@@ -473,17 +485,39 @@ mod tests {
     }
 
     #[test]
-    fn test_skip() {
-        let mut rng = thread_rng();
-        let total_len = 10000;
-        let encoded: Vec<i16> = (0..total_len).map(|_| rng.gen_range(0..5)).collect();
-        let mut encoder = RleEncoder::new(3, 1024);
-        for v in &encoded {
-            encoder.put(*v as _)
-        }
+    fn test_skip_padding() {
+        let mut encoder = RleEncoder::new(1, 1024);
+        encoder.put(0);
+        (0..3).for_each(|_| encoder.put(1));
         let data = ByteBufferPtr::new(encoder.consume());
 
+        let mut decoder = ColumnLevelDecoderImpl::new(1);
+        decoder.set_data(Encoding::RLE, data.clone());
+        let (records, levels) = decoder.skip_rep_levels(100, 4).unwrap();
+        assert_eq!(records, 1);
+        assert_eq!(levels, 4);
+
+        // The length of the final bit packed run is ambiguous, so without the correct
+        // levels limit, it will decode zero padding
+        let mut decoder = ColumnLevelDecoderImpl::new(1);
+        decoder.set_data(Encoding::RLE, data);
+        let (records, levels) = decoder.skip_rep_levels(100, 6).unwrap();
+        assert_eq!(records, 3);
+        assert_eq!(levels, 6);
+    }
+
+    #[test]
+    fn test_skip() {
         for _ in 0..10 {
+            let mut rng = thread_rng();

Review Comment:
   it might make sense to add a note to the test giving the rationale for the loop (and maybe name the test 'test_ski-_fuzz` or something



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #4319: Handle trailing padding when skipping repetition levels (#3911)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #4319:
URL: https://github.com/apache/arrow-rs/pull/4319#discussion_r1212317953


##########
parquet/src/column/reader/decoder.rs:
##########
@@ -473,17 +485,39 @@ mod tests {
     }
 
     #[test]
-    fn test_skip() {
-        let mut rng = thread_rng();
-        let total_len = 10000;
-        let encoded: Vec<i16> = (0..total_len).map(|_| rng.gen_range(0..5)).collect();
-        let mut encoder = RleEncoder::new(3, 1024);
-        for v in &encoded {
-            encoder.put(*v as _)
-        }
+    fn test_skip_padding() {
+        let mut encoder = RleEncoder::new(1, 1024);
+        encoder.put(0);
+        (0..3).for_each(|_| encoder.put(1));
         let data = ByteBufferPtr::new(encoder.consume());
 
+        let mut decoder = ColumnLevelDecoderImpl::new(1);
+        decoder.set_data(Encoding::RLE, data.clone());
+        let (records, levels) = decoder.skip_rep_levels(100, 4).unwrap();
+        assert_eq!(records, 1);
+        assert_eq!(levels, 4);
+
+        // The length of the final bit packed run is ambiguous, so without the correct
+        // levels limit, it will decode zero padding
+        let mut decoder = ColumnLevelDecoderImpl::new(1);
+        decoder.set_data(Encoding::RLE, data);
+        let (records, levels) = decoder.skip_rep_levels(100, 6).unwrap();
+        assert_eq!(records, 3);
+        assert_eq!(levels, 6);
+    }
+
+    #[test]
+    fn test_skip() {
         for _ in 0..10 {
+            let mut rng = thread_rng();

Review Comment:
   Moving this into the for loop makes it more likely to encounter issues inherent to the RLE data, as opposed to the selection of it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on pull request #4319: Handle trailing padding when skipping repetition levels (#3911)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on PR #4319:
URL: https://github.com/apache/arrow-rs/pull/4319#issuecomment-1570665490

   https://github.com/apache/arrow-rs/pull/4326/files#diff-850b3a44587149637b8545f66603a2b1252959fd36f7ddc55f37d6b5357816c6R2526 adds a further test of this


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold merged pull request #4319: Handle trailing padding when skipping repetition levels (#3911)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold merged PR #4319:
URL: https://github.com/apache/arrow-rs/pull/4319


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org