Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/01 22:59:50 UTC

[GitHub] [arrow-rs] yordan-pavlov commented on issue #1111: ArrowArrayReader Reads Too Many Values From Bit-Packed Runs

yordan-pavlov commented on issue #1111:
URL: https://github.com/apache/arrow-rs/issues/1111#issuecomment-1003631709


   UPDATE: for the short-term fix, the only option I can think of is, when def levels are present, to count the number of actual (non-null) values here https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L393 before creating the value reader, and to pass that count to the value reader instead of num_values.
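   
   As a rough illustration of the counting step only (not the actual change in arrow_array_reader.rs), here is a minimal sketch. It assumes the definition levels for the page have already been decoded into a slice of i16, and the name `count_non_null_values` is hypothetical. It relies on the Parquet rule that a value is physically stored in the page only when its definition level equals the column's maximum definition level:

```rust
/// Hypothetical helper, not part of arrow-rs: counts how many entries in a
/// page actually have a stored value, given already-decoded definition levels.
/// An entry has a stored value only when its definition level equals the
/// column's maximum definition level; anything lower is a null at some level.
fn count_non_null_values(def_levels: &[i16], max_def_level: i16) -> usize {
    def_levels
        .iter()
        .filter(|&&level| level == max_def_level)
        .count()
}
```

   For a page whose header reports num_values = 8 with def levels [1, 0, 1, 1, 0, 0, 1, 0] and a max def level of 1, this returns 4, which is the count the value decoder would be given instead of 8.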
   
   With this change the new test (using dictionary-encoded pages) passes. Notice how, in the test output below, the value of num_values in the `VariableLenDictionaryDecoder` is the actual number of values rather than a count that includes null values:
   
   running 1 test
   page num_values: 100, values.len(): 25
   page num_values: 100, values.len(): 31
   VariableLenPlainDecoder::new, num_values: 10
   ---------- reading a batch of 50 values ----------
   VariableLenDictionaryDecoder::new, num_values: 25
   VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 25, num_values: 11
   VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 11, self.num_values: 14 
   ---------- reading a batch of 100 values ----------
   VariableLenPlainDecoder::new, num_values: 10
   VariableLenDictionaryDecoder::new, num_values: 31
   VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 14, num_values: 31
   VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 14, self.num_values: 0  
   VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 0, num_values: 17 
   VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 0, self.num_values: 0   
   VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 31, num_values: 17
   VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 17, self.num_values: 14 
   ---------- reading a batch of 100 values ----------
   VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 14, num_values: 14
   VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 14, self.num_values: 0
   test arrow::arrow_array_reader::tests::test_arrow_array_reader_dict_string ... ok
   
   test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 471 filtered out; finished in 0.01s
   
   
   Tomorrow I will check the impact on performance and possibly create a pull request for the new test plus the short-term fix.

