Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/30 16:42:51 UTC

[GitHub] [arrow-rs] tustvold opened a new issue #1111: ArrowArrayReader Incorrect Data

tustvold opened a new issue #1111:
URL: https://github.com/apache/arrow-rs/issues/1111


   **Describe the bug**
   
   Originally reported in https://github.com/apache/arrow-datafusion/issues/1441 and encountered again in #1110, `ParquetFileArrowReader` appears to read incorrect data for string columns that contain nulls.
   
   In particular, the conditions required are that the column is nullable, actually contains nulls, and spans multiple row groups.
   
   **To Reproduce**
   
   Read [simple_strings.parquet.zip](https://github.com/apache/arrow-rs/files/7793788/simple_strings.parquet.zip) with the following code
   
   ```rust
   // Assumes the `arrow` and `parquet` crates are available as dependencies;
   // the original snippet was pasted from inside the parquet crate's test module.
   use std::convert::TryFrom;
   use std::fs::File;
   use std::sync::Arc;

   use arrow::array::StringArray;
   use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
   use parquet::file::reader::SerializedFileReader;

   #[test]
   fn test_read_strings() {
       let testdata = arrow::util::test_util::parquet_test_data();
       let path = format!("{}/simple_strings.parquet", testdata);
       let parquet_file_reader =
           SerializedFileReader::try_from(File::open(&path).unwrap()).unwrap();
       let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(parquet_file_reader));
       let record_batch_reader = arrow_reader
           .get_record_reader(60)
           .expect("Failed to read into array!");

       let batches = record_batch_reader
           .collect::<arrow::error::Result<Vec<_>>>()
           .unwrap();

       assert_eq!(batches.len(), 1);
       let batch = batches.into_iter().next().unwrap();
       assert_eq!(batch.num_rows(), 6);

       let strings = batch
           .column(0)
           .as_any()
           .downcast_ref::<StringArray>()
           .unwrap();

       let strings: Vec<_> = strings.iter().collect();

       assert_eq!(
           &strings,
           &[
               None,
               Some("-1685637712"),
               Some("512814980"),
               Some("868743207"),
               None,
               Some("-1001940778")
           ]
       )
   }
   ```
   
   Fails with
   
   ```
   thread 'arrow::arrow_reader::tests::test_read_strings' panicked at 'assertion failed: `(left == right)`
     left: `[None, Some("-1685637712"), Some("512814980"), Some("-1685637712"), None, Some("868743207")]`,
    right: `[None, Some("-1685637712"), Some("512814980"), Some("868743207"), None, Some("-1001940778")]`', parquet/src/arrow/arrow_reader.rs:715:9
   ```
   
   For comparison
   
   ```
   $ python
   >>> import duckdb
   >>> duckdb.query("select * from 'simple_strings.parquet'").fetchall()
   [(None,), ('-1685637712',), ('512814980',), ('868743207',), (None,), ('-1001940778',)]
   ```
   
   **Expected behavior**
   
   The test should pass
   


[GitHub] [arrow-rs] tustvold edited a comment on issue #1111: ArrowArrayReader Reads Too Many Values From Bit-Packed Runs

tustvold edited a comment on issue #1111:
URL: https://github.com/apache/arrow-rs/issues/1111#issuecomment-1003153596


   So I'm not sure there is an easy way to fix this... `ArrowArrayReader` flattens all the pages from all the column chunks into iterators and then feeds these to `CompositeValueDecoder`, which decodes the levels and values independently. This makes it a non-trivial change to decode the levels and their corresponding values from a given page in lock-step, which I believe is necessary in order to read the correct number of values.
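
   For illustration, here is a minimal sketch of what lock-step decoding of a single page might look like, using hypothetical stand-in types (`PageData`, `decode_page`) rather than the real crate internals; the key point is that the value decoder is only ever asked for as many values as the page's definition levels account for:

   ```rust
   /// Hypothetical stand-in for a decoded data page: one definition level per
   /// value slot (nulls included), but only the non-null values are encoded.
   struct PageData {
       def_levels: Vec<i16>,
       values: Vec<String>,
   }

   /// Decode one page in lock-step: read the definition levels first, then
   /// exactly as many values as the levels say are non-null.
   fn decode_page(page: &PageData, max_def_level: i16) -> Vec<Option<String>> {
       let non_null = page
           .def_levels
           .iter()
           .filter(|&&l| l == max_def_level)
           .count();
       // Never ask for `def_levels.len()` values - that counts nulls which
       // were never encoded, and over-reading shifts every later value.
       let mut values = page.values[..non_null].iter();
       page.def_levels
           .iter()
           .map(|&l| {
               if l == max_def_level {
                   values.next().cloned()
               } else {
                   None
               }
           })
           .collect()
   }
   ```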
   
   Rather than spending time re-working `ArrowArrayReader` in order to fix this bug, I'm **personally** going to focus on getting #1082 and the PRs it builds on polished up. This provides an alternative implementation for reading byte arrays that builds on the existing `ColumnReaderImpl` and `RecordReader` logic and so, much like `PrimitiveArrayReader`, does not run into this bug. My hope is that by being both faster and duplicating less code, it will make sense to swap out `ArrowArrayReader`, and therefore fix this bug for anything not using `ArrowArrayReader` explicitly.
   
   If someone else wishes to work on fixing `ArrowArrayReader` that would be brilliant, but I'm going to focus my efforts elsewhere.
   
   FYI @yordan-pavlov @alamb 
   
   Edit: In the short term, switching back to `ComplexObjectArrayReader` does fix the bug, but represents a non-trivial performance regression (up to 6x), so I'm somewhat loath to suggest it




[GitHub] [arrow-rs] tustvold edited a comment on issue #1111: ArrowArrayReader Incorrect Data

tustvold edited a comment on issue #1111:
URL: https://github.com/apache/arrow-rs/issues/1111#issuecomment-1003119362


   Adding a print statement to `VariableLenDictionaryDecoder::new` shows it is being created twice with a `num_values` of 3, i.e. the number of rows in the row group.
   
   This is the `num_values` field from `Page`, which confusingly is the number of values **including** nulls. This value is then used to determine how many values to read from `RleDecoder` for this page. 
   
   Now a somewhat strange quirk of the [hybrid encoding](https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8) is that bit-packed "runs" are **always** multiples of 8 in length. This means that if the final run of a page is bit-packed, as opposed to RLE, it will be zero-padded up to a multiple of 8. Unfortunately the parquet designers opted not to store the actual length of a packed run, but the length / 8. This means the exact length of the final packed run of a page is not actually knowable from the run header alone...
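
   As a minimal sketch of the run-header rule just described (the ULEB128 varint decoding of the header itself is elided; this follows the format document linked above, it is not code from this crate):

   ```rust
   /// Interpret a decoded RLE/bit-packed hybrid run header.
   /// Returns (is_bit_packed, run_length_in_values).
   fn run_length(header: u64) -> (bool, usize) {
       if header & 1 == 1 {
           // Bit-packed run: the header stores the number of *groups of 8*,
           // so the run length is only ever known as a multiple of 8. A final
           // partial group is zero-padded, and the padding keys are
           // indistinguishable from real ones without an external value count.
           (true, ((header >> 1) as usize) * 8)
       } else {
           // RLE run: the header stores the exact repeat count.
           (false, (header >> 1) as usize)
       }
   }
   ```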
   
   This is where the issue arises. `VariableLenDictionaryDecoder` thinks it has more actual values than it does, as it is being fed the `value_count` for the page, which includes nulls that are never actually encoded. This means it asks `RleDecoder` for more keys than are actually present. As `RleDecoder` contains a zero-padded final run, it returns these padding keys too, which has the effect of "shifting" the string values in the final result.
   
   The fix should be a case of making whatever calls `ValueDecoder::read_value_bytes` request only the number of values that the page can actually be expected to yield. This is what `ColumnReaderImpl` [handles](https://github.com/apache/arrow-rs/blob/master/parquet/src/column/reader.rs#L208) for the non-`ArrowArrayReader` `ArrayReader` implementations. I need to do some digging to see how feasible this is with the design of `ArrowArrayReader`.




[GitHub] [arrow-rs] alamb commented on issue #1111: ArrowArrayReader Reads Too Many Values From Bit-Packed Runs

alamb commented on issue #1111:
URL: https://github.com/apache/arrow-rs/issues/1111#issuecomment-1003359567


   > it's getting pretty late now, but tomorrow I will try to write the missing test (that doesn't rely on an external parquet file) to reproduce the issue with VariableLenDictionaryDecoder / RleDecoder and also think on a short-term fix
   
   A short-term fix would be nice so that we can get correct answers from the existing code while @tustvold works on the longer-term / better fix.


[GitHub] [arrow-rs] yordan-pavlov commented on issue #1111: ArrowArrayReader Reads Too Many Values From Bit-Packed Runs

yordan-pavlov commented on issue #1111:
URL: https://github.com/apache/arrow-rs/issues/1111#issuecomment-1003446768


   I have been able to reproduce the issue where `RleDecoder` returns more keys than values (as explained by @tustvold above) by adding a test very similar to the existing `test_arrow_array_reader_string`, but using dictionary encoding instead of plain. Next, I will look for a short-term fix.
   
   Here is some sample output from the test:
   
   ```
   running 1 test
   page num_values: 100, values.len(): 33
   page num_values: 100, values.len(): 38
   VariableLenPlainDecoder::new, num_values: 9

   ---------- reading a batch of 50 values ----------
   VariableLenDictionaryDecoder::new, num_values: 100
   VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 100, num_values: 14
   VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 14, self.num_values: 86
   // ok so far, 33 actual values - 14 values read = 19 values still left in first page

   ---------- reading a batch of 100 values ----------
   VariableLenPlainDecoder::new, num_values: 10
   VariableLenDictionaryDecoder::new, num_values: 100
   VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 86, num_values: 37
   VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 26, self.num_values: 0
   // this is a problem - only 19 values were left in the first page, but 26 values have been read

   VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 0, num_values: 11
   VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 0, self.num_values: 0
   VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 100, num_values: 11
   VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 11, self.num_values: 89
   thread 'arrow::arrow_array_reader::tests::test_arrow_array_reader_dict_string' panicked at 'assertion failed: `(left == right)`
     left: `"H"`,
    right: `"He"`', parquet\src\arrow\arrow_array_reader.rs:1745:17
   ```
   


[GitHub] [arrow-rs] yordan-pavlov commented on issue #1111: ArrowArrayReader Reads Too Many Values From Bit-Packed Runs

yordan-pavlov commented on issue #1111:
URL: https://github.com/apache/arrow-rs/issues/1111#issuecomment-1003631709


   UPDATE: for the short-term fix, the only option I can think of is (when def levels are present) to count the number of actual, non-null values here https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L393 before creating the value reader, and to use this instead of `num_values`.
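
   A minimal sketch of that counting step (illustrative names, not the real crate internals), assuming the page's definition levels and the column's `max_def_level` are at hand:

   ```rust
   /// Number of values actually encoded in a page: when definition levels
   /// are present, only slots at `max_def_level` carry a real value;
   /// otherwise every slot does.
   fn encoded_value_count(
       def_levels: Option<&[i16]>,
       max_def_level: i16,
       num_values: usize,
   ) -> usize {
       match def_levels {
           Some(levels) => levels.iter().filter(|&&l| l == max_def_level).count(),
           None => num_values,
       }
   }
   ```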
   
   This then makes the new test (using dictionary-encoded pages) pass - notice how in the test output below the value of `num_values` in the `VariableLenDictionaryDecoder` is the actual number of values instead of also counting NULLs:
   
   ```
   running 1 test
   page num_values: 100, values.len(): 25
   page num_values: 100, values.len(): 31
   VariableLenPlainDecoder::new, num_values: 10
   ---------- reading a batch of 50 values ----------
   VariableLenDictionaryDecoder::new, num_values: 25
   VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 25, num_values: 11
   VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 11, self.num_values: 14
   ---------- reading a batch of 100 values ----------
   VariableLenPlainDecoder::new, num_values: 10
   VariableLenDictionaryDecoder::new, num_values: 31
   VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 14, num_values: 31
   VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 14, self.num_values: 0
   VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 0, num_values: 17
   VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 0, self.num_values: 0
   VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 31, num_values: 17
   VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 17, self.num_values: 14
   ---------- reading a batch of 100 values ----------
   VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 14, num_values: 14
   VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 14, self.num_values: 0
   test arrow::arrow_array_reader::tests::test_arrow_array_reader_dict_string ... ok

   test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 471 filtered out; finished in 0.01s
   ```
   
   
   Tomorrow I will be checking the impact on performance and will possibly create a pull request for the new test plus the short-term fix.


[GitHub] [arrow-rs] yordan-pavlov commented on issue #1111: ArrowArrayReader Reads Too Many Values From Bit-Packed Runs

yordan-pavlov commented on issue #1111:
URL: https://github.com/apache/arrow-rs/issues/1111#issuecomment-1003718555


   UPDATE: the performance degradation after the fix is actually not so bad - only between 3% and 8%:
   ```
   arrow_array_reader/read StringArray, plain encoded, optional, half NULLs - old
                           time:   [1.4393 ms 1.4485 ms 1.4585 ms]
                           change: [-9.2499% -3.6008% +1.8053%] (p = 0.23 > 0.05)
   arrow_array_reader/read StringArray, plain encoded, optional, half NULLs - new
                           time:   [594.37 us 604.91 us 618.75 us]
                           change: [+5.1998% +8.7402% +12.498%] (p = 0.00 < 0.05)
   
   arrow_array_reader/read StringArray, dictionary encoded, optional, half NULLs - old
                           time:   [1.1766 ms 1.1857 ms 1.1950 ms]
                           change: [-4.5337% -2.8033% -1.0591%] (p = 0.00 < 0.05)
   arrow_array_reader/read StringArray, dictionary encoded, optional, half NULLs - new
                           time:   [467.90 us 471.83 us 476.11 us]
                           change: [-0.8465% +3.1251% +7.0623%] (p = 0.12 > 0.05)
   ```
   I will try to submit a PR for this later today


[GitHub] [arrow-rs] yordan-pavlov edited a comment on issue #1111: ArrowArrayReader Reads Too Many Values From Bit-Packed Runs

yordan-pavlov edited a comment on issue #1111:
URL: https://github.com/apache/arrow-rs/issues/1111#issuecomment-1003176898


   @tustvold's work on making the "old" arrow reader faster looks promising, and I've hit a wall in making `ArrowArrayReader` faster in some cases, so I agree that it doesn't make sense to try to fix `ArrowArrayReader`; instead it makes more sense to focus on finishing the work in #1082.
   
   `ArrowArrayReader` does use def levels to calculate how many values to read, see here https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L576, but reading def levels is done independently of reading values (one of the main characteristics of `ArrowArrayReader`), so if the `num_values` field from `Page` is incorrect and does indeed include NULL values, then the entire idea of `ArrowArrayReader` falls apart and it won't work correctly. It makes me wonder how the unit tests pass, though; I will have a look at this in the next few days.
   
   One reason I implemented `ArrowArrayReader` as a new struct was so that it could easily be swapped out if necessary, and I guess that time has come. Yes, it will result in a significant reduction in the performance of reading string arrays, but correctness is more important than speed, and it would only be until #1082 is finished.
   
   




[GitHub] [arrow-rs] yordan-pavlov commented on issue #1111: ArrowArrayReader Reads Too Many Values From Bit-Packed Runs

yordan-pavlov commented on issue #1111:
URL: https://github.com/apache/arrow-rs/issues/1111#issuecomment-1003224774


   Here is what I've found so far:
   * there is a test for plain-encoded strings that exercises null values and multiple pages here https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L1493
   * `VariableLenPlainDecoder` gets used in the above test and it does work correctly, because although the value of `num_values` for the decoder does include NULLs, it stops reading at the end of the page, since it checks that it doesn't read past the end of the values buffer (`while self.position < data_len`) here https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L919
   * what is missing is a test that exercises the `VariableLenDictionaryDecoder`
   * the `VariableLenDictionaryDecoder` relies on the `RleDecoder` not to read out of its buffer and to return 0 when no more values can be read here https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L1069 (see the toy sketch at the end of this comment)

   it's getting pretty late now, but tomorrow I will try to write the missing test (one that doesn't rely on an external parquet file) to reproduce the issue with `VariableLenDictionaryDecoder` / `RleDecoder` and also think about a short-term fix
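
   As a toy model of the failure mode in the last bullet above (simplified, not the real decoder code): the final bit-packed run is padded to a multiple of 8 keys, and a caller that trusts the page's null-inclusive value count reads the padding as if it were real dictionary keys:

   ```rust
   /// Hand out up to `requested` dictionary keys from a bit-packed run whose
   /// buffer is always a multiple of 8 long (trailing zeros may be padding).
   fn read_keys(packed_run: &[u32], requested: usize) -> &[u32] {
       &packed_run[..requested.min(packed_run.len())]
   }

   fn main() {
       // 3 real keys, zero-padded to a group of 8.
       let run = [5, 2, 7, 0, 0, 0, 0, 0];
       // The caller asks for 6 values (rows including nulls) instead of the
       // 3 non-null values that were actually encoded...
       let keys = read_keys(&run, 6);
       // ...and gets three bogus zero keys, shifting every later value.
       assert_eq!(keys, &[5, 2, 7, 0, 0, 0]);
   }
   ```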


[GitHub] [arrow-rs] alamb closed issue #1111: ArrowArrayReader Reads Too Many Values From Bit-Packed Runs

alamb closed issue #1111:
URL: https://github.com/apache/arrow-rs/issues/1111


   

