Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/30 17:29:56 UTC

[GitHub] [arrow-rs] tustvold commented on issue #1111: ArrowArrayReader Incorrect Data

tustvold commented on issue #1111:
URL: https://github.com/apache/arrow-rs/issues/1111#issuecomment-1003119362

Adding a print statement to `VariableLenDictionaryDecoder::new` shows that it is being created twice with a `num_values` of 3, i.e. the number of rows in the row group.

This is the `num_values` field from `Page`, which confusingly is the number of values **including** nulls. This value is then used to determine how many values to read from `RleDecoder` for this page.

Now, a somewhat strange quirk of the [hybrid encoding](https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8) is that bit-packed "runs" are **always** multiples of 8 values in length. This means that if the final run of a page is bit-packed, as opposed to RLE, it will be zero-padded to length. Unfortunately the parquet designers opted not to store the actual length of a packed run, but the length / 8. This means the length of the final packed run of a page is not actually knowable...
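To make the header quirk concrete, here is a minimal sketch of how a run header is interpreted according to the Parquet encoding spec. The `Run` enum and `decode_run_header` are hypothetical names for illustration, not the arrow-rs `RleDecoder` API, and the header is assumed to have already been read as a varint.

```rust
/// Kind of run described by a hybrid RLE / bit-packing header.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
enum Run {
    /// A repeated value; the exact run length is stored.
    Rle { len: u64 },
    /// A bit-packed run; only the number of 8-value groups is stored,
    /// so the padded length is all a reader can recover.
    BitPacked { groups: u64, padded_len: u64 },
}

/// Interpret an already-decoded varint header.
/// Per the spec, the low bit selects the run kind and the
/// remaining bits carry either the run length or the group count.
fn decode_run_header(header: u64) -> Run {
    if header & 1 == 1 {
        let groups = header >> 1;
        Run::BitPacked { groups, padded_len: groups * 8 }
    } else {
        Run::Rle { len: header >> 1 }
    }
}
```

A header of `0b11` describes one bit-packed group, so the reader sees 8 slots regardless of whether 2 or 8 real values were written.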

This is where the issue arises. `VariableLenDictionaryDecoder` thinks it has more actual values than it does, as it is being fed the `value_count` for the page, which counts nulls even though they aren't encoded. This means it asks `RleDecoder` for more keys than should actually be present. As `RleDecoder` contains a zero-padded final run, it returns too many values, which has the effect of "shifting" the string values in the final result.
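The effect can be reproduced with a toy model. This is a sketch under stated assumptions: `pad_to_group` and `decode_keys` are hypothetical helpers standing in for a zero-padded packed run and a dictionary lookup, not the real `RleDecoder` or `VariableLenDictionaryDecoder`.

```rust
/// Hypothetical stand-in for a bit-packed run: the real keys,
/// zero-padded up to the next multiple of 8.
fn pad_to_group(keys: &[u32]) -> Vec<u32> {
    let padded_len = (keys.len() + 7) / 8 * 8;
    let mut out = keys.to_vec();
    out.resize(padded_len, 0);
    out
}

/// Read up to `requested` keys from the padded run and look each
/// one up in the dictionary. Nothing here can tell padding from data.
fn decode_keys<'a>(dict: &[&'a str], padded: &[u32], requested: usize) -> Vec<&'a str> {
    padded
        .iter()
        .take(requested)
        .map(|&k| dict[k as usize])
        .collect()
}
```

For a page with rows `["b", null, "b"]` and dictionary `["a", "b"]`, only two keys (`[1, 1]`) are encoded. Requesting 3 keys (the count including the null) yields `["b", "b", "a"]`: the trailing `"a"` comes from zero padding, and it pushes every later value out of position. Requesting 2 yields the correct `["b", "b"]`.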

The fix should be a case of making whatever calls `ValueDecoder::read_value_bytes` only request a number of values that the page should be expected to yield. This is what `RecordReader` and friends handle. I need to do some digging to see how feasible this is with the design of ArrowArrayReader.
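As a rough sketch of what "only request the values the page should yield" means: in Parquet, a value is physically present only where the definition level equals the column's maximum definition level. The helper below is hypothetical and is not how `RecordReader` is actually implemented; the `i16` level type is also an assumption.

```rust
/// Count how many values are physically encoded for a page, given its
/// decoded definition levels. Hypothetical helper for illustration.
/// A slot holds a real value only when its definition level equals
/// the column's maximum definition level; anything lower is a null.
fn values_to_read(def_levels: &[i16], max_def_level: i16) -> usize {
    def_levels
        .iter()
        .filter(|&&level| level == max_def_level)
        .count()
}
```

For an optional column with definition levels `[1, 0, 1]` and a maximum level of 1, this returns 2, which is the number of keys that should be requested rather than the page's `num_values` of 3.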
