Posted to jira@arrow.apache.org by "Benedikt (Jira)" <ji...@apache.org> on 2022/06/02 08:04:00 UTC

[jira] [Commented] (ARROW-13024) [C++][Parquet] Decoding byte stream split encoded columns fails when it has null values

    [ https://issues.apache.org/jira/browse/ARROW-13024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545332#comment-17545332 ] 

Benedikt commented on ARROW-13024:
----------------------------------

I've just encountered the same issue (on Arrow 7.0.0). Starting from [~romankarlstetter]'s comment, I've browsed the {{ByteStreamSplitDecoder}} code a bit, and I think the suggestion above points at the right place but is not a complete fix.

The relevant exception was added in [https://github.com/apache/arrow/pull/6781] in order to fix an out-of-bounds buffer read found by fuzzing. However, it seems to me that this fix wasn't quite right in the first place. My understanding is the following:

In {{SetData}}, the number of {{null}} values interleaved with the data is not known, but {{num_values}} includes them. Thus, it is not possible to validate {{num_values}} against {{len}} in this function yet. Any validation must happen in the {{Decode}} and {{DecodeArrow}} methods (in fact, {{Decode}} is already safe because it limits the number of values it returns, and {{DecodeArrow}} performs precisely this verification).
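
For reference, the check added there looks roughly like this (paraphrased from the 7.0.0 sources, so the exact wording may differ):

{code:cpp}
// In ByteStreamSplitDecoder<DType>::SetData (roughly):
// num_values counts interleaved nulls too, so for a page with nulls the
// buffer legitimately holds fewer than num_values values and this throws.
if (num_values * static_cast<int64_t>(sizeof(T)) > len) {
  throw ParquetException("Data size too small for number of values (corrupted file?)");
}
num_values_in_buffer_ = num_values;
{code}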

As suggested above, {{num_values_in_buffer_}} is not correctly initialized: {{num_values}} counts {{null}}s, while the buffer only holds the non-null values. As argued by [~romankarlstetter], it should instead be set to {{len / static_cast<int64_t>(sizeof(T))}}, so that it actually has the meaning suggested by its name.

However, when changing {{num_values_in_buffer_}} in this way, the calculation of {{const int num_decoded_previously = num_values_in_buffer_ - num_values_}} in {{Decode}} and {{DecodeArrow}} becomes incorrect, since {{num_values_}} includes {{null}}s but {{num_values_in_buffer_}} no longer does. To complete the bug fix, it might be easiest to store {{num_decoded_previously}} explicitly as a member of {{ByteStreamSplitDecoder}}.

If this is done, {{Decode}} should calculate

{{values_to_decode = std::min(num_values_in_buffer_ - num_decoded_previously, max_values)}}

instead of the current

{{values_to_decode = std::min(num_values_, max_values)}}.
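
Putting both changes together, here is a rough, untested sketch of what I have in mind ({{num_decoded_previously_}} is the hypothetical new member; everything else abbreviated):

{code:cpp}
// Untested sketch of changes to ByteStreamSplitDecoder<DType>;
// num_decoded_previously_ is a hypothetical new int member, default 0.

void SetData(int num_values, const uint8_t* data, int len) override {
  DecoderImpl::SetData(num_values, data, len);
  // Derive the number of actually encoded values from the buffer size,
  // instead of trusting num_values (which also counts nulls).
  num_values_in_buffer_ = len / static_cast<int>(sizeof(T));
  num_decoded_previously_ = 0;
}

int Decode(T* buffer, int max_values) override {
  // Clamp against what is really left in the buffer, not against
  // num_values_ (which still includes nulls absent from the buffer).
  const int values_to_decode =
      std::min(num_values_in_buffer_ - num_decoded_previously_, max_values);
  // ... gather the byte streams into buffer as before ...
  num_decoded_previously_ += values_to_decode;
  num_values_ -= values_to_decode;
  return values_to_decode;
}
{code}

{{DecodeArrow}} would need the analogous change to its {{num_decoded_previously}} calculation.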

I can't really tell from the bug report whether this already accounts for the issues found by the fuzzer.

> [C++][Parquet] Decoding byte stream split encoded columns fails when it has null values
> ---------------------------------------------------------------------------------------
>
>                 Key: ARROW-13024
>                 URL: https://issues.apache.org/jira/browse/ARROW-13024
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet
>    Affects Versions: 2.0.0, 3.0.0, 4.0.0
>            Reporter: Roman Karlstetter
>            Priority: Major
>
> Reading from a Parquet file fails with the following error:
> {{Data size too small for number of values (corrupted file?)}}.
> This happens when there is a {{BYTE_STREAM_SPLIT}}-encoded column which has fewer values stored than the number of rows, which is the case when the column has null values (definition levels are present).
> The problematic part is the condition checked in {{ByteStreamSplitDecoder<DType>::SetData}}, which raises the error if the number of values does not match the size of the data array.
> I'm unsure whether I have enough experience with the internals of the encoding/decoding part of this implementation to fix this issue, but my suggestion would be to initialize {{num_values_in_buffer_}} with {{len/static_cast<int64_t>(sizeof(T))}}.


