Posted to dev@drill.apache.org by "James Turton (Jira)" <ji...@apache.org> on 2023/10/31 15:05:00 UTC
[jira] [Resolved] (DRILL-8458) Reading Parquet v2 data page with repetition levels larger than column data throws IllegalArgumentException

     [ https://issues.apache.org/jira/browse/DRILL-8458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Turton resolved DRILL-8458.
---------------------------------
    Resolution: Fixed

> Reading Parquet v2 data page with repetition levels larger than column data throws IllegalArgumentException
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-8458
>                 URL: https://issues.apache.org/jira/browse/DRILL-8458
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.21.1
>            Reporter: Peter Franzen
>            Assignee: James Turton
>            Priority: Major
>             Fix For: 1.22.0
>
>
> When the size of the repetition level bytes in a Parquet v2 data page is larger than the size of the column data bytes, {{org.apache.parquet.hadoop.ColumnChunkIncReadStore$ColumnChunkIncPageReader::readPage}} throws an {{IllegalArgumentException}}. This is caused by trying to set the limit of a ByteBuffer to a value larger than its capacity.
>  
> The offending code is at line 226 in {{ColumnChunkIncReadStore.java}}:
>  
> {code:java}
> 217 int pageBufOffset = 0;
> 218 ByteBuffer bb = (ByteBuffer) pageBuf.position(pageBufOffset);
> 219 BytesInput repLevelBytes = BytesInput.from(
> 220   (ByteBuffer) bb.slice().limit(pageBufOffset + repLevelSize)
> 221 );
> 222 pageBufOffset += repLevelSize;
> 223
> 224 bb = (ByteBuffer) pageBuf.position(pageBufOffset);
> 225 final BytesInput defLevelBytes = BytesInput.from(
> 226   (ByteBuffer) bb.slice().limit(pageBufOffset + defLevelSize)
> 227 );
> 228 pageBufOffset += defLevelSize;  {code}
>  
> The buffer {{pageBuf}} contains the repetition level bytes followed by the definition level bytes followed by the column data bytes.
>  
> The code at lines 217-221 reads the repetition level bytes, and then updates the position of the {{pageBuf}} buffer to the start of the definition level bytes (lines 222 and 224).
>  
> The code at lines 225-227 reads the definition level bytes. When it creates a slice of the {{pageBuf}} buffer containing the definition level bytes, the slice's limit is set as if the position were still at the beginning of the repetition level bytes (line 226), i.e. as if it had not been updated.
>  
> This means that if the capacity of the {{pageBuf}} buffer (which is the size of the repetition level bytes + the size of the definition level bytes + the size of the column data bytes) is less than ({{repLevelSize}} + {{repLevelSize}} + {{defLevelSize}}), i.e. if the column data bytes are smaller than the repetition level bytes, the call to {{limit()}} will throw.
>  
> The fix is to change line 226 to
> {code:java}
>   (ByteBuffer) bb.slice().limit(defLevelSize){code}
>  
> For symmetry, line 220 could also be changed to
> {code:java}
>   (ByteBuffer) bb.slice().limit(repLevelSize){code}
>  
> although {{pageBufOffset}} is always 0 there and will not cause the limit to exceed the capacity.
>  
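
The failure and the proposed fix can be reproduced outside Drill with a minimal sketch that uses only java.nio.ByteBuffer. The class name SliceLimitDemo, the helper method names and the byte sizes (4, 1, 2) are illustrative assumptions and not part of the Drill code base; the real code additionally wraps each slice in a BytesInput.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-alone demo; not part of Drill or Parquet.
class SliceLimitDemo {

    // Mirrors the reported logic at line 226: slice() already starts at
    // pageBufOffset, yet the limit is computed as if it started at 0.
    static ByteBuffer defLevelsBuggy(ByteBuffer pageBuf, int repLevelSize, int defLevelSize) {
        int pageBufOffset = repLevelSize;
        ByteBuffer bb = (ByteBuffer) pageBuf.duplicate().position(pageBufOffset);
        return (ByteBuffer) bb.slice().limit(pageBufOffset + defLevelSize);
    }

    // Proposed fix: the limit is relative to the slice, so it is just defLevelSize.
    static ByteBuffer defLevelsFixed(ByteBuffer pageBuf, int repLevelSize, int defLevelSize) {
        int pageBufOffset = repLevelSize;
        ByteBuffer bb = (ByteBuffer) pageBuf.duplicate().position(pageBufOffset);
        return (ByteBuffer) bb.slice().limit(defLevelSize);
    }

    public static void main(String[] args) {
        // Repetition levels (4 bytes) are larger than the column data (2 bytes).
        int repLevelSize = 4, defLevelSize = 1, dataSize = 2;
        ByteBuffer pageBuf = ByteBuffer.allocate(repLevelSize + defLevelSize + dataSize);

        try {
            defLevelsBuggy(pageBuf, repLevelSize, defLevelSize);
            System.out.println("buggy variant did not throw");
        } catch (IllegalArgumentException e) {
            // The slice has capacity 3 (7 - 4), but the requested limit is 5 (4 + 1).
            System.out.println("buggy variant threw IllegalArgumentException");
        }

        ByteBuffer defLevels = defLevelsFixed(pageBuf, repLevelSize, defLevelSize);
        System.out.println("fixed variant remaining bytes: " + defLevels.remaining());
    }
}
```

With these sizes the slice taken at offset 4 has a capacity of 3, so a limit of 5 is rejected, while a limit of defLevelSize stays within the slice.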



--
This message was sent by Atlassian Jira
(v8.20.10#820010)