Posted to dev@drill.apache.org by "Peter Franzen (Jira)" <ji...@apache.org> on 2023/10/23 10:52:00 UTC

[jira] [Created] (DRILL-8458) Reading Parquet data page with repetition levels larger than column data throws IllegalArgumentException

Peter Franzen created DRILL-8458:
------------------------------------

             Summary: Reading Parquet data page with repetition levels larger than column data throws IllegalArgumentException
                 Key: DRILL-8458
                 URL: https://issues.apache.org/jira/browse/DRILL-8458
             Project: Apache Drill
          Issue Type: Bug
          Components: Storage - Parquet
    Affects Versions: 1.21.1
            Reporter: Peter Franzen


When the size of the repetition level bytes in a Parquet data page is larger than the size of the column data bytes, {{org.apache.parquet.hadoop.ColumnChunkIncReadStore$ColumnChunkIncPageReader::readPage}} throws an {{IllegalArgumentException}}. This is caused by trying to set the limit of a ByteBuffer to a value larger than its capacity.
 
The offending code is at line 226 in {{ColumnChunkIncReadStore.java}}:
 
{code:java}
217 int pageBufOffset = 0;
218 ByteBuffer bb = (ByteBuffer) pageBuf.position(pageBufOffset);
219 BytesInput repLevelBytes = BytesInput.from(
220   (ByteBuffer) bb.slice().limit(pageBufOffset + repLevelSize)
221 );
222 pageBufOffset += repLevelSize;
223
224 bb = (ByteBuffer) pageBuf.position(pageBufOffset);
225 final BytesInput defLevelBytes = BytesInput.from(
226   (ByteBuffer) bb.slice().limit(pageBufOffset + defLevelSize)
227 );
228 pageBufOffset += defLevelSize;  {code}
 
The buffer {{pageBuf}} contains the repetition level bytes followed by the definition level bytes followed by the column data bytes.
 
The code at lines 217-221 reads the repetition level bytes, and then updates the position of the {{pageBuf}} buffer to the start of the definition level bytes (lines 222 and 224).
 
The code at lines 225-227 reads the definition level bytes, but when creating a slice of the {{pageBuf}} buffer containing the definition level bytes, the slice's limit is set as if the position were still at the beginning of the repetition level bytes (line 226), i.e. as if it had not been updated.
 
This means that the limit (repLevelSize + defLevelSize) is applied to a slice whose capacity is only the capacity of {{pageBuf}} minus repLevelSize. The call to {{limit()}} therefore throws whenever the capacity of {{pageBuf}} (the size of the repetition level bytes + the size of the definition level bytes + the size of the column data bytes) is less than (repLevelSize + repLevelSize + defLevelSize), i.e. whenever the column data bytes are smaller than the repetition level bytes.
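The failure mode can be reproduced in isolation, without any Drill or Parquet classes. The sketch below uses hypothetical section sizes chosen so that the column data is smaller than the repetition levels; the buggy limit from line 226 then exceeds the slice's capacity:

```java
import java.nio.ByteBuffer;

public class SliceLimitDemo {
    public static void main(String[] args) {
        // Hypothetical page layout: 8 bytes of rep levels, 4 bytes of def
        // levels, 2 bytes of column data (data smaller than rep levels).
        int repLevelSize = 8;
        int defLevelSize = 4;
        int dataSize = 2;
        ByteBuffer pageBuf = ByteBuffer.allocate(repLevelSize + defLevelSize + dataSize); // capacity 14

        int pageBufOffset = repLevelSize; // position at the start of the def levels
        ByteBuffer bb = (ByteBuffer) pageBuf.position(pageBufOffset);
        // The slice starts at the position, so its capacity is 14 - 8 = 6,
        // while the buggy limit is pageBufOffset + defLevelSize = 12.
        try {
            bb.slice().limit(pageBufOffset + defLevelSize);
            System.out.println("no exception");
        } catch (IllegalArgumentException e) {
            System.out.println("IllegalArgumentException: limit exceeds slice capacity");
        }
    }
}
```

With the sizes above, {{pageBuf}}'s capacity (14) is less than repLevelSize + repLevelSize + defLevelSize (20), so the {{limit()}} call throws.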
 
The fix is to change line 226 to
{code:java}
  (ByteBuffer) bb.slice().limit(defLevelSize){code}
 
For symmetry, line 220 could also be changed to
{code:java}
  (ByteBuffer) bb.slice().limit(repLevelSize){code}
 
although {{pageBufOffset}} is always 0 at that point, so the limit cannot exceed the capacity there.
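With both limits corrected, the same page layout slices cleanly. A minimal sketch of the fixed logic (hypothetical sizes, plain ByteBuffers in place of the Parquet {{BytesInput}} wrappers):

```java
import java.nio.ByteBuffer;

public class SliceLimitFixed {
    public static void main(String[] args) {
        // Same hypothetical layout as before: 8 + 4 + 2 bytes.
        int repLevelSize = 8;
        int defLevelSize = 4;
        int dataSize = 2;
        ByteBuffer pageBuf = ByteBuffer.allocate(repLevelSize + defLevelSize + dataSize);

        int pageBufOffset = 0;
        ByteBuffer bb = (ByteBuffer) pageBuf.position(pageBufOffset);
        // A slice starts at the buffer's current position, so the limit is
        // just the section's own size, not pageBufOffset + size.
        ByteBuffer repLevelBytes = (ByteBuffer) bb.slice().limit(repLevelSize);
        pageBufOffset += repLevelSize;

        bb = (ByteBuffer) pageBuf.position(pageBufOffset);
        ByteBuffer defLevelBytes = (ByteBuffer) bb.slice().limit(defLevelSize);
        pageBufOffset += defLevelSize;

        System.out.println(repLevelBytes.remaining()); // 8
        System.out.println(defLevelBytes.remaining()); // 4
    }
}
```

Each slice now covers exactly its own section regardless of how small the column data is, so no limit can exceed a slice's capacity.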
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)