You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2023/10/26 12:43:00 UTC

[jira] [Commented] (DRILL-8458) Reading Parquet v2 data page with repetition levels larger than column data throws IllegalArgumentException

    [ https://issues.apache.org/jira/browse/DRILL-8458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779873#comment-17779873 ] 

ASF GitHub Bot commented on DRILL-8458:
---------------------------------------

handmadecode commented on PR #2838:
URL: https://github.com/apache/drill/pull/2838#issuecomment-1781047548

   > Argh, a basic buffer arithmetic bug by _yours truly_. I guess it's remained undetected so far because Parquet v2 is still uncommon the wild. And because of insufficient Parquet v2 test coverage.
   
   I guess we've all caused our fair share of those bugs ;-)
   
   > Thank you very much for this fix which looks great. Would you mind seeing if you can relocate the test and its data? We've got TestParquetComplex already and also some Parquet v2 test files for which a naming pattern has been started, e.g.
   > 
   > ```
   > exec/java-exec/src/test/resources/parquet/parquet_v2_logical_types_simple.parquet
   
   Sure, how about renaming the test file to `exec/java-exec/src/test/resources/parquet/parquet_v2_large_repetition_levels.parquet`?
   
   Would you prefer if the test was moved to an existing test class, e.g. `TestParquetComplex`?
   




> Reading Parquet v2 data page with repetition levels larger than column data throws IllegalArgumentException
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-8458
>                 URL: https://issues.apache.org/jira/browse/DRILL-8458
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.21.1
>            Reporter: Peter Franzen
>            Assignee: James Turton
>            Priority: Major
>             Fix For: 1.22.0
>
>
> When the size of the repetition level bytes in a Parquet v2 data page is larger than the size of the column data bytes, {{org.apache.parquet.hadoop.ColumnChunkIncReadStore$ColumnChunkIncPageReader::readPage}} throws an {{{}IllegalArgumentException{}}}. This is caused by trying to set the limit of a ByteBuffer to a value large than its capacity.
>  
> The offending code is at line 226 in {{{}ColumnChunkIncReadStore.java{}}}:
>  
> {code:java}
> 217 int pageBufOffset = 0;
> 218 ByteBuffer bb = (ByteBuffer) pageBuf.position(pageBufOffset);
> 219 BytesInput repLevelBytes = BytesInput.from(
> 220   (ByteBuffer) bb.slice().limit(pageBufOffset + repLevelSize)
> 221 );
> 222 pageBufOffset += repLevelSize;
> 223
> 224 bb = (ByteBuffer) pageBuf.position(pageBufOffset);
> 225 final BytesInput defLevelBytes = BytesInput.from(
> 226   (ByteBuffer) bb.slice().limit(pageBufOffset + defLevelSize)
> 227 );
> 228 pageBufOffset += defLevelSize;  {code}
>  
> The buffer {{pageBuf}} contains the repetition level bytes followed by the definition level bytes followed by the column data bytes.
>  
> The code at lines 217-221 reads the repetition level bytes, and then updates the position of the {{pageBuf}} buffer to the start of the definition level bytes (lines 222 and 224).
>  
> The code at lines 225-227 reads the definition level bytes, and when creating a slice of the \{{pageBuf }}buffer containing the definition level bytes, the slice's limit is set as if the position was at the beginning of the repetition level bytes (line 226), i.e as if it not had been updated.
>  
> This means that if the capacity of the pageBuf buffer (which is the size of the repetition level bytes + the size of the definition level bytes + the size of the column data bytes) is less than (repLevelSize + repLevelSize + defLevelSize), the call to limit() will throw.
>  
> The fix is to change line 226 to
> {code:java}
>   (ByteBuffer) bb.slice().limit(defLevelSize){code}
>  
> For symmetry, line 220 could also be changed to
> {code:java}
>   (ByteBuffer) bb.slice().limit(repLevelSize){code}
>  
> although {{pageBufOffset}} is always 0 there and will not cause the limit to exceed the capacity.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)