Posted to dev@parquet.apache.org by "shyam narayan singh (JIRA)" <ji...@apache.org> on 2019/05/15 08:47:00 UTC

[jira] [Updated] (PARQUET-1575) Parquet reader throws error "Reading past RLE/BitPacking stream" for parquet file with null values

     [ https://issues.apache.org/jira/browse/PARQUET-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

shyam narayan singh updated PARQUET-1575:
-----------------------------------------
    Summary: Parquet reader throws error "Reading past RLE/BitPacking stream" for parquet file with null values  (was: Parquet reader throws error "Reading past RLE/BitPacking stream")

> Parquet reader throws error "Reading past RLE/BitPacking stream" for parquet file with null values
> --------------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-1575
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1575
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.12.0
>            Reporter: shyam narayan singh
>            Priority: Major
>
> We recently moved from Parquet 1.8.x to 1.12.
> The dataset has more than 20k null values to be written to a complex type. With 1.8.x, the writer would create a single page, but with 1.12 it creates 20 pages (PARQUET-1414). Writing nulls to complex types has been optimised so that they are cached (the null cache) and flushed on the next non-null value or on an explicit flush/close. With 1.8.x, the writer would reach the explicit close, flush the null cache, and then write the page. With 1.12, however, the page is written prematurely once 20k values have been encountered.
>  
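
The behaviour described above can be modelled with a toy writer. This is only a sketch under stated assumptions, not parquet-mr code: `ToyColumnWriter`, `page_row_limit`, and `flush_cache_on_page` are hypothetical names, and a small limit of 3 stands in for the 20k limit mentioned in the report. It assumes the row-count check writes a page without first flushing the null cache, which is what the VC:0 pages in the dump below suggest.

```python
class ToyColumnWriter:
    """Toy model of a column writer that caches consecutive nulls (hypothetical)."""

    def __init__(self, page_row_limit=None, flush_cache_on_page=False):
        # page_row_limit=None models the 1.8.x behaviour: no row-count page limit.
        self.page_row_limit = page_row_limit
        # Assumption: whether a page write first drains the null cache.
        self.flush_cache_on_page = flush_cache_on_page
        self.null_cache = 0        # nulls seen but not yet encoded
        self.encoded_values = 0    # values encoded into the current page
        self.rows_since_page = 0   # rows seen since the last page was written
        self.pages = []            # value count (VC) of each written page

    def write_null(self):
        self.null_cache += 1
        self.rows_since_page += 1
        limit = self.page_row_limit
        if limit is not None and self.rows_since_page >= limit:
            self._write_page()

    def _flush_null_cache(self):
        self.encoded_values += self.null_cache
        self.null_cache = 0

    def _write_page(self):
        if self.flush_cache_on_page:
            self._flush_null_cache()
        self.pages.append(self.encoded_values)
        self.encoded_values = 0
        self.rows_since_page = 0

    def close(self):
        # An explicit close always flushes the cache before the final page.
        self._flush_null_cache()
        self._write_page()
        return self.pages


def write_nulls(n, **kwargs):
    writer = ToyColumnWriter(**kwargs)
    for _ in range(n):
        writer.write_null()
    return writer.close()


print(write_nulls(10))                    # [10]: a single page, as with 1.8.x
print(write_nulls(10, page_row_limit=3))  # [0, 0, 0, 10]: empty pages, then everything
print(write_nulls(10, page_row_limit=3, flush_cache_on_page=True))  # [3, 3, 3, 1]
```

The third call shows one behaviour a reader might expect instead, where each page carries the nulls seen so far; it is an illustration, not necessarily the fix adopted in parquet-mr.
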
> Below is the metadata dump in both cases.
> 1.8 :
> index._id TV=111396 RL=0 DL=2
> ----------------------------------------------------------------------------
> page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[num_nulls: 111396, min/max not defined] SZ:8 VC:111396
>  
> 1.12 :
> index._index TV=111396 RL=0 DL=2
> ----------------------------------------------------------------------------
> page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this column] SZ:4 VC:0
> ......
> page 19: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this column] SZ:8 VC:111396
> In 1.12, all pages except the last have the same metadata. The problem arises when the parquet reader kicks in: it sees that the RLE is bit packed and reads 8 bytes, which goes beyond the end of the stream because the page size is only 4 bytes ("Reading past RLE/BitPacking stream").
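
The failure described in the last sentence amounts to a bounds check on the page's byte stream. Below is a minimal sketch, assuming a hypothetical `BoundedStream` helper (it is not parquet-mr's actual decoder, and the error text is only borrowed from the report): asking for 8 bytes from a 4-byte page fails, while the same request against an 8-byte page succeeds.

```python
import io


class BoundedStream:
    """Hypothetical byte stream that refuses to read past its end."""

    def __init__(self, data: bytes):
        self._buf = io.BytesIO(data)
        self._size = len(data)

    def available(self) -> int:
        # Bytes left between the current position and the end of the page.
        return self._size - self._buf.tell()

    def read_exact(self, n: int) -> bytes:
        if n > self.available():
            raise EOFError(
                "Reading past RLE/BitPacking stream: "
                f"wanted {n} bytes, {self.available()} available"
            )
        return self._buf.read(n)


# A 4-byte page (SZ:4) cannot satisfy an 8-byte read.
small_page = BoundedStream(b"\x00" * 4)
try:
    small_page.read_exact(8)
except EOFError as err:
    print(err)

# An 8-byte page (SZ:8) can.
full_page = BoundedStream(b"\x00" * 8)
print(len(full_page.read_exact(8)))  # 8
```
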



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)