Posted to dev@parquet.apache.org by shyam narayan singh <sh...@gmail.com> on 2019/04/30 05:20:02 UTC

Issue with writing null values to complex type.

Hi

I have encountered a regression when writing nulls to a complex type. I
recently moved from Parquet 1.8.x to 1.12.

Here is what I found out.

My dataset has 111k null values to be written to a complex type. With
1.8.x this would create a single page, but with 1.12 it creates 20 pages
(PARQUET-1414).
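For context, the column has roughly the following shape (a minimal sketch,
not my exact schema; an optional field inside an optional group is what
produces the RL=0/DL=2 you'll see in the dumps below):

    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.MessageTypeParser;

    public class SchemaShape {
      public static void main(String[] args) {
        // An optional binary inside an optional group: max repetition level 0,
        // max definition level 2 -- matching "RL=0 DL=2" in the dumps below.
        MessageType schema = MessageTypeParser.parseMessageType(
            "message doc { optional group index { optional binary _id; } }");
        System.out.println(schema);
      }
    }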

Writing nulls to complex types is optimised by caching them (the null
cache); the cache is flushed on the next non-null value or on an explicit
flush/close. With 1.8, the explicit close flushed the null cache and the
page was written once, at the end. With 1.12, however, after encountering
20k values the page is cut prematurely, while the nulls are still sitting
in the cache.
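To make the interaction concrete, here is a rough sketch of the two write
paths as I understand them (class and method names are invented; this is
not the actual parquet-mr code):

    // Hypothetical sketch of the writer-side interaction; names are invented.
    class ColumnWriterSketch {
      private int cachedNulls = 0;             // the "null cache"
      private int rowsInCurrentPage = 0;
      private static final int PAGE_ROW_COUNT_LIMIT = 20_000; // PARQUET-1414 default

      void writeNull() {
        cachedNulls++;                         // nulls are cached, not encoded yet
        rowsInCurrentPage++;
        if (rowsInCurrentPage >= PAGE_ROW_COUNT_LIMIT) {
          writePage();                         // 1.12: page is cut here, cache intact
        }
      }

      void writeNonNull(Object value) {
        flushNullCache();                      // cache is drained only here...
        rowsInCurrentPage++;
        // ... encode value ...
      }

      void close() {
        flushNullCache();                      // ... or here, on explicit close (1.8 path)
        writePage();
      }

      private void flushNullCache() {
        // encode 'cachedNulls' definition levels into the current page
        cachedNulls = 0;
      }

      private void writePage() {
        // if the cache was never drained, the page carries no values (VC:0, SZ:4)
        rowsInCurrentPage = 0;
      }
    }

Below is the metadata dump in both cases.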

1.8 :

index._id TV=111396 RL=0 DL=2
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[num_nulls: 111396, min/max not defined] SZ:8 VC:111396

1.12 :

index._index TV=111396 RL=0 DL=2
----------------------------------------------------------------------------
    page 0:   DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this column] SZ:4 VC:0
    ......
    page 19:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this column] SZ:8 VC:111396

All the pages in 1.12 except the last one have the same metadata. The
issue is that when the Parquet reader kicks in, it sees that the
repetition levels are BIT_PACKED and reads 8 bytes, which goes beyond the
stream since the page size is only 4 (reading past the RLE/BitPacking
stream).

I think the null cache should be flushed before any page write.
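In terms of the sketch above, the fix would be to drain the cache before
the page is cut (again, invented names):

    void writeNull() {
      cachedNulls++;
      rowsInCurrentPage++;
      if (rowsInCurrentPage >= PAGE_ROW_COUNT_LIMIT) {
        flushNullCache();   // proposed: drain the null cache first,
        writePage();        // so the page actually carries the cached values
      }
    }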

For now, I have increased the page row count limit to INT_MAX, which
negates everything done for PARQUET-1414. Are there any implications?
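For reference, this is roughly how the workaround looks from the writer
builder (a sketch using the example Group writer from parquet-hadoop;
adapt to whatever writer you use):

    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.example.ExampleParquetWriter;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.MessageTypeParser;

    public class Workaround {
      public static void main(String[] args) throws Exception {
        MessageType schema = MessageTypeParser.parseMessageType(
            "message doc { optional group index { optional binary _id; } }");
        ParquetWriter<Group> writer = ExampleParquetWriter
            .builder(new Path("/tmp/out.parquet"))
            .withType(schema)
            .withPageRowCountLimit(Integer.MAX_VALUE) // effectively disables PARQUET-1414
            .build();
        // ... write records ...
        writer.close();
      }
    }

If I read the code right, the same limit can also be set via the
parquet.page.row.count.limit property on the Hadoop config side.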

Please let me know the next steps.

Regards
Shyam