You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by "Micah Kornfield (Jira)" <ji...@apache.org> on 2020/10/31 04:28:00 UTC

[jira] [Commented] (PARQUET-1936) WriteBatchSpaced writes incorrect value for parquet when input contains NULL list

    [ https://issues.apache.org/jira/browse/PARQUET-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223995#comment-17223995 ] 

Micah Kornfield commented on PARQUET-1936:
------------------------------------------

[~Ruta Dhaneshwar] part of this might be related to PARQUET-1935 however, part of this is issue  might however be incorrect usage of the API.  In both cases, the "values" array should not contain any elements corresponding to empty or null lists (and neither should the bitmap).

 

Thank for you for the bug report, it is generally most helpful if you provide minimal code that can reproduce the issues instead of providing pseudo-code.

> WriteBatchSpaced writes incorrect value for parquet when input contains NULL list
> ---------------------------------------------------------------------------------
>
>                 Key: PARQUET-1936
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1936
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: Ruta Dhaneshwar
>            Priority: Major
>         Attachments: NULL list 1.png, NULL list 2.png, NULL list 3.png, NULL list 4.png, schema.png
>
>
> When trying to write a column of parquet lists, if there is a NULL list, WriteBatchSpaced will either throw an error (case 1 below) or incorrectly write the last value in the last list as the first value from the first list (case 2 below).
> *!schema.png|width=235,height=106!*
> *CASE 1*
>  Data (3 lists):
>  [
>     "one"
>  ]
>  null
>  [
>     "two"
>  ]
>   
>  Parameters to TypedColumnWriter<PhysicalType<parquet::Type::BYTE_ARRAY>>::WriteBatchSpaced:
>  # num_values: 3
>  # def_levels: [3, 0, 3]
>  # rep_levels: [0, 0, 0]
>  # valid_bits: 0x05 (bit representation 101)
>  # valid_bits_offset: 0
>  # values: ["one", nullptr, "two"]
> When I use WriteBatchSpaced(num_values, def_levels, rep_levels, valid_bits, valid_bits_offset, values), I get an error when running [parquet-tools|https://github.com/apache/parquet-mr/tree/master/parquet-tools] on the outputted parquet file:
> !NULL list 1.png|width=332,height=52!
> !NULL list 2.png|width=757,height=249!
> Additionally, if I add another list into the data that I write, then the last element of that additional list is incorrectly written as the first element of the first list. See below.
>   
>  *CASE 2*
>  Data (4 lists):
>  [
>     "one"
>  ]
>  null
>  [
>     "two"
>  ]
>  [
>     "three",
>     "four"
>  ]
>   
>  Parameters to TypedColumnWriter<PhysicalType<parquet::Type::BYTE_ARRAY>>::WriteBatchSpaced:
>  # num_values: 5
>  # def_levels: [3, 0, 3, 3, 3]
>  # rep_levels: [0, 0, 0, 0, 1]
>  # valid_bits: 0x29 (bit representation 11101)
>  # valid_bits_offset: 0
>  # values: ["one", nullptr, "two", "three", "four"]
> Outputted Parquet File: 
> !NULL list 3.png|width=72,height=145!
> !NULL list 4.png|width=237,height=76!
>   
>  Here we see that the "four" in the last list actually shows up as "one". 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)