Posted to dev@parquet.apache.org by "Ruta Dhaneshwar (Jira)" <ji...@apache.org> on 2020/11/03 19:39:00 UTC

[jira] [Commented] (PARQUET-1936) WriteBatchSpaced writes incorrect value for parquet when input contains NULL list

    [ https://issues.apache.org/jira/browse/PARQUET-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17225630#comment-17225630 ] 

Ruta Dhaneshwar commented on PARQUET-1936:
------------------------------------------

[~emkornfield] thanks for your response. Changing the parameters so that "values" and the validity bitmap exclude empty and null lists solved the problem. Could this be clarified in the comment for the WriteBatchSpaced function? The part about "... but the values include the null entries with definition level == (max_definition_level - 1)." was confusing, and I don't think empty lists are mentioned at all. Thanks again!
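
For anyone else hitting this, here is a rough sketch of how I now build the arguments. It is only an illustration under assumptions: it uses plain std::string instead of parquet::ByteArray, the names Element, List, SpacedBatch and BuildSpacedBatch are hypothetical helpers (not part of the Parquet C++ API), and it assumes the schema from this issue, where max_definition_level is 3 (null list = 0, empty list = 1, null element = 2, present element = 3). The point is that null and empty lists contribute a definition/repetition level but no slot in "values" or the bitmap.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper types for illustration; not part of the Parquet C++ API.
using Element = std::optional<std::string>;
using List = std::optional<std::vector<Element>>;

struct SpacedBatch {
  std::vector<int16_t> def_levels;
  std::vector<int16_t> rep_levels;
  std::vector<std::string> values;  // one slot per element entry (def level >= 2)
  std::vector<uint8_t> valid_bits;  // LSB-first bitmap, one bit per slot in values
};

// Assumes max_definition_level == 3, as in the schema attached to this issue.
SpacedBatch BuildSpacedBatch(const std::vector<List>& lists) {
  SpacedBatch out;
  int64_t slot = 0;
  // Appends one slot to values and sets its validity bit if it is non-null.
  auto push_slot = [&](bool valid, const std::string& value) {
    if (slot % 8 == 0) out.valid_bits.push_back(0);
    if (valid) out.valid_bits.back() |= static_cast<uint8_t>(1u << (slot % 8));
    out.values.push_back(value);
    ++slot;
  };
  for (const List& list : lists) {
    if (!list.has_value()) {  // null list: levels only, no value slot
      out.def_levels.push_back(0);
      out.rep_levels.push_back(0);
      continue;
    }
    if (list->empty()) {  // empty list: levels only, no value slot
      out.def_levels.push_back(1);
      out.rep_levels.push_back(0);
      continue;
    }
    for (size_t i = 0; i < list->size(); ++i) {
      const Element& element = (*list)[i];
      out.rep_levels.push_back(i == 0 ? 0 : 1);  // 0 starts a new list
      out.def_levels.push_back(element.has_value() ? 3 : 2);
      // A null element (def level 2 == max - 1) still occupies a slot.
      push_slot(element.has_value(), element.value_or(std::string()));
    }
  }
  assert(out.def_levels.size() == out.rep_levels.size());
  return out;
}
```

For the data in CASE 1 this gives def_levels [3, 0, 3], rep_levels [0, 0, 0], values ["one", "two"] and valid_bits 0x03, i.e. the levels are unchanged but the nullptr slot and its zero bit are gone.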

> WriteBatchSpaced writes incorrect value for parquet when input contains NULL list
> ---------------------------------------------------------------------------------
>
>                 Key: PARQUET-1936
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1936
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: Ruta Dhaneshwar
>            Priority: Major
>         Attachments: NULL list 1.png, NULL list 2.png, NULL list 3.png, NULL list 4.png, schema.png
>
>
> When trying to write a column of parquet lists, if there is a NULL list, WriteBatchSpaced will either throw an error (case 1 below) or incorrectly write the last value in the last list as the first value from the first list (case 2 below).
> *!schema.png|width=235,height=106!*
> *CASE 1*
>  Data (3 lists):
>  [
>     "one"
>  ]
>  null
>  [
>     "two"
>  ]
>   
>  Parameters to TypedColumnWriter<PhysicalType<parquet::Type::BYTE_ARRAY>>::WriteBatchSpaced:
>  # num_values: 3
>  # def_levels: [3, 0, 3]
>  # rep_levels: [0, 0, 0]
>  # valid_bits: 0x05 (bit representation 101)
>  # valid_bits_offset: 0
>  # values: ["one", nullptr, "two"]
> When I use WriteBatchSpaced(num_values, def_levels, rep_levels, valid_bits, valid_bits_offset, values), I get an error when running [parquet-tools|https://github.com/apache/parquet-mr/tree/master/parquet-tools] on the outputted parquet file:
> !NULL list 1.png|width=332,height=52!
> !NULL list 2.png|width=757,height=249!
> Additionally, if I add another list into the data that I write, then the last element of that additional list is incorrectly written as the first element of the first list. See below.
>   
>  *CASE 2*
>  Data (4 lists):
>  [
>     "one"
>  ]
>  null
>  [
>     "two"
>  ]
>  [
>     "three",
>     "four"
>  ]
>   
>  Parameters to TypedColumnWriter<PhysicalType<parquet::Type::BYTE_ARRAY>>::WriteBatchSpaced:
>  # num_values: 5
>  # def_levels: [3, 0, 3, 3, 3]
>  # rep_levels: [0, 0, 0, 0, 1]
>  # valid_bits: 0x1D (decimal 29, bit representation 11101)
>  # valid_bits_offset: 0
>  # values: ["one", nullptr, "two", "three", "four"]
> Outputted Parquet File: 
> !NULL list 3.png|width=72,height=145!
> !NULL list 4.png|width=237,height=76!
>   
>  Here we see that the "four" in the last list actually shows up as "one". 


