Posted to dev@parquet.apache.org by "Ruta Dhaneshwar (Jira)" <ji...@apache.org> on 2020/10/31 00:55:00 UTC

[jira] [Created] (PARQUET-1936) WriteBatchSpaced writes incorrect value for parquet when input contains NULL list

Ruta Dhaneshwar created PARQUET-1936:
----------------------------------------

             Summary: WriteBatchSpaced writes incorrect value for parquet when input contains NULL list
                 Key: PARQUET-1936
                 URL: https://issues.apache.org/jira/browse/PARQUET-1936
             Project: Parquet
          Issue Type: Bug
          Components: parquet-cpp
            Reporter: Ruta Dhaneshwar


When writing a column of Parquet lists that contains a NULL list, WriteBatchSpaced either produces a file that parquet-tools fails to read (case 1 below) or writes the first value of the first list in place of the last value of the last list (case 2 below).
  
 Schema:
 message schema {
   optional group _COL_0 (LIST) {
     repeated group list {
       optional binary item (UTF8);
     }
   }
 }
  
 *CASE 1*
 Data (3 lists):
 [
    "one"
 ]
 null
 [
    "two"
 ]
  
 Parameters to TypedColumnWriter<PhysicalType<parquet::Type::BYTE_ARRAY>>::WriteBatchSpaced:
 # num_values: 3
 # def_levels: [3, 0, 3]
 # rep_levels: [0, 0, 0]
 # valid_bits: 0x05 (bit representation 101)
 # valid_bits_offset: 0 
 # values: ["one", nullptr, "two"]
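 For reference, the def_levels and rep_levels above follow from the schema: a null list gets definition level 0 and an item that is present gets the maximum level, 3. Below is a minimal sketch of that derivation, assuming C++17. The types Item, List and Levels and the function ComputeLevels are hypothetical helpers for illustration only; they are not part of parquet-cpp.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper types for illustration; not part of parquet-cpp.
using Item = std::optional<std::string>;        // optional binary item (UTF8)
using List = std::optional<std::vector<Item>>;  // optional group _COL_0 (LIST)

struct Levels {
  std::vector<int16_t> def_levels;
  std::vector<int16_t> rep_levels;
};

// Definition levels for this schema:
//   0 = list is null, 1 = list is empty, 2 = item is null, 3 = item is present.
// Repetition levels:
//   0 = first entry of a new row, 1 = further item in the same list.
Levels ComputeLevels(const std::vector<List>& column) {
  Levels out;
  for (const List& list : column) {
    if (!list.has_value()) {  // null list: one entry at definition level 0
      out.def_levels.push_back(0);
      out.rep_levels.push_back(0);
      continue;
    }
    if (list->empty()) {  // empty list: one entry at definition level 1
      out.def_levels.push_back(1);
      out.rep_levels.push_back(0);
      continue;
    }
    bool first = true;
    for (const Item& item : *list) {
      out.def_levels.push_back(item.has_value() ? 3 : 2);
      out.rep_levels.push_back(first ? 0 : 1);
      first = false;
    }
  }
  return out;
}
```

 Feeding the three lists of case 1 into ComputeLevels gives def_levels [3, 0, 3] and rep_levels [0, 0, 0], matching the parameters listed above.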
 When I use WriteBatchSpaced(num_values, def_levels, rep_levels, valid_bits, valid_bits_offset, values), I get the following error when running [parquet-tools|https://github.com/apache/parquet-mr/tree/master/parquet-tools] on the output Parquet file:
  
  
  
 !image-2020-10-30-17-45-37-423.png|width=358,height=56!
 !image-2020-10-30-17-46-33-370.png|width=638,height=210!
 Additionally, if I add another list to the data, the last element of that additional list is written incorrectly: it takes the value of the first element of the first list. See below.
  
 *CASE 2*
 Data (4 lists):
 [
    "one"
 ]
 null
 [
    "two"
 ]
 [
    "three",
    "four"
 ]
  
 Parameters to TypedColumnWriter<PhysicalType<parquet::Type::BYTE_ARRAY>>::WriteBatchSpaced:
 # num_values: 5
 # def_levels: [3, 0, 3, 3, 3]
 # rep_levels: [0, 0, 0, 0, 1]
 # valid_bits: 0x1D (decimal 29, bit representation 11101)
 # valid_bits_offset: 0 
 # values: ["one", nullptr, "two", "three", "four"]
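
 The valid_bits values in both cases can be reproduced from the definition levels. The sketch below assumes one bitmap bit per entry of values, which is how the parameters above are laid out, and it numbers bits from the least significant bit of each byte, as Arrow validity bitmaps do. BuildValidBits is a hypothetical helper for illustration, not a parquet-cpp function.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not a parquet-cpp function: marks entry i as valid
// when its definition level equals max_def_level. Bit i of the bitmap is
// bit (i % 8) of byte (i / 8), counted from the least significant bit.
std::vector<uint8_t> BuildValidBits(const std::vector<int16_t>& def_levels,
                                    int16_t max_def_level) {
  std::vector<uint8_t> bits((def_levels.size() + 7) / 8, 0);
  for (std::size_t i = 0; i < def_levels.size(); ++i) {
    if (def_levels[i] == max_def_level) {
      bits[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }
  }
  return bits;
}
```

 With max_def_level 3, the case 1 levels [3, 0, 3] give a single byte 0x05 (binary 101), and the case 2 levels [3, 0, 3, 3, 3] give a single byte with decimal value 29, that is 0x1D (binary 11101).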

Output Parquet file:
  
 !image-2020-10-30-17-47-31-022.png|width=77,height=155!
 !image-2020-10-30-17-48-11-872.png|width=233,height=75!
 Here we see that the "four" in the last list actually shows up as "one". 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)