Posted to dev@parquet.apache.org by "Ruta Dhaneshwar (Jira)" <ji...@apache.org> on 2020/10/31 00:56:00 UTC
[jira] [Updated] (PARQUET-1936) WriteBatchSpaced writes incorrect value for parquet when input contains NULL list
[ https://issues.apache.org/jira/browse/PARQUET-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ruta Dhaneshwar updated PARQUET-1936:
-------------------------------------
Description:
When writing a column of Parquet lists that contains a NULL list, WriteBatchSpaced produces a bad file: either the file cannot be read back (case 1 below), or the last value in the last list is written with the value of the first element of the first list (case 2 below).
Schema:
message schema {
  optional group _COL_0 (LIST) {
    repeated group list {
      optional binary item (UTF8);
    }
  }
}
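For reference, the definition and repetition levels used in the cases below follow from this schema, whose maximum definition level is 3 (one for each optional or repeated node). The following is a minimal sketch of that mapping, assuming the usual Dremel-style encoding; `ComputeLevels`, `Levels`, `Item`, and `Row` are hypothetical names for illustration and are not part of parquet-cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// One row of the column: a nullable list of nullable strings.
using Item = std::optional<std::string>;
using Row = std::optional<std::vector<Item>>;

struct Levels {
  std::vector<int16_t> def_levels;
  std::vector<int16_t> rep_levels;
};

// Hypothetical helper (not a parquet-cpp API): derives levels for the
// three-level LIST schema above.
Levels ComputeLevels(const std::vector<Row>& rows) {
  Levels out;
  for (const Row& row : rows) {
    if (!row.has_value()) {
      // NULL list: nothing on the path is defined.
      out.def_levels.push_back(0);
      out.rep_levels.push_back(0);
    } else if (row->empty()) {
      // Empty list: only _COL_0 is defined.
      out.def_levels.push_back(1);
      out.rep_levels.push_back(0);
    } else {
      for (std::size_t i = 0; i < row->size(); ++i) {
        // 2 = the element slot exists but the item is null, 3 = item present.
        out.def_levels.push_back((*row)[i].has_value() ? 3 : 2);
        // 0 starts a new row, 1 continues the current list.
        out.rep_levels.push_back(i == 0 ? 0 : 1);
      }
    }
  }
  return out;
}
```

Under these assumptions a NULL list contributes exactly one level entry, with definition level 0 and repetition level 0.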
*CASE 1*
Data (3 lists):
["one"]
null
["two"]
Parameters to TypedColumnWriter<PhysicalType<parquet::Type::BYTE_ARRAY>>::WriteBatchSpaced:
# num_values: 3
# def_levels: [3, 0, 3]
# rep_levels: [0, 0, 0]
# valid_bits: 0x05 (bit representation 101)
# valid_bits_offset: 0
# values: ["one", nullptr, "two"]
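The valid_bits value above can be checked by hand. The sketch below packs per-slot validity flags into a bitmap, assuming the Arrow convention of least-significant-bit-first ordering (slot i maps to bit i % 8 of byte i / 8). `PackValidBits` is a hypothetical helper for illustration, not a parquet-cpp function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a parquet-cpp API): packs validity flags into a
// bitmap, least significant bit first.
std::vector<uint8_t> PackValidBits(const std::vector<bool>& is_valid) {
  // Round up to a whole number of bytes; unused high bits stay 0.
  std::vector<uint8_t> bitmap((is_valid.size() + 7) / 8, 0);
  for (std::size_t i = 0; i < is_valid.size(); ++i) {
    if (is_valid[i]) {
      bitmap[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }
  }
  return bitmap;
}
```

With the three slots valid, null, valid, bits 0 and 2 are set, which gives the single byte 0x05 (binary 101), matching the value listed above.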
When I call WriteBatchSpaced(num_values, def_levels, rep_levels, valid_bits, valid_bits_offset, values), I get the following error when running [parquet-tools|https://github.com/apache/parquet-mr/tree/master/parquet-tools] on the output Parquet file:
(Screenshots attached to the issue: image-2020-10-30-17-45-37-423.png, image-2020-10-30-17-46-33-370.png)
Additionally, if I add another list to the data, the last element of that list is written incorrectly: it takes the value of the first element of the first list. See below.
*CASE 2*
Data (4 lists):
["one"]
null
["two"]
["three", "four"]
Parameters to TypedColumnWriter<PhysicalType<parquet::Type::BYTE_ARRAY>>::WriteBatchSpaced:
# num_values: 5
# def_levels: [3, 0, 3, 3, 3]
# rep_levels: [0, 0, 0, 0, 1]
# valid_bits: 0x1D (decimal 29, bit representation 11101)
# valid_bits_offset: 0
# values: ["one", nullptr, "two", "three", "four"]
Output Parquet file:
(Screenshots attached to the issue: image-2020-10-30-17-47-31-022.png, image-2020-10-30-17-48-11-872.png)
Here we see that the "four" in the last list actually shows up as "one".
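To make the expected result of case 2 concrete, here is a minimal sketch of how the levels map back to lists. It assumes the usual Dremel-style interpretation for this schema (maximum definition level 3) and takes only the non-null values in write order, not the spaced array passed to WriteBatchSpaced. `AssembleRows`, `Item`, and `Row` are hypothetical names, and this is not parquet-cpp's reader.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

using Item = std::optional<std::string>;
using Row = std::optional<std::vector<Item>>;

// Hypothetical helper (not a parquet-cpp API): rebuilds rows from levels and
// the non-null values. Assumes well-formed levels (first rep level is 0).
std::vector<Row> AssembleRows(const std::vector<int16_t>& def_levels,
                              const std::vector<int16_t>& rep_levels,
                              const std::vector<std::string>& values) {
  std::vector<Row> rows;
  std::size_t next_value = 0;
  for (std::size_t i = 0; i < def_levels.size(); ++i) {
    if (rep_levels[i] == 0) {
      // Repetition level 0 starts a new row; definition level 0 means the
      // list itself is NULL.
      if (def_levels[i] == 0) {
        rows.push_back(std::nullopt);
      } else {
        rows.push_back(std::vector<Item>{});
      }
    }
    if (def_levels[i] == 3) {
      // A present item consumes the next non-null value.
      rows.back()->push_back(values.at(next_value++));
    } else if (def_levels[i] == 2) {
      // The element slot exists but the item is null.
      rows.back()->push_back(std::nullopt);
    }
    // Definition level 1 is an empty list: nothing to append.
  }
  return rows;
}
```

Under these assumptions, the case 2 levels with the values "one", "two", "three", "four" reassemble to the four input lists, so the final element should read back as "four".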
> WriteBatchSpaced writes incorrect value for parquet when input contains NULL list
> ---------------------------------------------------------------------------------
>
> Key: PARQUET-1936
> URL: https://issues.apache.org/jira/browse/PARQUET-1936
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Reporter: Ruta Dhaneshwar
> Priority: Major
>