Posted to dev@parquet.apache.org by Keith Chapman <ke...@gmail.com> on 2019/06/07 16:19:10 UTC
[PARQUET-CPP] Missing statistics on String columns in 1.4.0
Hi,
I am using Parquet-CPP 1.4.0 to write a Parquet file that has a String
column and an Int column. I see statistics being written for the Int column
but none for the String column. Is this a known issue? If so, does a later
release (or master) support statistics on String columns?
Regards,
Keith.
http://keith-chapman.com
Re: [PARQUET-CPP] Missing statistics on String columns in 1.4.0
Posted by Keith Chapman <ke...@gmail.com>.
For the file created with the Java library (parquet-mr version 1.10.1), the
stats were written as below:
Schema:
message spark_schema {
  optional int32 d_year;
  optional int32 brand_id;
  optional binary brand (UTF8);
  optional double sum_agg;
}
Row group 0:  count: 1  297.00 B records  start: 4  total: 297 B
--------------------------------------------------------------------------------
          type    encodings  count  avg size  nulls  min / max
d_year    INT32   S   _      1      57.00 B   0      1998 / 1998
brand_id  INT32   S   _      1      57.00 B   0      1001002 / 1001002
brand     BINARY  S   _      1      106.00 B  0      "amalgamalg #2" / "amalgamalg #2"
sum_agg   DOUBLE  S   _      1      77.00 B   0      12810.490000 / 12810.490000
And using parquet-cpp, the stats were:
Schema:
message schema {
  optional int32 d_year;
  optional int32 brand_id;
  optional binary brand (UTF8);
  optional double sum_agg;
}
Row group 0:  count: 1  299.00 B records  start: 4  total: 299 B
--------------------------------------------------------------------------------
          type    encodings  count  avg size  nulls  min / max
d_year    INT32   S RBR_     1      64.00 B   0      1998 / 1998
brand_id  INT32   S RBR_     1      64.00 B   0      1001002 / 1001002
brand     BINARY  S RBR_     1      95.00 B   0
sum_agg   DOUBLE  S RBR_     1      76.00 B   0      12810.490000 / 12810.490000
So the strings are not that long: the only value, "amalgamalg #2", is 13
characters, well under 4096.
Regards,
Keith.
http://keith-chapman.com
On Fri, Jun 7, 2019 at 10:01 AM Deepak Majeti <ma...@gmail.com>
wrote:
> Hi Keith,
>
> What is the length of the min/max string value? The stats by default won't
> be written if the length is greater than 4096.
Re: [PARQUET-CPP] Missing statistics on String columns in 1.4.0
Posted by Deepak Majeti <ma...@gmail.com>.
Hi Keith,
What is the length of the min/max string value? By default, the stats are
not written if the length is greater than 4096.
--
regards,
Deepak Majeti
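
To make the default described in the reply above concrete, here is a minimal
Python sketch of a size-limited min/max rule. It is only an illustration, not
parquet-cpp's actual code: the function name min_max_if_small and the constant
MAX_STATS_SIZE are hypothetical, and applying the limit to each of min and max
separately, measured in bytes, is an assumption.

```python
# Hypothetical sketch of the size rule described in the thread.
# This is NOT parquet-cpp's implementation; names and details are assumptions.
from typing import Iterable, Optional, Tuple

MAX_STATS_SIZE = 4096  # the default limit mentioned in the reply (assumed to be bytes)


def min_max_if_small(
    values: Iterable[Optional[bytes]],
    max_size: int = MAX_STATS_SIZE,
) -> Optional[Tuple[bytes, bytes]]:
    """Return (min, max) of the non-null values, or None if stats would be skipped."""
    present = [v for v in values if v is not None]
    if not present:
        return None  # no non-null values, so there is no min/max to report
    # Python compares bytes objects lexicographically by unsigned byte value.
    lo, hi = min(present), max(present)
    if len(lo) > max_size or len(hi) > max_size:
        return None  # a value longer than the limit: min/max are omitted
    return lo, hi


print(min_max_if_small([b"amalgamalg #2"]))  # short value: min/max are kept
print(min_max_if_small([b"x" * 5000]))       # over the limit: None
```

Under this rule the 13-byte value from the files above would keep its min/max,
so a length limit alone would not explain the empty min / max column for
brand in the parquet-cpp output.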