Posted to dev@parquet.apache.org by Keith Chapman <ke...@gmail.com> on 2019/06/07 16:19:10 UTC

[PARQUET-CPP] Missing statistics on String columns in 1.4.0

Hi,

I was using Parquet-CPP 1.4.0 to write a Parquet file that has a String
and an Int column. I see statistics being written for the Int column but do
not see any statistics written for the String column. Is this a known issue?
If so, does a later release (or master) have this feature (statistics on
String columns)?

Regards,
Keith.

http://keith-chapman.com

Re: [PARQUET-CPP] Missing statistics on String columns in 1.4.0

Posted by Keith Chapman <ke...@gmail.com>.
For the file created with the Java library (parquet-mr version 1.10.1), the
stats were written as follows:

Schema:
message spark_schema {
  optional int32 d_year;
  optional int32 brand_id;
  optional binary brand (UTF8);
  optional double sum_agg;
}


Row group 0:  count: 1  297.00 B records  start: 4  total: 297 B
--------------------------------------------------------------------------------
          type      encodings count     avg size   nulls   min / max
d_year    INT32     S   _     1         57.00 B    0       1998 / 1998
brand_id  INT32     S   _     1         57.00 B    0       1001002 / 1001002
brand     BINARY    S   _     1         106.00 B   0       "amalgamalg #2" / "amalgamalg #2"
sum_agg   DOUBLE    S   _     1         77.00 B    0       12810.490000 / 12810.490000

And using parquet-cpp, the stats were:

Schema:
message schema {
  optional int32 d_year;
  optional int32 brand_id;
  optional binary brand (UTF8);
  optional double sum_agg;
}


Row group 0:  count: 1  299.00 B records  start: 4  total: 299 B
--------------------------------------------------------------------------------
          type      encodings count     avg size   nulls   min / max
d_year    INT32     S RBR_    1         64.00 B    0       1998 / 1998
brand_id  INT32     S RBR_    1         64.00 B    0       1001002 / 1001002
brand     BINARY    S RBR_    1         95.00 B    0
sum_agg   DOUBLE    S RBR_    1         76.00 B    0       12810.490000 / 12810.490000

So in essence the strings are not that long.

Regards,
Keith.

http://keith-chapman.com


On Fri, Jun 7, 2019 at 10:01 AM Deepak Majeti <ma...@gmail.com>
wrote:

> Hi Keith,
>
> What is the length of the min/max string value? The stats by default won't
> be written if the length is greater than 4096.
>
> On Fri, Jun 7, 2019 at 9:55 PM Keith Chapman <ke...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I was using Parquet-CPP 1.4.0 to write a Parquet file that has a String
>> and an Int column. I see statistics being written for the Int column but do
>> not see any statistics written for the String column. Is this a known issue?
>> If so, does a later release (or master) have this feature (statistics on
>> String columns)?
>>
>> Regards,
>> Keith.
>>
>> http://keith-chapman.com
>>
>
>
> --
> regards,
> Deepak Majeti
>
>

Re: [PARQUET-CPP] Missing statistics on String columns in 1.4.0

Posted by Deepak Majeti <ma...@gmail.com>.
Hi Keith,

What is the length of the min/max string value? The stats by default won't
be written if the length is greater than 4096.
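
To make the rule above concrete, here is a minimal C++ sketch of a size check of this kind. It is not parquet-cpp code: FitsStatisticsLimit and kDefaultMaxStatisticsSize are hypothetical names, and it assumes the limit is measured in bytes and applied to the min and max values individually. Only the 4096 default comes from this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Default limit mentioned in this thread; assumed here to be in bytes.
constexpr std::size_t kDefaultMaxStatisticsSize = 4096;

// Hypothetical helper, not part of parquet-cpp: returns true when both
// bounds fit within the limit, i.e. when min/max would still be written.
bool FitsStatisticsLimit(const std::string& min_value,
                         const std::string& max_value,
                         std::size_t max_size = kDefaultMaxStatisticsSize) {
  return min_value.size() <= max_size && max_value.size() <= max_size;
}

// Self-check using the string value shown in the parquet-mr dump.
void CheckThreadExample() {
  const std::string brand = "amalgamalg #2";  // 13 bytes
  assert(FitsStatisticsLimit(brand, brand));
  // A 5000-byte value would be over the default limit.
  assert(!FitsStatisticsLimit(brand, std::string(5000, 'x')));
}
```

Under these assumptions, a 13-byte value such as "amalgamalg #2" is far below the limit, so a size check alone would not explain missing min/max on a column holding values like that.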

On Fri, Jun 7, 2019 at 9:55 PM Keith Chapman <ke...@gmail.com>
wrote:

> Hi,
>
> I was using Parquet-CPP 1.4.0 to write a Parquet file that has a String
> and an Int column. I see statistics being written for the Int column but do
> not see any statistics written for the String column. Is this a known issue?
> If so, does a later release (or master) have this feature (statistics on
> String columns)?
>
> Regards,
> Keith.
>
> http://keith-chapman.com
>


-- 
regards,
Deepak Majeti