Posted to dev@parquet.apache.org by Onur Soyer <on...@gmail.com> on 2015/10/01 09:55:38 UTC

Possibly missing feature in parquet-tools

Hello,


While playing with "parquet-tools", I found that column statistics are
not printed when the following command is executed:

$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar schema --detailed perf.1000.parquet


The output for a row group looks like this:

=====================================================================================================================

row group 1: RC:747388 TS:134218473 OFFSET:4
--------------------------------------------------------------------------------
cust_key:  INT64 UNCOMPRESSED DO:0 FPO:4 SZ:5979444/5979444/1.00 VC:747388 ENC:PLAIN,RLE,BIT_PACKED
name:  BINARY UNCOMPRESSED DO:0 FPO:5979448 SZ:16443766/16443766/1.00 VC:747388 ENC:PLAIN,RLE,BIT_PACKED
address:  BINARY UNCOMPRESSED DO:0 FPO:22423214 SZ:21716568/21716568/1.00 VC:747388 ENC:PLAIN,RLE,BIT_PACKED
nation_key:  INT32 UNCOMPRESSED DO:0 FPO:44139782 SZ:2989697/2989697/1.00 VC:747388 ENC:PLAIN,RLE,BIT_PACKED
phone:  BINARY UNCOMPRESSED DO:0 FPO:47129479 SZ:14201364/14201364/1.00 VC:747388 ENC:PLAIN,RLE,BIT_PACKED
acctbal:  DOUBLE UNCOMPRESSED DO:0 FPO:61330843 SZ:5979444/5979444/1.00 VC:747388 ENC:PLAIN,RLE,BIT_PACKED
mktsegment:  BINARY UNCOMPRESSED DO:0 FPO:67310287 SZ:9714675/9714675/1.00 VC:747388 ENC:PLAIN,RLE,BIT_PACKED
comment_col:  BINARY UNCOMPRESSED DO:0 FPO:77024962 SZ:57193515/57193515/1.00 VC:747388 ENC:PLAIN,RLE,BIT_PACKED

=====================================================================================================================



I then dug into the code and found that the function "private static void
showDetails(PrettyPrintWriter out, ColumnChunkMetaData meta, boolean name)"
in MetaDataUtils.java is responsible for printing the output above. I
inserted a few lines into "showDetails(...)" to print the MIN and MAX
values of the Statistics object, which can be retrieved from the
ColumnChunkMetaData passed as an argument to the function.
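
To illustrate the idea, here is a minimal, self-contained sketch and not
the actual patch. The Stats class is a hypothetical stand-in for the
statistics object (it is not the parquet-mr Statistics class), and
detailTail is an illustrative name. It only shows MIN and MAX being
appended between VC and ENC when statistics are present.

```java
final class StatsDetailSketch {

    /** Hypothetical stand-in for a column chunk's min/max statistics. */
    static final class Stats {
        final Object min;
        final Object max;

        Stats(Object min, Object max) {
            this.min = min;
            this.max = max;
        }

        /** Treats a missing bound as "no usable statistics". */
        boolean isEmpty() {
            return min == null || max == null;
        }
    }

    /** Builds the tail of a detail line: VC, then MIN/MAX if available, then ENC. */
    static String detailTail(long valueCount, Stats stats, String encodings) {
        StringBuilder sb = new StringBuilder();
        sb.append("VC:").append(valueCount);
        // Skip the statistics fields entirely when none were recorded.
        if (stats != null && !stats.isEmpty()) {
            sb.append(" MIN: ").append(stats.min);
            sb.append(" MAX: ").append(stats.max);
        }
        sb.append(" ENC:").append(encodings);
        return sb.toString();
    }

    public static void main(String[] args) {
        // With statistics present.
        System.out.println(detailTail(747388L, new Stats(0, 24), "PLAIN,BIT_PACKED,RLE"));
        // Without statistics, the line keeps its original shape.
        System.out.println(detailTail(747388L, null, "PLAIN,BIT_PACKED,RLE"));
    }
}
```

Guarding on missing statistics keeps the output unchanged for files that
were written without them.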


After the modification, the output looks like this:

=====================================================================================================================

row group 1: RC:747388 TS:134218473 OFFSET:4
--------------------------------------------------------------------------------
cust_key:  INT64 UNCOMPRESSED DO:0 FPO:4 SZ:5979444/5979444/1.00
VC:747388 MIN: 0 MAX: 747387 ENC:PLAIN,BIT_PACKED,RLE
name:  BINARY UNCOMPRESSED DO:0 FPO:5979448 SZ:16443766/16443766/1.00
VC:747388 MIN: Binary{18 bytes, [67, 117, 115, 116, 111, 109, 101,
114, 35, 48, 48, 48, 48, 48, 48, 48, 48, 49]} MAX: Binary{18 bytes,
[67, 117, 115, 116, 111, 109, 101, 114, 35, 48, 48, 48, 49, 53, 48,
48, 48, 48]} ENC:PLAIN,BIT_PACKED,RLE
address:  BINARY UNCOMPRESSED DO:0 FPO:22423214
SZ:21716568/21716568/1.00 VC:747388 MIN: Binary{13 bytes, [32, 32, 32,
50, 117, 90, 119, 86, 104, 81, 118, 119, 65]} MAX: Binary{23 bytes,
[122, 122, 120, 71, 107, 116, 122, 88, 84, 77, 75, 83, 49, 66, 120,
90, 108, 103, 81, 57, 110, 113, 81]} ENC:PLAIN,BIT_PACKED,RLE
nation_key:  INT32 UNCOMPRESSED DO:0 FPO:44139782
SZ:2989697/2989697/1.00 VC:747388 MIN: 0 MAX: 24
ENC:PLAIN,BIT_PACKED,RLE
phone:  BINARY UNCOMPRESSED DO:0 FPO:47129479
SZ:14201364/14201364/1.00 VC:747388 MIN: Binary{15 bytes, [49, 48, 45,
49, 48, 48, 45, 49, 48, 54, 45, 49, 54, 49, 55]} MAX: Binary{15 bytes,
[51, 52, 45, 57, 57, 57, 45, 54, 49, 56, 45, 54, 56, 56, 49]}
ENC:PLAIN,BIT_PACKED,RLE
acctbal:  DOUBLE UNCOMPRESSED DO:0 FPO:61330843
SZ:5979444/5979444/1.00 VC:747388 MIN: -999.99 MAX: 9999.99
ENC:PLAIN,BIT_PACKED,RLE
mktsegment:  BINARY UNCOMPRESSED DO:0 FPO:67310287
SZ:9714675/9714675/1.00 VC:747388 MIN: Binary{10 bytes, [65, 85, 84,
79, 77, 79, 66, 73, 76, 69]} MAX: Binary{9 bytes, [77, 65, 67, 72, 73,
78, 69, 82, 89]} ENC:PLAIN,BIT_PACKED,RLE
comment_col:  BINARY UNCOMPRESSED DO:0 FPO:77024962
SZ:57193515/57193515/1.00 VC:747388 MIN: Binary{116 bytes, [32, 84,
105, 114, 101, 115, 105, 97, 115, 32, 97, 99, 99, 111, 114, 100, 105,
110, 103, 32, 116, 111, 32, 116, 104, 101, 32, 115, 108, 121, 108,
121, 32, 98, 108, 105, 116, 104, 101, 32, 105, 110, 115, 116, 114,
117, 99, 116, 105, 111, 110, 115, 32, 100, 101, 116, 101, 99, 116, 32,
113, 117, 105, 99, 107, 108, 121, 32, 97, 116, 32, 116, 104, 101, 32,
115, 108, 121, 108, 121, 32, 101, 120, 112, 114, 101, 115, 115, 32,
99, 111, 117, 114, 116, 115, 46, 32, 101, 120, 112, 114, 101, 115,
115, 32, 100, 105, 110, 111, 115, 32, 119, 97, 107, 101, 32]} MAX:
Binary{41 bytes, [122, 122, 108, 101, 46, 32, 98, 108, 105, 116, 104,
101, 108, 121, 32, 114, 101, 103, 117, 108, 97, 114, 32, 105, 110,
115, 116, 114, 117, 99, 116, 105, 111, 110, 115, 32, 99, 97, 106, 111,
108]} ENC:PLAIN,BIT_PACKED,RLE

=====================================================================================================================
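
The BINARY min/max values above are printed as raw byte lists, which is
hard to read. As a rough sketch, assuming these columns hold UTF-8 text
(a BINARY column may hold arbitrary bytes, so this is not safe in
general), the bytes could be decoded for display. The class and method
names below are illustrative only and not part of parquet-tools.

```java
import java.nio.charset.StandardCharsets;

final class BinaryStatText {

    /** Decodes the byte values shown in a Binary{...} dump as UTF-8 text. */
    static String decode(int... values) {
        byte[] bytes = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            bytes[i] = (byte) values[i];
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // MIN and MAX of the mktsegment column from the output above.
        System.out.println(decode(65, 85, 84, 79, 77, 79, 66, 73, 76, 69)); // AUTOMOBILE
        System.out.println(decode(77, 65, 67, 72, 73, 78, 69, 82, 89));     // MACHINERY
    }
}
```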


Was this feature left out intentionally?


Regards,
Onur Soyer

Re: Possibly missing feature in parquet-tools

Posted by Ryan Blue <bl...@cloudera.com>.
Hi Onur,

I don't think leaving out page and column chunk stats was intentional;
it just wasn't something we've needed to add yet. If you'd like to get
your changes in, we'd be happy to help. Thanks!

rb



-- 
Ryan Blue
Software Engineer
Cloudera, Inc.