Posted to user@spark.apache.org by Stephen Joung <st...@vcnc.co.kr> on 2018/01/24 01:30:32 UTC
write parquet with statistics min max with binary field
Hi, I am trying to use Spark SQL filter pushdown, and specifically want to
use row-group skipping with Parquet files.

And I guessed that I need Parquet files with min/max statistics.
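(For context, the skipping I'm after is just a range check against the footer
statistics: a row group whose [min, max] range cannot contain the predicate
value never needs to be decoded. A minimal sketch of the idea -- hypothetical
names, plain Java, not Spark's or parquet-mr's actual code:

```java
public class RowGroupSkip {
    // A row group whose [min, max] range cannot contain the wanted
    // value can be skipped without decoding any of its pages.
    static boolean canSkip(String min, String max, String wanted) {
        return wanted.compareTo(min) < 0 || wanted.compareTo(max) > 0;
    }

    public static void main(String[] args) {
        // Row group holding "a".."c", predicate field1 = 'x': skippable.
        System.out.println(canSkip("a", "c", "x")); // true
        // Predicate field1 = 'b' falls inside the range: must read it.
        System.out.println(canSkip("a", "c", "b")); // false
    }
}
```

Which is why the missing min/max statistics below defeat the optimization
entirely.)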
----
On the Spark master branch, I tried to write a single column with "a", "b", "c"
to a Parquet file f1:
scala> List("a", "b", "c").toDF("field1").coalesce(1).write.parquet("f1")
But the saved file does not have statistics (min, max):
$ ls f1/*.parquet
f1/part-00000-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet
$ parquet-tools meta f1/*.parquet
file: file:/Users/stephen/p/spark/f1/part-00000-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet
creator: parquet-mr version 1.8.2 (build c6522788629e590a53eb79874b95f6c3ff11f16c)
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"field1","type":"string","nullable":true,"metadata":{}}]}
file schema: spark_schema
--------------------------------------------------------------------------------
field1: OPTIONAL BINARY O:UTF8 R:0 D:1
row group 1: RC:3 TS:48 OFFSET:4
--------------------------------------------------------------------------------
field1: BINARY SNAPPY DO:0 FPO:4 SZ:50/48/0.96 VC:3 ENC:BIT_PACKED,RLE,PLAIN ST:[no stats for this column]
----
Any pointer or comment would be appreciated.
Thank you.
Re: write parquet with statistics min max with binary field
Posted by Stephen Joung <st...@vcnc.co.kr>.
After setting `parquet.strings.signed-min-max.enabled` to `true` in
`ShowMetaCommand.java`, `parquet-tools meta` shows min/max:
@@ -57,8 +57,9 @@ public class ShowMetaCommand extends ArgsOnlyCommand {
     String[] args = options.getArgs();
     String input = args[0];
     Configuration conf = new Configuration();
+    conf.set("parquet.strings.signed-min-max.enabled", "true");
     Path inputPath = new Path(input);
     FileStatus inputFileStatus = inputPath.getFileSystem(conf).getFileStatus(inputPath);
     List<Footer> footers = ParquetFileReader.readFooters(conf, inputFileStatus, false);
Result:
row group 1: RC:3 TS:56 OFFSET:4
--------------------------------------------------------------------------------
field1: BINARY SNAPPY DO:0 FPO:4 SZ:56/56/1.00 VC:3 ENC:DELTA_BYTE_ARRAY ST:[min: a, max: c, num_nulls: 0]
For reference, this is the intended behavior introduced by PARQUET-686 [1].
[1] https://www.mail-archive.com/commits@parquet.apache.org/msg00491.html
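(For anyone curious why the statistics were withheld for binary columns in the
first place: older parquet-mr computed binary min/max with signed byte
comparison, which disagrees with the unsigned byte order that UTF-8 strings
actually sort in as soon as a byte >= 0x80 appears, so the stored min/max
could be wrong and skipping on them unsafe. A small self-contained
illustration -- plain Java, my own comparator names, not parquet code:

```java
import java.nio.charset.StandardCharsets;

public class SignedVsUnsigned {
    // Lexicographic compare treating bytes as signed (Java's default view).
    static int signedCompare(byte[] x, byte[] y) {
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            if (x[i] != y[i]) return Byte.compare(x[i], y[i]);
        }
        return Integer.compare(x.length, y.length);
    }

    // Lexicographic compare treating bytes as unsigned -- the order
    // UTF-8 encoded strings actually sort in.
    static int unsignedCompare(byte[] x, byte[] y) {
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(x[i] & 0xFF, y[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return Integer.compare(x.length, y.length);
    }

    public static void main(String[] args) {
        byte[] a = "a".getBytes(StandardCharsets.UTF_8);  // 0x61
        byte[] e = "é".getBytes(StandardCharsets.UTF_8);  // 0xC3 0xA9
        // Unsigned (correct string order): "a" < "é"
        System.out.println(unsignedCompare(a, e) < 0); // true
        // Signed: 0xC3 reads as -61, so "a" > "é" -- min/max flip
        System.out.println(signedCompare(a, e) > 0);   // true
    }
}
```

Hence the opt-in flag until readers could handle the sort order correctly.)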
2018-01-24 10:31 GMT+09:00 Stephen Joung <st...@vcnc.co.kr>:
> How can I write a Parquet file with min/max statistics?
Re: write parquet with statistics min max with binary field
Posted by Stephen Joung <st...@vcnc.co.kr>.
How can I write a Parquet file with min/max statistics?