Posted to user@spark.apache.org by Rabin Banerjee <de...@gmail.com> on 2017/11/21 15:29:03 UTC

Parquet Filter pushdown not working and statistics are not generating for any column with Spark 1.6 CDH 5.7

Hi All ,


 I am using CDH 5.7, which ships with Spark version 1.6.0. I am saving my
data set as Parquet and then querying it. The query executes fine, but when
I checked the files generated by Spark, I found that the statistics (min/max)
are missing for all columns. As a result, filters are not pushed down and
the entire file is scanned.


(1 to 30000).map(i => (i, i.toString)).toDF("a",
"b").sort("a").write.parquet("/hdfs/path/to/store")
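
For what it's worth, this is a sketch of how I verify pushdown on the read side, assuming a Spark 1.6 shell where sqlContext is in scope (the path and the predicate are just illustrative):

```scala
// Sketch: check whether a Parquet filter is pushed down (Spark 1.6 shell).
// Assumes sqlContext is available, as in spark-shell.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true") // default is true in 1.6

val df = sqlContext.read.parquet("/hdfs/path/to/store")

// If pushdown is working, the physical plan should list the predicate
// under PushedFilters, e.g. PushedFilters: [GreaterThan(a,29000)].
df.filter(df("a") > 29000).explain()
```

Even with the filter pushed to the Parquet reader, row groups can only be skipped if the min/max statistics are present in the file metadata, which is exactly what seems to be missing here.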


parquet-tools meta
part-r-00186-03addad8-c19d-4812-b83b-a8708606183b.gz.parquet

creator:     parquet-mr version 1.5.0-cdh5.7.1 (build ${buildNumber})

extra:       org.apache.spark.sql.parquet.row.metadata =
{"type":"struct","fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}},{"name":"b","type":"string","nullable":true,"metadata":{}}]}



file schema: spark_schema

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

a:           OPTIONAL INT32 R:0 D:1

b:           OPTIONAL BINARY O:UTF8 R:0 D:1


row group 1: RC:148 TS:2012

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

a:            INT32 GZIP DO:0 FPO:4 SZ:297/635/2.14 VC:148
ENC:BIT_PACKED,PLAIN,RLE

b:            BINARY GZIP DO:0 FPO:301 SZ:301/1377/4.57 VC:148
ENC:BIT_PACKED,PLAIN,RLE


As you can see from the parquet-tools meta output, the statistics (ST)
field is missing from the column chunk metadata, and Spark is scanning all
data in all files.

Any suggestion ?


Thanks //

RB