Posted to user@spark.apache.org by Rabin Banerjee <de...@gmail.com> on 2017/11/21 15:29:03 UTC
Parquet Filter pushdown not working and statistics are not generating
for any column with Spark 1.6 CDH 5.7
Hi All,
I am using CDH 5.7, which ships with Spark 1.6.0. I save my data set as
Parquet and then query it. The query runs fine, but when I inspected the
files generated by Spark, I found that the statistics (min/max) are missing
for every column. As a result, filters are not pushed down and Spark scans
the entire file.
(1 to 30000).map(i => (i, i.toString)).toDF("a", "b").sort("a").write.parquet("/hdfs/path/to/store")
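(Side note, not from the original mail: in Spark 1.6, Parquet filter pushdown
is also gated by a SQL option, which already defaults to true, so the missing
file statistics rather than this flag are the likely culprit here. Shown only
for completeness:)

# spark-defaults.conf (default is already true in Spark 1.6)
spark.sql.parquet.filterPushdown  true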
parquet-tools meta part-r-00186-03addad8-c19d-4812-b83b-a8708606183b.gz.parquet
creator: parquet-mr version 1.5.0-cdh5.7.1 (build ${buildNumber})
extra: org.apache.spark.sql.parquet.row.metadata =
{"type":"struct","fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}},{"name":"b","type":"string","nullable":true,"metadata":{}}]}
file schema: spark_schema
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
a: OPTIONAL INT32 R:0 D:1
b: OPTIONAL BINARY O:UTF8 R:0 D:1
row group 1: RC:148 TS:2012
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
a: INT32 GZIP DO:0 FPO:4 SZ:297/635/2.14 VC:148
ENC:BIT_PACKED,PLAIN,RLE
b: BINARY GZIP DO:0 FPO:301 SZ:301/1377/4.57 VC:148
ENC:BIT_PACKED,PLAIN,RLE
As you can see from the parquet-tools output, the statistics (ST) field is
missing, so Spark ends up scanning all the data in every file.
Any suggestions?
Thanks //
RB