Posted to user@spark.apache.org by emlyn <Em...@microsoft.com> on 2018/09/06 09:56:40 UTC

Re: CBO not working for Parquet Files

rajat mishra wrote
> When I try to compute the statistics for a query where the partition column
> is in the where clause, the statistics returned contain only the sizeInBytes
> and not the row count.

We are having the same issue. Our data is in partitioned Parquet files and we
were hoping to try out the CBO, but haven’t been able to get it working: any
query with a where clause on the partition column(s) (which is the majority
of realistic queries) seems to lose or ignore the rowCount stats. We’ve
generated both overall table stats (ANALYZE TABLE db.table COMPUTE
STATISTICS;) and partition-level stats (ANALYZE TABLE db.table PARTITION
(col1, col2) COMPUTE STATISTICS;), and have verified that they are present
in the metastore.
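
For reference, the steps described above can be sketched in Spark SQL roughly
as follows. This is only an illustration under assumptions: db.table, col1 and
col2 are the placeholder names used above, the partition values 'a' and 'b'
are hypothetical, and EXPLAIN COST assumes Spark 2.3 or later.

```sql
-- Enable the cost-based optimizer for this session.
SET spark.sql.cbo.enabled=true;

-- Table-level statistics for the whole table.
ANALYZE TABLE db.table COMPUTE STATISTICS;

-- Partition-level statistics for every partition of (col1, col2).
ANALYZE TABLE db.table PARTITION (col1, col2) COMPUTE STATISTICS;

-- Check what was stored in the metastore
-- ('a' and 'b' are hypothetical partition values).
DESCRIBE EXTENDED db.table;
DESCRIBE EXTENDED db.table PARTITION (col1 = 'a', col2 = 'b');

-- Show the statistics the optimizer attaches to each plan node
-- for a query that filters on a partition column.
EXPLAIN COST SELECT * FROM db.table WHERE col1 = 'a';
```

With the problem described here, the Statistics(...) entry printed for the
filtered plan would show sizeInBytes but no rowCount.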
 
I’ve also found this ticket:
https://issues.apache.org/jira/browse/SPARK-25185, but it has had no
response so far.
 
I suspect we must be missing something, as it seems that partitioned parquet
files would be a common use case, and if this is a bug in Spark I would have
expected it to have been picked up sooner.
 
Has anybody managed to get the CBO working with partitioned Parquet files? Is
this a known issue?
 
Thanks,
Emlyn


