Posted to dev@spark.apache.org by Wenchen Fan <cl...@gmail.com> on 2018/10/02 02:37:03 UTC

Re: BroadcastJoin failed on partitioned parquet table

I'm not sure Spark 1.6 is still maintained. Can you try a 2.x Spark
version and see if the problem still exists?

On Sun, Sep 30, 2018 at 4:14 PM 白也诗无敌 <44...@qq.com> wrote:

> Besides, I have tried the ANALYZE statement. It does not help, because I
> need the size of the single partition, but Spark reads the total table size
> from Hive parameters such as 'totalSize' or 'rawDataSize'.
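(Editor's note: Hive itself can gather statistics per partition; whether a given Spark version accepts this syntax and uses the result for join planning varies. A minimal sketch against a Spark 1.6 HiveContext, with hypothetical table and partition names:)

```scala
// Sketch only: assumes a Spark 1.6 HiveContext bound to `sqlContext` and a
// hypothetical table `small_table` partitioned by `dt`. Even if Hive records
// per-partition stats, Spark 1.6's HiveMetastoreCatalog still reads only the
// table-level 'totalSize'/'rawDataSize' parameters, so this alone may not
// change the join strategy.
sqlContext.sql(
  "ANALYZE TABLE small_table PARTITION (dt='2018-09-30') COMPUTE STATISTICS")
```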
>
>
>
>
> Hi, guys:
>      I'm using Spark 1.6.2.
>      There are two tables, and the smaller one is a partitioned parquet table.
>      The total size of the small table is 1000M, but each partition is only 1M.
>      When I set spark.sql.autoBroadcastJoinThreshold to 50m and join the
> two tables on a single partition, I get a SortMergeJoin physical plan.
>      I have done some digging, and it has something to do with partition
> pruning:
>      1. Checking the physical plan, all of the partitions of the small
> table are included.
>       This looks like https://issues.apache.org/jira/browse/SPARK-16980
>
>      2. Set spark.sql.hive.convertMetastoreParquet=false.
>     The pruning succeeds, but I still get SortMergeJoin, because of this
> code in HiveMetastoreCatalog.scala:
>       @transient override lazy val statistics: Statistics = Statistics(
>         sizeInBytes = {
>           val totalSize = hiveQlTable.getParameters.get(StatsSetupConst.TOTAL_SIZE)
>           val rawDataSize = hiveQlTable.getParameters.get(StatsSetupConst.RAW_DATA_SIZE)
>
>     This returns the total size of the whole table, not the single partition.
>
>     How can I fix this without patching Spark? Or is there a backport of
> SPARK-16980 for Spark 1.6?
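(Editor's note: one workaround that sidesteps the size estimate entirely is the explicit broadcast hint, which is already available in Spark 1.6's DataFrame API. The table and column names below are hypothetical:)

```scala
import org.apache.spark.sql.functions.broadcast

// Sketch only: assumes a Spark 1.6 SQLContext/HiveContext bound to
// `sqlContext`, plus hypothetical tables `big_table` and `small_table`
// (partitioned by `dt`, joined on `id`). broadcast() marks the small side
// for a broadcast join regardless of the inflated table-level size estimate
// and the autoBroadcastJoinThreshold setting.
val big    = sqlContext.table("big_table")
val small  = sqlContext.table("small_table").where("dt = '2018-09-30'")
val joined = big.join(broadcast(small), big("id") === small("id"))
```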
>
>
> best regards!
> Jerry
>