Posted to user@spark.apache.org by Jianshi Huang <ji...@gmail.com> on 2014/11/26 09:36:22 UTC

Auto BroadcastJoin optimization failed in latest Spark

Hi,

I've confirmed that the latest Spark, built with either Hive 0.12 or
0.13.1, fails to apply the auto broadcast join optimization to my query.
The query joins a huge fact table with 15 tiny dimension tables.
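
For reference, the query has roughly this star-join shape (a sketch only;
the table and column names here are made up for illustration):

  // Issued through HiveContext; sc is an existing SparkContext.
  import org.apache.spark.sql.hive.HiveContext

  val hc = new HiveContext(sc)
  hc.sql("""
    SELECT f.amount, d1.name, d2.name
    FROM fact_table f
    JOIN dim1 d1 ON f.d1_code = d1.code
    JOIN dim2 d2 ON f.d2_code = d2.code
  """)
  // ...and so on for the remaining dimension tables. With working
  // statistics, the physical plan should choose BroadcastHashJoin
  // for each tiny dimension table.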

I'm currently using an older version of Spark which was built on Oct. 12.

Has anyone else run into a similar situation?

-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/

Re: Auto BroadcastJoin optimization failed in latest Spark

Posted by Jianshi Huang <ji...@gmail.com>.
Following Liancheng's suggestion, I've tried setting

 spark.sql.hive.convertMetastoreParquet  false

but ANALYZE TABLE ... COMPUTE STATISTICS NOSCAN still returns -1 for rawDataSize.
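
For completeness, this is roughly how the setting is applied and the stats
re-checked (a minimal sketch, assuming everything goes through
HiveContext.sql; dim_table is a placeholder name):

  import org.apache.spark.sql.hive.HiveContext

  val hc = new HiveContext(sc)
  hc.sql("SET spark.sql.hive.convertMetastoreParquet=false")
  hc.sql("ANALYZE TABLE dim_table COMPUTE STATISTICS NOSCAN")
  // The table parameters still come back with rawDataSize=-1:
  hc.sql("DESC EXTENDED dim_table").collect().foreach(println)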

Jianshi


On Fri, Dec 5, 2014 at 3:33 PM, Jianshi Huang <ji...@gmail.com>
wrote:

> If I run ANALYZE without NOSCAN, then Hive can successfully get the size:
>
> parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417764589,
> COLUMN_STATS_ACCURATE=true, totalSize=0, numRows=1156, rawDataSize=76296}
>
> Is Hive's PARQUET support broken?
>
> Jianshi
>
>
> On Fri, Dec 5, 2014 at 3:30 PM, Jianshi Huang <ji...@gmail.com>
> wrote:
>
>> Sorry for the late follow-up.
>>
>> I used Hao's DESC EXTENDED command and found a clue:
>>
>> new (broadcast broken Spark build):
>> parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892,
>> COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}
>>
>> old (broadcast working Spark build):
>> parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591,
>> totalSize=56166}
>>
>> Looks like the table size computation failed in the latest version.
>>
>> I've run the analyze command:
>>
>>   ANALYZE TABLE $table COMPUTE STATISTICS noscan
>>
>> And the tables are created from Parquet files:
>>
>> e.g.
>> CREATE EXTERNAL TABLE table1 (
>>   code int,
>>   desc string
>> )
>> STORED AS PARQUET
>> LOCATION '/user/jianshuang/data/dim_tables/table1.parquet'
>>
>>
>> Does anyone know what went wrong?
>>
>>
>> Thanks,
>> Jianshi
>>
>>
>>
>> On Fri, Nov 28, 2014 at 1:24 PM, Cheng, Hao <ha...@intel.com> wrote:
>>
>>>  Hi Jianshi,
>>>
>>> I couldn't reproduce that with the latest master; in my testing I
>>> always get BroadcastHashJoin for managed tables (backed by .csv
>>> files). Are there any external tables in your case?
>>>
>>>
>>>
>>> In general, there are a couple of things you can try first (with
>>> HiveContext):
>>>
>>> 1) ANALYZE TABLE xxx COMPUTE STATISTICS NOSCAN; (apply that to all
>>> of the tables);
>>>
>>> 2) SET spark.sql.autoBroadcastJoinThreshold=xxx; (set the threshold
>>> to a larger value; it is 1024*1024*10 by default; just make sure the
>>> largest dimension table's size in bytes is below it);
>>>
>>> 3) Always put the main table (the biggest table) left-most among the
>>> inner joins;
>>>
>>>
>>>
>>> DESC EXTENDED tablename; -- this prints detailed table information,
>>> including the table size statistic (the field “totalSize”)
>>>
>>> EXPLAIN EXTENDED query; -- this prints the detailed physical plan.
>>>
>>>
>>>
>>> Let me know if you still have problems.
>>>
>>>
>>>
>>> Hao
>>>
>>>
>>>
>>> *From:* Jianshi Huang [mailto:jianshi.huang@gmail.com]
>>> *Sent:* Thursday, November 27, 2014 10:24 PM
>>> *To:* Cheng, Hao
>>> *Cc:* user
>>> *Subject:* Re: Auto BroadcastJoin optimization failed in latest Spark
>>>
>>>
>>>
>>> Hi Hao,
>>>
>>>
>>>
>>> I'm using inner joins, as broadcast join didn't work for left joins
>>> (thanks for the links to the latest improvements).
>>>
>>>
>>>
>>> And I'm using HiveContext; it worked in a previous build (10/12) when
>>> joining 15 dimension tables.
>>>
>>>
>>>
>>> Jianshi
>>>
>>>
>>>
>>> On Thu, Nov 27, 2014 at 8:35 AM, Cheng, Hao <ha...@intel.com> wrote:
>>>
>>> Are all of your join keys the same? I guess the join types are all
>>> left joins; https://github.com/apache/spark/pull/3362 is probably
>>> what you need.
>>>
>>>
>>>
>>> Also, SparkSQL doesn't currently support multiway joins (or multiway
>>> broadcast joins); https://github.com/apache/spark/pull/3270 should be
>>> another optimization for this.
>>>
>>>
>>>
>>>
>>>
>>> *From:* Jianshi Huang [mailto:jianshi.huang@gmail.com]
>>> *Sent:* Wednesday, November 26, 2014 4:36 PM
>>> *To:* user
>>> *Subject:* Auto BroadcastJoin optimization failed in latest Spark
>>>
>>>
>>>
>>> Hi,
>>>
>>>
>>>
>>> I've confirmed that the latest Spark, built with either Hive 0.12 or
>>> 0.13.1, fails to apply the auto broadcast join optimization to my
>>> query. The query joins a huge fact table with 15 tiny dimension tables.
>>>
>>>
>>>
>>> I'm currently using an older version of Spark which was built on Oct. 12.
>>>
>>>
>>>
>>> Has anyone else run into a similar situation?
>>>
>>>
>>>
>>> --
>>>
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>>
>>
>>
>>
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/

Re: Auto BroadcastJoin optimization failed in latest Spark

Posted by Jianshi Huang <ji...@gmail.com>.
If I run ANALYZE without NOSCAN, then Hive can successfully get the size:

parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417764589,
COLUMN_STATS_ACCURATE=true, totalSize=0, numRows=1156, rawDataSize=76296}
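
In other words, the only difference is the NOSCAN flag (a sketch, assuming
the commands go through HiveContext.sql — the full-scan variant may also be
run from the Hive CLI; dim_table stands in for each dimension table):

  import org.apache.spark.sql.hive.HiveContext

  val hc = new HiveContext(sc)
  // With NOSCAN, only file-level stats are gathered; on these external
  // Parquet tables numRows and rawDataSize stay at -1:
  hc.sql("ANALYZE TABLE dim_table COMPUTE STATISTICS NOSCAN")
  // A full scan fills in numRows and rawDataSize (though totalSize
  // still shows 0 above):
  hc.sql("ANALYZE TABLE dim_table COMPUTE STATISTICS")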

Is Hive's PARQUET support broken?

Jianshi


On Fri, Dec 5, 2014 at 3:30 PM, Jianshi Huang <ji...@gmail.com>
wrote:

> Sorry for the late follow-up.
>
> I used Hao's DESC EXTENDED command and found a clue:
>
> new (broadcast broken Spark build):
> parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892,
> COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}
>
> old (broadcast working Spark build):
> parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591,
> totalSize=56166}
>
> Looks like the table size computation failed in the latest version.
>
> I've run the analyze command:
>
>   ANALYZE TABLE $table COMPUTE STATISTICS noscan
>
> And the tables are created from Parquet files:
>
> e.g.
> CREATE EXTERNAL TABLE table1 (
>   code int,
>   desc string
> )
> STORED AS PARQUET
> LOCATION '/user/jianshuang/data/dim_tables/table1.parquet'
>
>
> Does anyone know what went wrong?
>
>
> Thanks,
> Jianshi
>
>
>
> On Fri, Nov 28, 2014 at 1:24 PM, Cheng, Hao <ha...@intel.com> wrote:
>
>>  Hi Jianshi,
>>
>> I couldn't reproduce that with the latest master; in my testing I
>> always get BroadcastHashJoin for managed tables (backed by .csv
>> files). Are there any external tables in your case?
>>
>>
>>
>> In general, there are a couple of things you can try first (with HiveContext):
>>
>> 1) ANALYZE TABLE xxx COMPUTE STATISTICS NOSCAN; (apply that to all
>> of the tables);
>>
>> 2) SET spark.sql.autoBroadcastJoinThreshold=xxx; (set the threshold
>> to a larger value; it is 1024*1024*10 by default; just make sure the
>> largest dimension table's size in bytes is below it);
>>
>> 3) Always put the main table (the biggest table) left-most among the
>> inner joins;
>>
>>
>>
>> DESC EXTENDED tablename; -- this prints detailed table information,
>> including the table size statistic (the field “totalSize”)
>>
>> EXPLAIN EXTENDED query; -- this prints the detailed physical plan.
>>
>>
>>
>> Let me know if you still have problems.
>>
>>
>>
>> Hao
>>
>>
>>
>> *From:* Jianshi Huang [mailto:jianshi.huang@gmail.com]
>> *Sent:* Thursday, November 27, 2014 10:24 PM
>> *To:* Cheng, Hao
>> *Cc:* user
>> *Subject:* Re: Auto BroadcastJoin optimization failed in latest Spark
>>
>>
>>
>> Hi Hao,
>>
>>
>>
>> I'm using inner joins, as broadcast join didn't work for left joins
>> (thanks for the links to the latest improvements).
>>
>>
>>
>> And I'm using HiveContext; it worked in a previous build (10/12) when
>> joining 15 dimension tables.
>>
>>
>>
>> Jianshi
>>
>>
>>
>> On Thu, Nov 27, 2014 at 8:35 AM, Cheng, Hao <ha...@intel.com> wrote:
>>
>> Are all of your join keys the same? I guess the join types are all
>> left joins; https://github.com/apache/spark/pull/3362 is probably
>> what you need.
>>
>>
>>
>> Also, SparkSQL doesn't currently support multiway joins (or multiway
>> broadcast joins); https://github.com/apache/spark/pull/3270 should be
>> another optimization for this.
>>
>>
>>
>>
>>
>> *From:* Jianshi Huang [mailto:jianshi.huang@gmail.com]
>> *Sent:* Wednesday, November 26, 2014 4:36 PM
>> *To:* user
>> *Subject:* Auto BroadcastJoin optimization failed in latest Spark
>>
>>
>>
>> Hi,
>>
>>
>>
>> I've confirmed that the latest Spark, built with either Hive 0.12 or
>> 0.13.1, fails to apply the auto broadcast join optimization to my
>> query. The query joins a huge fact table with 15 tiny dimension tables.
>>
>>
>>
>> I'm currently using an older version of Spark which was built on Oct. 12.
>>
>>
>>
>> Has anyone else run into a similar situation?
>>
>>
>>
>> --
>>
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>>
>>
>>
>>
>> --
>>
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/

Re: Auto BroadcastJoin optimization failed in latest Spark

Posted by Jianshi Huang <ji...@gmail.com>.
Sorry for the late follow-up.

I used Hao's DESC EXTENDED command and found a clue:

new (broadcast broken Spark build):
parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892,
COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}

old (broadcast working Spark build):
parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591,
totalSize=56166}

Looks like the table size computation failed in the latest version.

I've run the analyze command:

  ANALYZE TABLE $table COMPUTE STATISTICS noscan

And the tables are created from Parquet files:

e.g.
CREATE EXTERNAL TABLE table1 (
  code int,
  desc string
)
STORED AS PARQUET
LOCATION '/user/jianshuang/data/dim_tables/table1.parquet'
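
Put together, the sequence is roughly the following (a sketch, assuming
everything is issued through HiveContext.sql):

  import org.apache.spark.sql.hive.HiveContext

  val hc = new HiveContext(sc)
  hc.sql("""
    CREATE EXTERNAL TABLE table1 (code int, desc string)
    STORED AS PARQUET
    LOCATION '/user/jianshuang/data/dim_tables/table1.parquet'
  """)
  hc.sql("ANALYZE TABLE table1 COMPUTE STATISTICS NOSCAN")
  // In the new build this prints totalSize=0 and rawDataSize=-1:
  hc.sql("DESC EXTENDED table1").collect().foreach(println)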


Does anyone know what went wrong?


Thanks,
Jianshi



On Fri, Nov 28, 2014 at 1:24 PM, Cheng, Hao <ha...@intel.com> wrote:

>  Hi Jianshi,
>
> I couldn't reproduce that with the latest master; in my testing I
> always get BroadcastHashJoin for managed tables (backed by .csv
> files). Are there any external tables in your case?
>
>
>
> In general, there are a couple of things you can try first (with HiveContext):
>
> 1) ANALYZE TABLE xxx COMPUTE STATISTICS NOSCAN; (apply that to all
> of the tables);
>
> 2) SET spark.sql.autoBroadcastJoinThreshold=xxx; (set the threshold
> to a larger value; it is 1024*1024*10 by default; just make sure the
> largest dimension table's size in bytes is below it);
>
> 3) Always put the main table (the biggest table) left-most among the
> inner joins;
>
>
>
> DESC EXTENDED tablename; -- this prints detailed table information,
> including the table size statistic (the field “totalSize”)
>
> EXPLAIN EXTENDED query; -- this prints the detailed physical plan.
>
>
>
> Let me know if you still have problems.
>
>
>
> Hao
>
>
>
> *From:* Jianshi Huang [mailto:jianshi.huang@gmail.com]
> *Sent:* Thursday, November 27, 2014 10:24 PM
> *To:* Cheng, Hao
> *Cc:* user
> *Subject:* Re: Auto BroadcastJoin optimization failed in latest Spark
>
>
>
> Hi Hao,
>
>
>
> I'm using inner joins, as broadcast join didn't work for left joins
> (thanks for the links to the latest improvements).
>
>
>
> And I'm using HiveContext; it worked in a previous build (10/12) when
> joining 15 dimension tables.
>
>
>
> Jianshi
>
>
>
> On Thu, Nov 27, 2014 at 8:35 AM, Cheng, Hao <ha...@intel.com> wrote:
>
> Are all of your join keys the same? I guess the join types are all
> left joins; https://github.com/apache/spark/pull/3362 is probably
> what you need.
>
>
>
> Also, SparkSQL doesn't currently support multiway joins (or multiway
> broadcast joins); https://github.com/apache/spark/pull/3270 should be
> another optimization for this.
>
>
>
>
>
> *From:* Jianshi Huang [mailto:jianshi.huang@gmail.com]
> *Sent:* Wednesday, November 26, 2014 4:36 PM
> *To:* user
> *Subject:* Auto BroadcastJoin optimization failed in latest Spark
>
>
>
> Hi,
>
>
>
> I've confirmed that the latest Spark, built with either Hive 0.12 or
> 0.13.1, fails to apply the auto broadcast join optimization to my
> query. The query joins a huge fact table with 15 tiny dimension tables.
>
>
>
> I'm currently using an older version of Spark which was built on Oct. 12.
>
>
>
> Has anyone else run into a similar situation?
>
>
>
> --
>
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>
>
>
>
>
> --
>
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/

RE: Auto BroadcastJoin optimization failed in latest Spark

Posted by "Cheng, Hao" <ha...@intel.com>.
Hi Jianshi,
I couldn't reproduce that with the latest master; in my testing I always get BroadcastHashJoin for managed tables (backed by .csv files). Are there any external tables in your case?

In general, there are a couple of things you can try first (with HiveContext); a sketch putting them together follows the list:

1) ANALYZE TABLE xxx COMPUTE STATISTICS NOSCAN; (apply that to all of the tables);

2) SET spark.sql.autoBroadcastJoinThreshold=xxx; (set the threshold to a larger value; it is 1024*1024*10 by default; just make sure the largest dimension table's size in bytes is below it);

3) Always put the main table (the biggest table) left-most among the inner joins;

DESC EXTENDED tablename; -- this prints detailed table information, including the table size statistic (the field “totalSize”)
EXPLAIN EXTENDED query; -- this prints the detailed physical plan.
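
A minimal sketch putting 1)-3) together (table names, join keys, and the
threshold value are placeholders; adjust them to your data):

  import org.apache.spark.sql.hive.HiveContext

  val hc = new HiveContext(sc)

  // 1) Collect statistics for every table involved in the join:
  Seq("fact", "dim1", "dim2").foreach { t =>
    hc.sql(s"ANALYZE TABLE $t COMPUTE STATISTICS NOSCAN")
  }

  // 2) Raise the broadcast threshold above the largest dimension
  //    table size in bytes (the default is 1024*1024*10):
  hc.sql(s"SET spark.sql.autoBroadcastJoinThreshold=${50 * 1024 * 1024}")

  // 3) Keep the biggest table left-most among the inner joins, then
  //    check the physical plan for BroadcastHashJoin:
  hc.sql("""
    EXPLAIN EXTENDED
    SELECT * FROM fact f
    JOIN dim1 d1 ON f.k1 = d1.k
    JOIN dim2 d2 ON f.k2 = d2.k
  """).collect().foreach(println)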

Let me know if you still have problems.

Hao

From: Jianshi Huang [mailto:jianshi.huang@gmail.com]
Sent: Thursday, November 27, 2014 10:24 PM
To: Cheng, Hao
Cc: user
Subject: Re: Auto BroadcastJoin optimization failed in latest Spark

Hi Hao,

I'm using inner joins, as broadcast join didn't work for left joins (thanks for the links to the latest improvements).

And I'm using HiveContext; it worked in a previous build (10/12) when joining 15 dimension tables.

Jianshi

On Thu, Nov 27, 2014 at 8:35 AM, Cheng, Hao <ha...@intel.com> wrote:
Are all of your join keys the same? I guess the join types are all left joins; https://github.com/apache/spark/pull/3362 is probably what you need.

Also, SparkSQL doesn't currently support multiway joins (or multiway broadcast joins); https://github.com/apache/spark/pull/3270 should be another optimization for this.


From: Jianshi Huang [mailto:jianshi.huang@gmail.com]
Sent: Wednesday, November 26, 2014 4:36 PM
To: user
Subject: Auto BroadcastJoin optimization failed in latest Spark

Hi,

I've confirmed that the latest Spark, built with either Hive 0.12 or 0.13.1, fails to apply the auto broadcast join optimization to my query. The query joins a huge fact table with 15 tiny dimension tables.

I'm currently using an older version of Spark which was built on Oct. 12.

Has anyone else run into a similar situation?

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/



--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/

Re: Auto BroadcastJoin optimization failed in latest Spark

Posted by Jianshi Huang <ji...@gmail.com>.
Hi Hao,

I'm using inner joins, as broadcast join didn't work for left joins
(thanks for the links to the latest improvements).

And I'm using HiveContext; it worked in a previous build (10/12) when
joining 15 dimension tables.

Jianshi

On Thu, Nov 27, 2014 at 8:35 AM, Cheng, Hao <ha...@intel.com> wrote:

> Are all of your join keys the same? I guess the join types are all
> left joins; https://github.com/apache/spark/pull/3362 is probably
> what you need.
>
>
>
> Also, SparkSQL doesn't currently support multiway joins (or multiway
> broadcast joins); https://github.com/apache/spark/pull/3270 should be
> another optimization for this.
>
>
>
>
>
> *From:* Jianshi Huang [mailto:jianshi.huang@gmail.com]
> *Sent:* Wednesday, November 26, 2014 4:36 PM
> *To:* user
> *Subject:* Auto BroadcastJoin optimization failed in latest Spark
>
>
>
> Hi,
>
>
>
> I've confirmed that the latest Spark, built with either Hive 0.12 or
> 0.13.1, fails to apply the auto broadcast join optimization to my
> query. The query joins a huge fact table with 15 tiny dimension tables.
>
>
>
> I'm currently using an older version of Spark which was built on Oct. 12.
>
>
>
> Has anyone else run into a similar situation?
>
>
>
> --
>
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/

RE: Auto BroadcastJoin optimization failed in latest Spark

Posted by "Cheng, Hao" <ha...@intel.com>.
Are all of your join keys the same? I guess the join types are all left joins; https://github.com/apache/spark/pull/3362 is probably what you need.

Also, SparkSQL doesn't currently support multiway joins (or multiway broadcast joins); https://github.com/apache/spark/pull/3270 should be another optimization for this.


From: Jianshi Huang [mailto:jianshi.huang@gmail.com]
Sent: Wednesday, November 26, 2014 4:36 PM
To: user
Subject: Auto BroadcastJoin optimization failed in latest Spark

Hi,

I've confirmed that the latest Spark, built with either Hive 0.12 or 0.13.1, fails to apply the auto broadcast join optimization to my query. The query joins a huge fact table with 15 tiny dimension tables.

I'm currently using an older version of Spark which was built on Oct. 12.

Has anyone else run into a similar situation?

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/