Posted to dev@spark.apache.org by Jianshi Huang <ji...@gmail.com> on 2014/12/05 08:30:08 UTC

Re: Auto BroadcastJoin optimization failed in latest Spark

Sorry for the late follow-up.

I used Hao's DESC EXTENDED command and found some clues:

new (broadcast broken Spark build):
parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892,
COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}

old (broadcast working Spark build):
parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591,
totalSize=56166}

Looks like the table size computation failed in the latest version.
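
If my reading of the 1.x planner is right, the decision reduces to a size
check against the threshold. A schematic sketch (my understanding, not the
actual Spark code; the numbers come from the DESC outputs above):

  // Broadcast a side only when its size estimate is positive and below the
  // threshold; a totalSize of 0 falls back to a default above the threshold.
  def wouldBroadcast(totalSize: Long,
                     threshold: Long = 10L * 1024 * 1024): Boolean = {
    val estimate = if (totalSize > 0) totalSize else threshold + 1
    estimate <= threshold
  }

  wouldBroadcast(56166L)  // true: the old build's totalSize, so it broadcasts
  wouldBroadcast(0L)      // false: the new build's totalSize, so it doesn't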

I've run the analyze command:

  ANALYZE TABLE $table COMPUTE STATISTICS noscan

And the tables are created from Parquet files:

e.g.
CREATE EXTERNAL TABLE table1 (
  code int,
  desc string
)
STORED AS PARQUET
LOCATION '/user/jianshuang/data/dim_tables/table1.parquet'


Does anyone know what went wrong?


Thanks,
Jianshi



On Fri, Nov 28, 2014 at 1:24 PM, Cheng, Hao <ha...@intel.com> wrote:

>  Hi Jianshi,
>
> I couldn’t reproduce that with the latest MASTER, and I always get the
> BroadcastHashJoin for managed tables (in .csv files) in my testing. Are
> there any external tables in your case?
>
>
>
> In general, there are a couple of things you can try first (with HiveContext):
>
> 1)      ANALYZE TABLE xxx COMPUTE STATISTICS NOSCAN; (apply that to all
> of the tables);
>
> 2)      SET spark.sql.autoBroadcastJoinThreshold=xxx; (Set the threshold
> to a greater value; it is 1024*1024*10 bytes by default. Just make sure the
> largest dimension table's size in bytes is less than this.)
>
> 3)      Always put the main table (the biggest table) left-most among
> the inner joins;
>
>
>
> DESC EXTENDED tablename; -- this will print the detailed table statistics,
> including the table size (the field “totalSize”)
>
> EXPLAIN EXTENDED query; -- this will print the detailed physical plan.
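>
> Putting 1) and 2) together, a rough sketch in code (the table names here
> are placeholders, not yours; assumes an existing SparkContext sc):
>
>   import org.apache.spark.sql.hive.HiveContext
>   val hc = new HiveContext(sc)
>   hc.sql("SET spark.sql.autoBroadcastJoinThreshold=104857600")  // ~100 MB
>   hc.sql("ANALYZE TABLE dim1 COMPUTE STATISTICS NOSCAN")        // per table
>   // The plan should show BroadcastHashJoin rather than a shuffled join:
>   hc.sql("EXPLAIN EXTENDED SELECT * FROM fact f JOIN dim1 d ON f.k = d.k")
>     .collect().foreach(println)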
>
>
>
> Let me know if you still have problems.
>
>
>
> Hao
>
>
>
> *From:* Jianshi Huang [mailto:jianshi.huang@gmail.com]
> *Sent:* Thursday, November 27, 2014 10:24 PM
> *To:* Cheng, Hao
> *Cc:* user
> *Subject:* Re: Auto BroadcastJoin optimization failed in latest Spark
>
>
>
> Hi Hao,
>
>
>
> I'm using inner joins, as broadcast join didn't work for left joins (thanks
> for the links to the latest improvements).
>
>
>
> And I'm using HiveContext; it worked in a previous build (10/12) when
> joining 15 dimension tables.
>
>
>
> Jianshi
>
>
>
> On Thu, Nov 27, 2014 at 8:35 AM, Cheng, Hao <ha...@intel.com> wrote:
>
>  Are all of your join keys the same? And I guess the join types are all
> “left” joins; https://github.com/apache/spark/pull/3362 is probably what
> you need.
>
>
>
> And, SparkSQL doesn’t currently support multiway joins (or multiway
> broadcast joins); https://github.com/apache/spark/pull/3270 should be
> another optimization for this.
>
>
>
>
>
> *From:* Jianshi Huang [mailto:jianshi.huang@gmail.com]
> *Sent:* Wednesday, November 26, 2014 4:36 PM
> *To:* user
> *Subject:* Auto BroadcastJoin optimization failed in latest Spark
>
>
>
> Hi,
>
>
>
> I've confirmed that the latest Spark with either Hive 0.12 or 0.13.1 fails
> to apply the auto broadcast join optimization in my query. I have a query
> that joins a huge fact table with 15 tiny dimension tables.
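>
> The query shape is roughly like this (a sketch; table, column, and key
> names are hypothetical, and the real query joins 15 dimension tables the
> same way):
>
>   hc.sql("""
>     SELECT f.*, d1.label, d2.label
>     FROM fact f
>     JOIN dim1 d1 ON f.code1 = d1.code
>     JOIN dim2 d2 ON f.code2 = d2.code
>     -- ...and so on, up to dim15
>   """)  // hc: a HiveContext, as above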
>
>
>
> I'm currently using an older version of Spark which was built on Oct. 12.
>
>
>
> Has anyone else met a similar situation?
>
>
>
> --
>
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>
>
>
>
>
> --
>
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/

Re: Auto BroadcastJoin optimization failed in latest Spark

Posted by Jianshi Huang <ji...@gmail.com>.
Following Liancheng's suggestion, I've tried setting

 spark.sql.hive.convertMetastoreParquet  false

but ANALYZE with NOSCAN still returns -1 for rawDataSize.
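
For reference, this is how I set it (a sketch; hc is my HiveContext, and
table1 is the table from my earlier mail):

  hc.setConf("spark.sql.hive.convertMetastoreParquet", "false")
  hc.sql("ANALYZE TABLE table1 COMPUTE STATISTICS noscan")
  hc.sql("DESC EXTENDED table1").collect().foreach(println)  // rawDataSize=-1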

Jianshi


On Fri, Dec 5, 2014 at 3:33 PM, Jianshi Huang <ji...@gmail.com>
wrote:

> If I run ANALYZE without NOSCAN, then Hive successfully computes the statistics:
>
> parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417764589,
> COLUMN_STATS_ACCURATE=true, totalSize=0, numRows=1156, rawDataSize=76296}
>
> Is Hive's PARQUET support broken?
>
> Jianshi
>
>
> On Fri, Dec 5, 2014 at 3:30 PM, Jianshi Huang <ji...@gmail.com>
> wrote:
>
>> Sorry for the late follow-up.
>>
>> I used Hao's DESC EXTENDED command and found some clues:
>>
>> new (broadcast broken Spark build):
>> parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892,
>> COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}
>>
>> old (broadcast working Spark build):
>> parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591,
>> totalSize=56166}
>>
>> Looks like the table size computation failed in the latest version.
>>
>> I've run the analyze command:
>>
>>   ANALYZE TABLE $table COMPUTE STATISTICS noscan
>>
>> And the tables are created from Parquet files:
>>
>> e.g.
>> CREATE EXTERNAL TABLE table1 (
>>   code int,
>>   desc string
>> )
>> STORED AS PARQUET
>> LOCATION '/user/jianshuang/data/dim_tables/table1.parquet'
>>
>>
>> Does anyone know what went wrong?
>>
>>
>> Thanks,
>> Jianshi
>>
>>
>>
>> On Fri, Nov 28, 2014 at 1:24 PM, Cheng, Hao <ha...@intel.com> wrote:
>>
>>>  Hi Jianshi,
>>>
>>> I couldn’t reproduce that with the latest MASTER, and I always get the
>>> BroadcastHashJoin for managed tables (in .csv files) in my testing. Are
>>> there any external tables in your case?
>>>
>>>
>>>
>>> In general, there are a couple of things you can try first (with
>>> HiveContext):
>>>
>>> 1)      ANALYZE TABLE xxx COMPUTE STATISTICS NOSCAN; (apply that to all
>>> of the tables);
>>>
>>> 2)      SET spark.sql.autoBroadcastJoinThreshold=xxx; (Set the threshold
>>> to a greater value; it is 1024*1024*10 bytes by default. Just make sure
>>> the largest dimension table's size in bytes is less than this.)
>>>
>>> 3)      Always put the main table (the biggest table) left-most among
>>> the inner joins;
>>>
>>>
>>>
>>> DESC EXTENDED tablename; -- this will print the detailed table
>>> statistics, including the table size (the field “totalSize”)
>>>
>>> EXPLAIN EXTENDED query; -- this will print the detailed physical plan.
>>>
>>>
>>>
>>> Let me know if you still have problems.
>>>
>>>
>>>
>>> Hao
>>>
>>>
>>>
>>> *From:* Jianshi Huang [mailto:jianshi.huang@gmail.com]
>>> *Sent:* Thursday, November 27, 2014 10:24 PM
>>> *To:* Cheng, Hao
>>> *Cc:* user
>>> *Subject:* Re: Auto BroadcastJoin optimization failed in latest Spark
>>>
>>>
>>>
>>> Hi Hao,
>>>
>>>
>>>
>>> I'm using inner joins, as broadcast join didn't work for left joins
>>> (thanks for the links to the latest improvements).
>>>
>>>
>>>
>>> And I'm using HiveContext; it worked in a previous build (10/12) when
>>> joining 15 dimension tables.
>>>
>>>
>>>
>>> Jianshi
>>>
>>>
>>>
>>> On Thu, Nov 27, 2014 at 8:35 AM, Cheng, Hao <ha...@intel.com> wrote:
>>>
>>>  Are all of your join keys the same? And I guess the join types are all
>>> “left” joins; https://github.com/apache/spark/pull/3362 is probably what
>>> you need.
>>>
>>>
>>>
>>> And, SparkSQL doesn’t currently support multiway joins (or multiway
>>> broadcast joins); https://github.com/apache/spark/pull/3270 should be
>>> another optimization for this.
>>>
>>>
>>>
>>>
>>>
>>> *From:* Jianshi Huang [mailto:jianshi.huang@gmail.com]
>>> *Sent:* Wednesday, November 26, 2014 4:36 PM
>>> *To:* user
>>> *Subject:* Auto BroadcastJoin optimization failed in latest Spark
>>>
>>>
>>>
>>> Hi,
>>>
>>>
>>>
>>> I've confirmed that the latest Spark with either Hive 0.12 or 0.13.1
>>> fails to apply the auto broadcast join optimization in my query. I have a
>>> query that joins a huge fact table with 15 tiny dimension tables.
>>>
>>>
>>>
>>> I'm currently using an older version of Spark which was built on Oct. 12.
>>>
>>>
>>>
>>> Has anyone else met a similar situation?
>>>
>>>
>>>
>>> --
>>>
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>>
>>
>>
>>
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/

Re: Auto BroadcastJoin optimization failed in latest Spark

Posted by Jianshi Huang <ji...@gmail.com>.
If I run ANALYZE without NOSCAN, then Hive successfully computes the statistics:

parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417764589,
COLUMN_STATS_ACCURATE=true, totalSize=0, numRows=1156, rawDataSize=76296}
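
In other words (a sketch; same table1 as before, hc a HiveContext):

  hc.sql("ANALYZE TABLE table1 COMPUTE STATISTICS")
  // -> numRows=1156, rawDataSize=76296 (the full scan fills in the row stats)
  hc.sql("ANALYZE TABLE table1 COMPUTE STATISTICS noscan")
  // -> numRows=-1, rawDataSize=-1, and totalSize stays 0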

Is Hive's PARQUET support broken?

Jianshi


On Fri, Dec 5, 2014 at 3:30 PM, Jianshi Huang <ji...@gmail.com>
wrote:

> Sorry for the late follow-up.
>
> I used Hao's DESC EXTENDED command and found some clues:
>
> new (broadcast broken Spark build):
> parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892,
> COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}
>
> old (broadcast working Spark build):
> parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591,
> totalSize=56166}
>
> Looks like the table size computation failed in the latest version.
>
> I've run the analyze command:
>
>   ANALYZE TABLE $table COMPUTE STATISTICS noscan
>
> And the tables are created from Parquet files:
>
> e.g.
> CREATE EXTERNAL TABLE table1 (
>   code int,
>   desc string
> )
> STORED AS PARQUET
> LOCATION '/user/jianshuang/data/dim_tables/table1.parquet'
>
>
> Does anyone know what went wrong?
>
>
> Thanks,
> Jianshi
>
>
>
> On Fri, Nov 28, 2014 at 1:24 PM, Cheng, Hao <ha...@intel.com> wrote:
>
>>  Hi Jianshi,
>>
>> I couldn’t reproduce that with the latest MASTER, and I always get the
>> BroadcastHashJoin for managed tables (in .csv files) in my testing. Are
>> there any external tables in your case?
>>
>>
>>
>> In general, there are a couple of things you can try first (with HiveContext):
>>
>> 1)      ANALYZE TABLE xxx COMPUTE STATISTICS NOSCAN; (apply that to all
>> of the tables);
>>
>> 2)      SET spark.sql.autoBroadcastJoinThreshold=xxx; (Set the threshold
>> to a greater value; it is 1024*1024*10 bytes by default. Just make sure the
>> largest dimension table's size in bytes is less than this.)
>>
>> 3)      Always put the main table (the biggest table) left-most among
>> the inner joins;
>>
>>
>>
>> DESC EXTENDED tablename; -- this will print the detailed table statistics,
>> including the table size (the field “totalSize”)
>>
>> EXPLAIN EXTENDED query; -- this will print the detailed physical plan.
>>
>>
>>
>> Let me know if you still have problems.
>>
>>
>>
>> Hao
>>
>>
>>
>> *From:* Jianshi Huang [mailto:jianshi.huang@gmail.com]
>> *Sent:* Thursday, November 27, 2014 10:24 PM
>> *To:* Cheng, Hao
>> *Cc:* user
>> *Subject:* Re: Auto BroadcastJoin optimization failed in latest Spark
>>
>>
>>
>> Hi Hao,
>>
>>
>>
>> I'm using inner joins, as broadcast join didn't work for left joins (thanks
>> for the links to the latest improvements).
>>
>>
>>
>> And I'm using HiveContext; it worked in a previous build (10/12) when
>> joining 15 dimension tables.
>>
>>
>>
>> Jianshi
>>
>>
>>
>> On Thu, Nov 27, 2014 at 8:35 AM, Cheng, Hao <ha...@intel.com> wrote:
>>
>>  Are all of your join keys the same? And I guess the join types are all
>> “left” joins; https://github.com/apache/spark/pull/3362 is probably what
>> you need.
>>
>>
>>
>> And, SparkSQL doesn’t currently support multiway joins (or multiway
>> broadcast joins); https://github.com/apache/spark/pull/3270 should be
>> another optimization for this.
>>
>>
>>
>>
>>
>> *From:* Jianshi Huang [mailto:jianshi.huang@gmail.com]
>> *Sent:* Wednesday, November 26, 2014 4:36 PM
>> *To:* user
>> *Subject:* Auto BroadcastJoin optimization failed in latest Spark
>>
>>
>>
>> Hi,
>>
>>
>>
>> I've confirmed that the latest Spark with either Hive 0.12 or 0.13.1
>> fails to apply the auto broadcast join optimization in my query. I have a
>> query that joins a huge fact table with 15 tiny dimension tables.
>>
>>
>>
>> I'm currently using an older version of Spark which was built on Oct. 12.
>>
>>
>>
>> Has anyone else met a similar situation?
>>
>>
>>
>> --
>>
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>>
>>
>>
>>
>> --
>>
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
