Posted to issues@hive.apache.org by "Rui Li (JIRA)" <ji...@apache.org> on 2017/02/27 09:45:45 UTC

[jira] [Commented] (HIVE-16046) Broadcasting small table for Hive on Spark

    [ https://issues.apache.org/jira/browse/HIVE-16046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15885458#comment-15885458 ] 

Rui Li commented on HIVE-16046:
-------------------------------

Details on why we didn't choose broadcast for map join can be found in HIVE-7613. But I agree we may want to revisit this.

> Broadcasting small table for Hive on Spark
> ------------------------------------------
>
>                 Key: HIVE-16046
>                 URL: https://issues.apache.org/jira/browse/HIVE-16046
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>
> Currently the Spark plan is:
> {code}
> 1. TS(Small table)->Sel/Fil->HashTableSink  
>                                    
> 2. TS(Small table)->Sel/Fil->HashTableSink          
>                                                                                                                        
> 3.                                             HashTableDummy --
>                                                                 |
>                                                 HashTableDummy  --
>                                                                 |
>                                 RootTS(Big table) ->Sel/Fil ->MapJoin -->Sel/Fil ->FileSink
> {code}
> 	1.   Run the small-table SparkWorks on the Spark cluster; each dumps its hash table to a hashmap file.
> 	2.   Run the SparkWork for the big table on the Spark cluster. Mappers look up the small-table hashmap from the file using HashTableDummy's loader.
> The disadvantage of the current implementation is that it takes a long time to distribute and cache the hash table when the hash table is large. The proposal here is to use sparkContext.broadcast() to store the small table instead, although this keeps the broadcast variable in the driver and may degrade driver performance.
> [~Fred], [~xuefuz], [~lirui] and [~csun], please give some suggestions on it. 
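
For readers unfamiliar with the two-step plan quoted above, here is a minimal plain-Python sketch of the map-join idea: build a hash table from the small table, then stream the big table and probe it. It is not Hive or Spark code. The function names (build_hash_table, map_join) and the sample rows are hypothetical, and the in-memory dict only stands in for the hash table that the current plan writes to a file and that the proposal would ship with sparkContext.broadcast().

```python
from collections import defaultdict


def build_hash_table(small_rows, key_index):
    """Step 1 (HashTableSink role): group the small table's rows by join key."""
    table = defaultdict(list)
    for row in small_rows:
        table[row[key_index]].append(row)
    return dict(table)


def map_join(big_rows, hash_table, key_index):
    """Step 2 (MapJoin role): stream the big table and probe the hash table."""
    for row in big_rows:
        # Rows without a matching key are dropped (inner-join semantics).
        for match in hash_table.get(row[key_index], []):
            yield row + match


# Hypothetical sample data: (join_key, value) tuples.
small = [(1, "a"), (2, "b")]
big = [(1, "x"), (2, "y"), (3, "z"), (1, "w")]

joined = list(map_join(big, build_hash_table(small, 0), 0))
```

In either design the probe step is the same; what differs is how the hash table reaches the tasks that run the big-table work (a hashmap file versus a broadcast variable held by the driver).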



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)