Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2010/12/01 00:09:45 UTC

[Hadoop Wiki] Update of "Hive/JoinOptimization" by LiyinTang

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/JoinOptimization" page has been changed by LiyinTang.
http://wiki.apache.org/hadoop/Hive/JoinOptimization?action=diff&rev1=7&rev2=8

--------------------------------------------------

  
  Hive-1641 ([[http://issues.apache.org/jira/browse/HIVE-1641|http://issues.apache.org/jira/browse/HIVE-1641]]) has solved this problem. As shown in Fig 2, the basic idea is to create a new task, the MapReduce Local Task, before the original Join Map/Reduce Task. This new task reads the small table's data from HDFS into an in-memory hashtable. After reading, it serializes the in-memory hashtable into a file on disk and compresses that hashtable file into a tar file. In the next stage, when the MapReduce task is launched, it puts this tar file into the Hadoop Distributed Cache, which copies the tar file to each Mapper's local disk and decompresses it. So all the Mappers can deserialize the hashtable file back into memory and do the join work as before.
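The local task's hashtable stage can be sketched as follows. This is a hypothetical Python simulation of the idea, not Hive's actual Java implementation; the function names, file name, and row data are made up for illustration, and gzip stands in for the tar-compression step.

```python
import gzip
import pickle

def build_and_dump_hashtable(small_table_rows, path):
    """Simulate the MapReduce Local Task: load the small table into an
    in-memory hashtable keyed on the join key, then serialize and
    compress it so it could be shipped via the Distributed Cache."""
    hashtable = {}
    for key, value in small_table_rows:
        hashtable.setdefault(key, []).append(value)  # a join key may repeat
    with gzip.open(path, "wb") as f:  # compression stands in for the tar step
        pickle.dump(hashtable, f)
    return hashtable

def load_hashtable(path):
    """Simulate a Mapper deserializing the hashtable back into memory."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)

rows = [(1, "a"), (2, "b"), (1, "c")]
original = build_and_dump_hashtable(rows, "small_table.ht.gz")
restored = load_hashtable("small_table.ht.gz")
assert restored == original  # every Mapper rebuilds the same hashtable
```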
  
+ Obviously, the Local Task is very memory intensive, so the query processor launches it in a child JVM, which has the same heap size as the Mapper's. Since the Local Task may run out of memory, the query processor measures its memory usage carefully. Once the memory usage of the Local Task rises above a threshold, the Local Task aborts itself and tells the user that the table is too large to hold in memory. Users can change this threshold with '''''set hive.mapjoin.localtask.max.memory.usage = 0.999;'''''
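The abort check can be sketched like this (a minimal Python illustration of the threshold logic, not Hive's code; the function name and byte values are hypothetical, and the 0.90 default is only an example value):

```python
def check_local_task_memory(used_bytes, max_heap_bytes, threshold=0.90):
    """Sketch of the Local Task's self-check: if memory usage exceeds the
    configured fraction of the child JVM heap (cf.
    hive.mapjoin.localtask.max.memory.usage), abort rather than crash
    with an out-of-memory error later."""
    usage = used_bytes / max_heap_bytes
    if usage > threshold:
        raise MemoryError(
            f"table too large to hold in memory "
            f"(usage {usage:.2%} > threshold {threshold:.0%})")
    return usage

# Under the threshold: the Local Task keeps going.
assert check_local_task_memory(500, 1000) == 0.5

# Over the threshold: the Local Task aborts itself.
try:
    check_local_task_memory(999, 1000)
    aborted = False
except MemoryError:
    aborted = True
assert aborted
```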
+ 
  == 1.2 Removing JDBM ==
  Previously, Hive used JDBM ([[http://jdbm.sourceforge.net/|http://jdbm.sourceforge.net/]]) as a persistent hashtable. Whenever the in-memory hashtable could not hold any more data, it swapped the key/value pairs into the JDBM table. However, when profiling the Map Join, we found that this JDBM component takes more than 70% of the CPU time, as shown in Fig 3. Also, the persistent file JDBM generates is too large to put into the Distributed Cache. For example, if users put 67,000 simple integer key/value pairs into the JDBM, it generates a hashtable file of more than 22 MB. So JDBM is too heavyweight for Map Join, and it would be better to remove this component from Hive. Map Join is designed to hold the small table's data in memory; if the table is too large to hold, the query just runs as a Common Join, so there is no need for a persistent hashtable any more. This component is removed in Hive-1754 ([[http://issues.apache.org/jira/browse/HIVE-1754|http://issues.apache.org/jira/browse/HIVE-1754]]).
  
  {{attachment:fig3.jpg}}
  
- '''Fig 3. The Profiling Result of JDBM<<BR>>'''
+ '''Fig 3. The Profiling Result of JDBM'''
  
  == 1.3 Performance Evaluation ==
  '''Table 1: The Comparison between the previous map join with the new optimized map join'''
@@ -34, +36 @@

  
  = 2. Converting Join into Map Join Automatically =
  == 2.1 New Join Execution Flow ==
- Since map join is faster than the common join, it would better to run the map join whenever possible. Previously, Hive users need to give a hint in the query to assign which table the small table is. For example, select /*+mapjoin(a)*/ a.key, b.value from srcpart_empty a join src b on a.key=b.key;   It is not a good way for user experience and query performance, because sometimes user may give a wrong hint and also users may not give any hints. It would be much better to convert the Common Join into Map Join without users' hint.
+ Since map join is faster than the common join, it would be better to run the map join whenever possible. Previously, Hive users needed to give a hint in the query to specify which table is the small table. For example: '''''select /*+mapjoin(a)*/ a.key, b.value from src1 a join src2 b on a.key=b.key;''''' This is not good for user experience or query performance, because users may give a wrong hint, or may not give any hint at all. It would be much better to convert the Common Join into a Map Join without any user hint.
  
  Hive-1642 ([[http://issues.apache.org/jira/browse/HIVE-1642|http://issues.apache.org/jira/browse/HIVE-1642]]) has solved this problem by converting the Common Join into a Map Join automatically. For the Map Join, the query processor must know which input table is the big table. The other input tables are recognized as small tables during the execution stage, and they need to be held in memory. However, the query processor has no idea of the input file sizes at compile time, because some of the tables may be intermediate tables generated from subqueries. So the query processor can only figure out the input file sizes at execution time.
  
+ Right now, users need to enable this feature with '''''set hive.auto.convert.join = true;'''''
+ 
  {{attachment:fig5.jpg||height="716px",width="1017px"}}
  
- As shown in fig5,
+ '''Fig 5. The Join Execution Flow'''
+ 
+ As shown in Fig 5, the left side shows the previous Common Join execution flow, which is very straightforward. In contrast, the right side shows the new Common Join execution flow. At compile time, the query processor generates a Conditional Task, which contains a list of tasks; one of these tasks is resolved to run at execution time. That is, the tasks in the Conditional Task's list are candidates, and one of them is chosen at run time. First, the original Common Join Task is put into the list. The query processor also generates a series of Map Join Tasks by assuming that each of the input tables may be the big table. Take the same example as before: '''''select /*+mapjoin(a)*/ a.key, b.value from src1 a join src2 b on a.key=b.key'''''. Both '''''src1''''' and '''''src2''''' may be the big table, so the processor generates two Map Join Tasks, one assuming '''''src1''''' is the big table and the other assuming '''''src2''''' is the big table, as shown in Fig 6.
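The compile-time planning step described above can be sketched as follows. This is a hypothetical Python simulation of the Conditional Task's candidate list, not Hive's planner; the dictionary shapes and function name are invented for illustration.

```python
def generate_conditional_task(tables, original_common_join):
    """Sketch of compile-time planning: keep the original Common Join as
    one candidate, then add one Map Join candidate per 'assume table X
    is the big table' choice. Exactly one candidate is resolved to run
    at execution time."""
    candidates = [original_common_join]
    for big in tables:
        small = [t for t in tables if t != big]
        candidates.append({"type": "map_join",
                           "big_table": big,
                           "small_tables": small})
    return candidates

tasks = generate_conditional_task(["src1", "src2"], {"type": "common_join"})
# One Common Join plus one Map Join per big-table assumption: 3 candidates.
assert len(tasks) == 3
```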
+ 
+ {{attachment:fig6.jpg||height="778px",width="1072px"}}
+ 
+ '''Fig 6. Create Map Join Tasks by Assuming One of the Input Tables is the Big Table'''
  
  == 2.2 Resolving the Join Operation at Run Time ==
+ During the execution stage, the Conditional Task knows the exact file size of each input table, even if the table is an intermediate table. If all the tables are too large to be converted into a map join, it just runs the Common Join Task as before. If one of the tables is large and the others are small enough to run a Map Join, then the Conditional Task picks the corresponding Map Join Local Task to run. By this mechanism, it converts the Common Join into a Map Join automatically and dynamically.
+ 
+ Currently, if the total size of the small tables is larger than 25 MB, the Conditional Task chooses the original Common Join to run. 25 MB is a very conservative number, and users can change it with '''''set hive.smalltable.filesize = 30000000;'''''
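The run-time resolution can be sketched like this (a hypothetical Python illustration of the size-based choice, not Hive's resolver; the candidate format and table sizes are made up, and only the 25 MB default comes from the text above):

```python
SMALL_TABLE_MAX_BYTES = 25 * 1000 * 1000  # cf. hive.smalltable.filesize

def resolve_join_task(candidates, file_sizes):
    """Sketch of run-time resolution: pick the first Map Join candidate
    whose assumed small tables all fit under the size threshold;
    otherwise fall back to the original Common Join."""
    for task in candidates:
        if task["type"] != "map_join":
            continue
        total_small = sum(file_sizes[t] for t in task["small_tables"])
        if total_small <= SMALL_TABLE_MAX_BYTES:
            return task
    return next(t for t in candidates if t["type"] == "common_join")

candidates = [
    {"type": "common_join"},
    {"type": "map_join", "big_table": "src1", "small_tables": ["src2"]},
    {"type": "map_join", "big_table": "src2", "small_tables": ["src1"]},
]

# src2 is small enough to hash, so src1 is treated as the big table.
sizes = {"src1": 10_000_000_000, "src2": 1_000_000}
chosen = resolve_join_task(candidates, sizes)
assert chosen["big_table"] == "src1"

# Both tables too large: fall back to the Common Join.
big_sizes = {"src1": 10_000_000_000, "src2": 9_000_000_000}
fallback = resolve_join_task(candidates, big_sizes)
assert fallback["type"] == "common_join"
```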
+ 
  == 2.3 Backup Task ==
+ As mentioned above, the Local Task of the Map Join is very memory intensive, so the query processor launches it in a child JVM, which has the same heap size as the Mapper's. Since the Local Task may run out of memory, the query processor measures its memory usage carefully. Once the memory usage of the Local Task rises above a threshold, the Local Task aborts itself, which means the Map Join Task fails.
+ 
+ {{attachment:fig7.jpg}}
+ 
'''Fig 7. Run the Original Common Join as a Backup Task'''
+ 
In this case, the query processor will launch the original Common Join Task as a Backup Task to run, which is totally transparent to the user. The basic idea is shown in Fig 7.
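The Backup Task mechanism can be sketched as follows (a minimal Python illustration, not Hive's code; the task callables and the MemoryError signal are stand-ins for the real Map Join failure path):

```python
def run_with_backup(map_join_task, common_join_task):
    """Sketch of the Backup Task mechanism: if the chosen Map Join fails
    (e.g. its Local Task aborts on the memory threshold), transparently
    rerun the query as the original Common Join."""
    try:
        return map_join_task()
    except MemoryError:
        return common_join_task()

def failing_map_join():
    # Stand-in for a Local Task that aborts because the "small" table
    # did not actually fit in the child JVM's heap.
    raise MemoryError("small table did not fit in memory")

result = run_with_backup(failing_map_join, lambda: "common join result")
assert result == "common join result"  # the user only sees the final result
```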
+ 
  == 2.4 Performance Evaluation ==