Posted to commits@pig.apache.org by Apache Wiki <wi...@apache.org> on 2009/08/20 03:06:08 UTC

[Pig Wiki] Update of "PigSkewedJoinSpec" by SriranjanManjunath

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by SriranjanManjunath:
http://wiki.apache.org/pig/PigSkewedJoinSpec

------------------------------------------------------------------------------
     * In the first phase, the skewed join uses the order by sampling to compute a histogram of the records. It then relies on user configs to pass the intermediate keys to the right reducers.
     * In the second phase, the uniform random sampling currently used by order by will be replaced by a block-level sampler, which avoids over-sampling the data for large inputs.
  
+ [[Anchor(Performance)]]
+ == Skewed Join performance ==
+ We ran the PigMix suite L3 test on a Hadoop cluster to compare skewed join with the regular join. Averaged over 3 runs, skewed join took around 24 hours 30 minutes to complete, whereas the regular join had to be killed after running for 5 days.
+ 
+ We conducted various performance tests to come up with a "magic" value for the memusage parameter. Here are the results:
+ ||Number of tuples||Number of Reducers||Total Time||Memusage||
+ ||262159 x 2607||2||8min 10sec||0.5||
+ ||262159 x 2607||3||5min 8sec||0.3||
+ ||262159 x 2607||5||3min 23sec||0.2||
+ ||262159 x 2607||9||2min 6sec||0.1||
+ ||262159 x 2607||18||1min 15sec||0.05||
+ ||262159 x 2607||36||1min 12sec||0.025||
+ ||262159 x 2607||90||1min 13sec||0.01||
+ ||262159 x 2607||112||1min 17sec||0.008||
+ ||262159 x 26195||2||77min 10sec||0.5||
+ ||262159 x 26195||3||47min 58sec||0.3||
+ ||262159 x 26195||5||27min 47sec||0.2||
+ ||262159 x 26195||9||16min 38sec||0.1||
+ ||262159 x 26195||18||8min 31sec||0.05||
+ ||262159 x 26195||36||4min 37sec||0.025||
+ ||262159 x 26195||90||3min 56sec||0.01||
+ ||262159 x 26195||112||4min 42sec||0.008||
+ 
+ As is evident from the results, the performance of skewed join varies significantly with the value of memusage. We advise keeping memusage low so that the join is spread across multiple reducers. Note that an extremely low value increases the copying cost, since the streaming table then has to be copied to more reducers. We have seen good performance with this value set in the range of 0.1 - 0.4.
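
The effect of memusage on run time can be made concrete by converting the timings in the table to seconds and comparing each run with the memusage 0.5 run. The following is a minimal Python sketch, not part of Pig; the helper names (`to_seconds`, `speedups`, `fastest`) are made up for illustration, and the only data used is the set of 262159 x 26195 rows copied from the table above.

```python
import re

# Timings copied from the table above for the 262159 x 26195 input:
# (memusage, number of reducers, total time as printed in the table).
RUNS = [
    (0.5, 2, "77min 10sec"),
    (0.3, 3, "47min 58sec"),
    (0.2, 5, "27min 47sec"),
    (0.1, 9, "16min 38sec"),
    (0.05, 18, "8min 31sec"),
    (0.025, 36, "4min 37sec"),
    (0.01, 90, "3min 56sec"),
    (0.008, 112, "4min 42sec"),
]


def to_seconds(text):
    """Convert a duration such as '8min 31sec' to a number of seconds."""
    match = re.fullmatch(r"(\d+)min\s*(\d+)\s*sec", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)


def speedups(runs):
    """Return (memusage, reducers, seconds, speedup) rows.

    The speedup is measured against the first run in the list,
    which here is the memusage 0.5 run.
    """
    baseline = to_seconds(runs[0][2])
    rows = []
    for memusage, reducers, text in runs:
        seconds = to_seconds(text)
        rows.append((memusage, reducers, seconds, baseline / seconds))
    return rows


def fastest(runs):
    """Return the run with the smallest total time."""
    return min(runs, key=lambda run: to_seconds(run[2]))


if __name__ == "__main__":
    for memusage, reducers, seconds, speedup in speedups(RUNS):
        print(f"memusage={memusage:<6} reducers={reducers:<4} "
              f"{seconds:>5}s  {speedup:5.1f}x")
    print("fastest run:", fastest(RUNS))
```

For this input the shortest time in the table is at memusage 0.01, and the time rises again at 0.008, which is consistent with the copying cost described above.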
+ 
+ [[Anchor(Usage)]]
+ == Usage Notes ==
+    * Append the 'using "skewed"' construct to the join to force Pig to use skewed join
+    * Set pig.skewedjoin.more ~/.pig 
+ 
+ 
  [[Anchor(References)]]
  == References ==
     (1) "Practical Skew Handling in Parallel Joins" - David J. DeWitt, Jeffrey F. Naughton, Donovan A. Schneider, S. Seshadri