Posted to commits@pig.apache.org by Apache Wiki <wi...@apache.org> on 2008/10/23 19:42:05 UTC

[Pig Wiki] Update of "PigUserCookbook" by AdilAijaz

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by AdilAijaz:
http://wiki.apache.org/pig/PigUserCookbook

------------------------------------------------------------------------------
  
  The following is a list of tips that people have discovered for making their Pig queries run faster.  Please feel free to add any tips you have.
  
- '''Project Early and Often'''
+ ''' Project Early and Often '''
  
  Pig does not (yet) determine when a field is no longer needed and drop the field from the row.  For example, say you have a query like:
  
@@ -75, +75 @@

  significant.  In one test where the key was null 7% of the time and the data was spread across 200 reducers, we saw about a 10x speedup in the query by adding the early
  filters.
  
+ 
+ '''Prefer DISTINCT over GROUP BY - GENERATE'''
+ 
+ When it comes to extracting the unique values from a column in a relation, one of two approaches can be used:
+ 
+ ''Using GROUP BY - GENERATE''
+ 
+ {{{
+ A = load 'myfile' as (t, u, v);
+ B = foreach A generate u;
+ C = group B by u;
+ D = foreach C generate group as uniquekey;
+ dump D; 
+ }}}
+ 
+ ''Using DISTINCT''
+ 
+ {{{
+ A = load 'myfile' as (t, u, v);
+ B = foreach A generate u;
+ C = distinct B;
+ dump C; 
+ }}}
+ 
+ In Pig 1.x, DISTINCT is just GROUP BY/PROJECT under the hood. In Pig 2.0 (types branch) it no longer is, and it is much faster and more efficient (up to 20x faster in the Pig team's tests, depending on your key cardinality). DISTINCT is therefore recommended over GROUP BY - GENERATE.
+