Posted to commits@pig.apache.org by Apache Wiki <wi...@apache.org> on 2008/11/12 20:31:09 UTC

[Pig Wiki] Trivial Update of "PigUserCookbook" by OlgaN

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/PigUserCookbook

------------------------------------------------------------------------------
  
  will greatly reduce the amount of data that Pig carries through the map and reduce phases.  Depending on your data, this can produce significant time savings.  In
  queries similar to the example given, we have seen total time drop by 50%.
+ 
+ '''Filter Early and Often'''
+ 
+ As with early projection, in most cases it is beneficial to apply filters as early as possible to reduce the amount of data flowing through the pipeline.
+ 
+ {{{
+ -- Query 1
+ A = load 'myfile' as (t, u, v);
+ B = load 'myotherfile' as (x, y, z);
+ C = filter A by t == 1;
+ D = join C by t, B by x;
+ E = group D by u;
+ F = foreach E generate group, COUNT($1);
+ 
+ -- Query 2
+ A = load 'myfile' as (t, u, v);
+ B = load 'myotherfile' as (x, y, z);
+ C = join A by t, B by x;
+ D = filter C by t == 1;
+ E = group D by u;
+ F = foreach E generate group, COUNT($1);
+ }}}
+ 
+ The first query is clearly more efficient than the second one because it reduces the amount of data going into the join.
+ 
+ One case where pushing a filter up might not be a good idea is when the filter is very expensive to apply and removes only a small amount of data.
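+ 
+ A minimal sketch of that exception, assuming a hypothetical filter UDF myudfs.ExpensiveCheck packaged in a hypothetical myudfs.jar (neither is part of Pig) and assuming the join discards most rows of A. Under those assumptions, running the costly filter after the join means it is evaluated on fewer records:
+ 
+ {{{
+ -- Hypothetical jar and UDF, used for illustration only
+ register myudfs.jar;
+ A = load 'myfile' as (t, u, v);
+ B = load 'myotherfile' as (x, y, z);
+ -- Assumed to be selective: few rows of A find a match in B
+ C = join A by t, B by x;
+ -- The expensive filter now runs only on the joined rows
+ D = filter C by myudfs.ExpensiveCheck(v);
+ }}}
+ 
+ Whether this ordering pays off depends on the data, so it is worth timing both variants.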
  
  '''Drop Nulls Before a Join'''