You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pig.apache.org by Apache Wiki <wi...@apache.org> on 2008/10/22 00:56:13 UTC
[Pig Wiki] Update of "PigUserCookbook" by AlanGates

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by AlanGates:
http://wiki.apache.org/pig/PigUserCookbook

New page:
= Pig User CookBook =
This document provides hints and tips for pig users.

== Performance Enhancers ==

The following are a list of tips that people have discovered for making their pig queries run faster.  Please feel free to add any tips you have.

'''Project Early and Often'''

Pig does not (yet) determine when a field is no longer needed and drop the field from the row.  For example, say you have a query like:

{{{
A = load 'myfile' as (t, u, v);
B = load 'myotherfile' as (x, y, z);
C = join A by t, B by x;
D = group C by u;
E = foreach D generate group, COUNT($1);
}}}

There is no need for v, y, or z to participate in this query.  And there is no need to carry both t and x past the join, just one will suffice.  Changing
the above query to 

{{{
A = load 'myfile' as (t, u, v);
A1 = foreach A generate t, u;
B = load 'myotherfile' as (x, y, z);
B1 = foreach B generate x;
C = join A1 by t, B1 by x;
C1 = foreach C generate t, u;
D = group C1 by u;
E = foreach D generate group, COUNT($1);
}}} 

will greatly reduce the amount of data being carried through the map and reduce phases by pig.  Depending on your data, this can produce significant time savings.  In
queries similar to the example given we have seen total time drop by 50%.

'''Drop Nulls Before a Join'''

This comment only applies to pig on the types branch, as pig 0.1.0 does not have nulls.

With the introduction of nulls, join and cogroup semantics were altered to work with nulls.  The semantic for cogrouping with nulls is that nulls from a given input are
grouped together, but nulls across inputs are not grouped together.  This preserves the semantics of grouping (nulls are collected together from a single input to be
passed to aggregate functions like COUNT) and the semantics of join (nulls are not joined across inputs).  Since flattening an empty bag results in an empty row, in a
standard join the rows with a null key will always be dropped.  The join: 

{{{
A = load 'myfile' as (t, u, v);
B = load 'myotherfile' as (x, y, z);
C = join A by t, B by x;
}}}

is rewritten by pig to

{{{
A = load 'myfile' as (t, u, v);
B = load 'myotherfile' as (x, y, z);
C1 = cogroup A by t, B by x;
C = foreach C1 generate flatten(A), flatten(B);
}}}

Since the nulls from A and B won't be collected together, when the nulls are flattened we're guaranteed to have an empty bag, which will result in no output.  So the null
keys will be dropped.  But they will not be dropped until the last possible moment.  If the query is rewritten to

{{{
A = load 'myfile' as (t, u, v);
B = load 'myotherfile' as (x, y, z);
A1 = filter A by t is not null;
B1 = filter B by x is not null;
C = join A1 by t, B1 by x;
}}}

then the nulls will be dropped before the join.  Since all null keys go to a single reducer, if your key is null even a small percentage of the time the gain can be
significant.  In one test where the key was null 7% of the time and the data was spread across 200 reducers, we saw a 6x speed up in the query by adding the early
filters.