Posted to user@pig.apache.org by Giuseppe Ottaviano <gi...@gmail.com> on 2013/03/04 14:08:36 UTC

Efficiently retrieve the number of written tuples in embedded Pig

Hi,
I'm writing an iterative algorithm using Pig embedded in Python. At
each iteration I need the number of records written to the output to
compute the parameters for the next iteration, so I'm currently doing
something like this:

myTuples = FOREACH (GROUP ...
nMyTuples = FOREACH (GROUP myTuples ALL) GENERATE COUNT($1);

STORE nMyTuples INTO 'nMyTuples';

Then from Python I get an iterator on the nMyTuples stream using
PigStats.result('nMyTuples').iterator() and read the result.
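
For reference, the Python side is roughly the sketch below. The helper name
read_count is hypothetical, and it assumes the stats object is what
runSingle() returns, with the count in field 0 of the only tuple.

```python
def read_count(stats, alias='nMyTuples'):
    """Return the count stored under `alias`, or None if no row came back.

    Assumption: `stats` is the object returned by runSingle() in embedded
    Pig, so stats.result(alias).iterator() yields tuples whose field 0
    holds the COUNT value.
    """
    it = stats.result(alias).iterator()
    if not it.hasNext():
        # No tuple was stored for this alias (e.g. the input was empty).
        return None
    return int(it.next().get(0))

# Assumed usage inside the Jython script (not executed here):
#   from org.apache.pig.scripting import Pig
#   stats = Pig.compile(script).bind(params).runSingle()
#   n = read_count(stats)
```

Returning None rather than 0 when no tuple is stored keeps "nothing was
written" distinguishable from a real count.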

This is clearly suboptimal, since an additional MapReduce job is needed
just to count the tuples, while the count should be available as a
counter in the reducers.

I've seen OutputStats.getNumberRecords(), but it does not seem to work
in local mode (this appears to be related to
https://issues.apache.org/jira/browse/PIG-1641).

Is there a better way to get the count that works in both local and
mapreduce modes?

Thanks,
Giuseppe