Posted to user@pig.apache.org by Sergey Goder <se...@gmail.com> on 2013/03/29 23:08:10 UTC

Moving Cross of Large Data to be Nested

I've had the same issue that is discussed in this thread:
http://search-hadoop.com/m/2GIkU1JJbon1/cross+reducers+slow&subj=Reducers+slowing+down+UNCLASSIFIED+
and am wondering if it can be solved using a CROSS within a nested FOREACH.
Here is some background information on what I have so far:

My data is in sparse matrix notation (row_i, col_j, value_ij):

grunt> DUMP mat;

(1, 1, 5)
(1, 2, 8)
(1, 3, 0)


I am actually computing a similarity matrix so my code would look something
like:

mat = LOAD 'mat' AS (row:int, col:int, score:int);
matTrans = LOAD 'mat' AS (row:int, col:int, score:int);

matCross = CROSS mat, matTrans;

matLowerCross = FILTER matCross BY mat::row < matTrans::row;

The job essentially stalls during the CROSS, but there are no errors.
I've tried other things, such as replacing the CROSS with a skewed join,
changing the pig.skewedjoin.reduce.memusage, pig.cachedbag.memusage and
mapred.job.shuffle.input.buffer.percent parameters, and setting PARALLEL
higher. While I've gotten the reducers to almost complete, they still
fail around the 99% mark every time.

I tried nesting the CROSS operation but have had no luck even getting it
to parse, so I was wondering if anyone knew a way it could be done. I think
the code should look something like this:

mat = LOAD 'mat' AS (row:int, col:int, score:int);
matTrans = LOAD 'mat' AS (row:int, col:int, score:int);

matGrp = GROUP mat BY row;
--this next part does not parse
matCrossed = FOREACH matGrp {
    lowerMat = FILTER matTrans BY row < group;
    crossedMat = CROSS group, lowerMat;
    GENERATE FLATTEN(crossedMat);
};
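
For comparison, here is a sketch of a nested CROSS that should parse,
under some assumptions that may not hold for this problem. As far as I know,
a nested block can only reference fields of the current record (here `group`
and the grouped bag), so an outer relation such as matTrans is not visible
inside it, and `group` is a scalar rather than a bag, so it cannot be an
operand of CROSS. The sketch assumes a Pig version that supports CROSS inside
a nested FOREACH (0.10 or later, I believe), and it assumes the similarity
only needs pairs of entries that share a column. The aliases a, b, grp,
pairs, crossed and lower are made up for illustration.

```pig
-- Load the matrix twice so the two sides have distinct aliases.
a = LOAD 'mat' AS (row:int, col:int, score:int);
b = LOAD 'mat' AS (row:int, col:int, score:int);

-- ASSUMPTION: only entries in the same column need to be paired.
-- COGROUP puts both bags into the same record, which is what a
-- nested CROSS needs as operands.
grp = COGROUP a BY col, b BY col;

crossed = FOREACH grp {
    -- Cross the two bags for this column only.
    pairs = CROSS a, b;
    GENERATE FLATTEN(pairs);
};

-- Fields after FLATTEN: a's (row, col, score), then b's (row, col, score).
-- Positional references avoid guessing the disambiguated field names.
lower = FILTER crossed BY $0 < $3;
```

This limits each cross to the entries of a single column instead of the
whole matrix. If every entry really has to be paired with every other entry
regardless of column, this sketch does not apply.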


Any ideas or help would be much appreciated.

Thanks,
Sergey