Posted to user@pig.apache.org by Alan Gates <ga...@yahoo-inc.com> on 2009/01/03 00:04:06 UTC
Re: Custom Group Func

Pig (and Hadoop underneath) is built with the expectation that a
single tuple goes to a single group.  But I think that what you want
to do can be accomplished in the following roundabout way.

First, you'll need an EvalFunc that returns a bag.  Let's call this  
function replicator.  Each bag will have one or more tuples.  Each  
tuple will be of the format: (groupnum, inputtuple).  It will be the
job of the replicator to create a copy of the tuple for each group it  
will go in.  So if the input to replicator is (a, b, c) and this  
should go in groups 1 and 5, then the output of replicator will be:

{ (1, (a, b, c)), (5, (a, b, c))}.
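The logic of such a replicator can be sketched in plain Java, without
the Pig EvalFunc API (whose signature depends on the Pig version).  This
is only a model under stated assumptions: the class name
ReplicatorSketch, the method replicate, and the choice to derive the
group keys by splitting one delimited field of the row (as in the
question quoted below) are all hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Plain-Java model of the replicator idea. Hypothetical names; this is
// NOT the Pig EvalFunc API, only the logic such a function would carry.
class ReplicatorSketch {

    // Returns the "bag": one (groupKey, row) pair for every non-empty key
    // found in row[keyField] after splitting on the literal delimiter.
    static List<Map.Entry<String, List<String>>> replicate(
            List<String> row, int keyField, String delimiter) {
        List<Map.Entry<String, List<String>>> bag = new ArrayList<>();
        String[] keys = row.get(keyField).split(Pattern.quote(delimiter));
        for (String key : keys) {
            if (!key.isEmpty()) {
                // The same row is paired with each group it belongs to.
                bag.add(new SimpleEntry<>(key, row));
            }
        }
        return bag;
    }

    public static void main(String[] args) {
        List<String> row =
            Arrays.asList("19", "4000", "16700;", "11;55;102;", "17", "3.0");
        // Field 3 holds the ';'-separated keys, so three pairs come back,
        // with keys 11, 55 and 102.
        for (Map.Entry<String, List<String>> pair : replicate(row, 3, ";")) {
            System.out.println("(" + pair.getKey() + ", " + pair.getValue() + ")");
        }
    }
}
```

In a real Pig function the pairs would be emitted as tuples inside a
bag rather than as Map.Entry objects.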

Then use something like the following Pig script:

A = load ...
B = foreach A generate flatten(replicator(*));
C = group B by $0;
D = foreach C generate COUNT($1);

Passing * to replicator will give it the whole tuple.  And flattening
the result will remove the bag and the outer tuple that replicator
creates, so each row of B is (groupnum, inputtuple) and $0 is the
group number.
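To see what the replicate, flatten, group and count steps produce
together, here is a small plain-Java model of that pipeline.  It is a
sketch only, not Pig code: the class GroupCountSketch and its methods
are hypothetical, the group keys are assumed to come from splitting one
';'-delimited field, and the three rows are taken from the results in
the question quoted below.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Plain-Java model of: replicate each row, flatten, group by key, count.
// Hypothetical names; this does not use any Pig classes.
class GroupCountSketch {

    // Replicate + flatten + group: every row is added to the group of
    // each non-empty key found in row[keyField].
    static Map<String, List<List<String>>> groupByKey(
            List<List<String>> rows, int keyField, String delimiter) {
        Map<String, List<List<String>>> groups = new LinkedHashMap<>();
        for (List<String> row : rows) {
            String[] keys = row.get(keyField).split(Pattern.quote(delimiter));
            for (String key : keys) {
                if (!key.isEmpty()) {
                    groups.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
                }
            }
        }
        return groups;
    }

    // Count: the number of rows that landed in each group.
    static Map<String, Integer> countPerGroup(
            Map<String, List<List<String>>> groups) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, List<List<String>>> e : groups.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("19", "4000", "16700;", "11;55;102;", "17", "3.0"),
            Arrays.asList("19", "3000", "16700;", "11;55;28;", "9", "1.0"),
            Arrays.asList("19", "4000", "16700;", "11;55;64;", "8", "2.0"));
        // Prints {11=3, 55=3, 102=1, 28=1, 64=1}
        System.out.println(countPerGroup(groupByKey(rows, 3, ";")));
    }
}
```

Note that this model keeps the group key next to each count, and that a
key repeated within one row would be counted once per occurrence.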

Alan.

On Dec 19, 2008, at 10:55 AM, Michael Harris wrote:

> Hello.
>
> I am trying to write a custom group function that places a single  
> tuple into multiple groups so that one row can be counted several  
> times if it belongs to multiple groups. I have read the  
> GroupFunction wiki page, but I believe the data model has changed
> since it was written and it no longer applies. I tried to follow  
> it, but got eval func instantiation problems. So I figured I could  
> implement the grouping function as an EvalFunc and set multiple  
> tuples or dataatoms on the result tuple. This however does not add  
> the input tuple to multiple groups, instead it just creates a  
> single grouping on the entire set of the result tuple.
>
> Here is an example format of the input data :
>
> 6|3000|476;122;148;172;176;178;184;198;206;216;220;288;294;312;332;348;33100;378;408;422;428;38060;430;38900;472;41740;488;500;476;45300;548;|38;8;55;63;64;|8|1.0
>
> The eval func :
>
> public class SplitGroupFunc extends EvalFunc<Tuple> {
>
> 	@Override
> 	public void exec(Tuple arg0, Tuple arg1) throws IOException {
> 		String value = arg0.getAtomField(0).strval();
> 		if (arg0.getAtomField(1) != null) {
> 			String splitRegex = arg0.getAtomField(1).strval();
> 			if (splitRegex != null) {
> 				String[] values = value.split(splitRegex);
> 				for (String valueString : values) {
> 					if (valueString != null && !"".equals(valueString)) {
> 						arg1.appendTuple(new Tuple(new DataAtom(valueString)));
> 					}
> 				}
> 				return;
> 			}
> 		}
> 		arg1.appendField(new DataAtom(value));
> 	}
> }
>
> The script line :
>
> rawGroupedCatIndsMsa = GROUP raw BY (category,  
> com.gl.analysis.SplitGroupFunc(industries, ';'),  
> com.gl.analysis.SplitGroupFunc(msas, ';'));
>
> The result :
>
> ((19, (11, 55, 102), (16700)), {(19, 4000, 16700;, 11;55;102;, 17, 3.0)})
> ((19, (11, 55, 28), (16700)), {(19, 3000, 16700;, 11;55;28;, 9, 1.0)})
> ((19, (11, 55, 64), (16700)), {(19, 4000, 16700;, 11;55;64;, 8, 2.0)})
>
> For this data I would actually want the result :
>
> ((19, 11, 16700), {(19, 4000, 16700;, 11;55;102;, 17, 3.0), (19, 3000, 16700;, 11;55;28;, 9, 1.0), (19, 3000, 16700;, 11;55;28;, 9, 1.0)})
> ((19, 55, 16700), {(19, 4000, 16700;, 11;55;102;, 17, 3.0), (19, 3000, 16700;, 11;55;28;, 9, 1.0), (19, 3000, 16700;, 11;55;28;, 9, 1.0)})
> ((19, 102, 16700), {(19, 4000, 16700;, 11;55;102;, 17, 3.0)})
> ((19, 28, 16700), {(19, 3000, 16700;, 11;55;28;, 9, 1.0)})
> ((19, 64, 16700), {(19, 4000, 16700;, 11;55;64;, 8, 2.0)})
>
> Such that I get a grouping of each permutation.
>
> I see documentation of group functions in several places, but I  
> don't see a working example of one on the wiki...
>
> Any help would be greatly appreciated!
>
> -Michael Harris
>
>
>