Posted to user@pig.apache.org by Alan Gates <ga...@yahoo-inc.com> on 2009/01/03 00:04:06 UTC
Re: Custom Group Func
Pig (and Hadoop underneath) is built with the expectation that a
single tuple goes to a single group. But I think that what you want
to do can be accomplished in the following roundabout way.
First, you'll need an EvalFunc that returns a bag. Let's call this
function replicator. Each bag will have one or more tuples. Each
tuple will be of the format: (groupnum, inputtuple). It will be the
job of the replicator to create a copy of the tuple for each group it
will go in. So if the input to replicator is (a, b, c) and this
should go in groups 1 and 5, then the output of replicator will be:
{ (1, (a, b, c)), (5, (a, b, c))}.
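As a rough sketch of what replicator has to compute, here is the same logic in plain Java with no Pig classes involved. The class name, the method signature, and the rule that one field holds a delimited list of group keys (as in the industries column of the original question) are all assumptions made for illustration; a real version would extend EvalFunc and build a bag using whatever Pig API version is in use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java stand-in for the replicator EvalFunc; lists play the role of
// Pig tuples and bags. Names here are hypothetical, not part of Pig.
final class ReplicatorSketch {

    // Assumed rule: field `keyIndex` holds delimited group keys, e.g. "11;55;102;".
    // Returns one (groupKey, inputTuple) pair per non-empty key.
    static List<List<Object>> replicate(List<Object> tuple, int keyIndex, String delimiter) {
        List<List<Object>> bag = new ArrayList<>();
        String keys = String.valueOf(tuple.get(keyIndex));
        // Quote the delimiter so it is matched literally, not as a regex.
        for (String key : keys.split(Pattern.quote(delimiter))) {
            if (!key.isEmpty()) {
                List<Object> pair = new ArrayList<>();
                pair.add(key);    // the group this copy belongs to
                pair.add(tuple);  // the original input tuple, kept whole
                bag.add(pair);
            }
        }
        return bag;
    }

    public static void main(String[] args) {
        List<Object> row = Arrays.asList("19", "4000", "16700;", "11;55;102;", "17", "3.0");
        System.out.println(replicate(row, 3, ";"));
    }
}
```

For the sample row above this yields three pairs, keyed by 11, 55 and 102, each carrying the full input row.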
Then use something like the following Pig script:
A = load ...
B = foreach A generate flatten(replicator(*));
C = group B by $0;
D = foreach C generate COUNT($1);
Passing * to replicator will give it the whole tuple. And flattening
it will remove the bag and tuple nestings that replicator creates.
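To make the group-and-count step concrete, the sketch below simulates it outside Pig. It assumes relation B has already been flattened into (groupnum, inputtuple) records; the class and method names are hypothetical, and the second input tuple (d, e, f) is made up for the example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simulates "C = group B by $0; D = foreach C generate COUNT($1);"
// with plain Java collections. Not Pig code.
final class GroupCountSketch {

    // Each entry is one flattened record: key = groupnum, value = input tuple.
    static Map<String, Long> countByGroup(List<Map.Entry<String, List<String>>> flattened) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> record : flattened) {
            // A tuple copied into several groups adds one to each of them.
            counts.merge(record.getKey(), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> row = Arrays.asList("a", "b", "c");
        List<Map.Entry<String, List<String>>> b = new ArrayList<>();
        b.add(new SimpleEntry<>("1", row));  // (a, b, c) goes to group 1
        b.add(new SimpleEntry<>("5", row));  // ... and also to group 5
        b.add(new SimpleEntry<>("5", Arrays.asList("d", "e", "f")));
        System.out.println(countByGroup(b)); // prints {1=1, 5=2}
    }
}
```

The point of the design is that grouping itself stays one-tuple-to-one-group; the multiplicity comes entirely from the copies made before the group.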
Alan.
On Dec 19, 2008, at 10:55 AM, Michael Harris wrote:
> Hello.
>
> I am trying to write a custom group function that places a single
> tuple into multiple groups so that one row can be counted several
> times if it belongs to multiple groups. I have read the
> GroupFunction wiki page, but I believe the data model has changed
> since it was written and it no longer applies. I tried to follow
> it, but got eval func instantiation problems. So I figured I could
> implement the grouping function as an EvalFunc and set multiple
> tuples or dataatoms on the result tuple. This however does not add
> the input tuple to multiple groups, instead it just creates a
> single grouping on the entire set of the result tuple.
>
> Here is an example format of the input data :
>
> 6|3000|
> 476;122;148;172;176;178;184;198;206;216;220;288;294;312;332;348;33100;
> 378;408;422;428;38060;430;38900;472;41740;488;500;476;45300;548;|
> 38;8;55;63;64;|8|1.0
>
> The eval func :
>
> public class SplitGroupFunc extends EvalFunc<Tuple> {
>
>     @Override
>     public void exec(Tuple arg0, Tuple arg1) throws IOException {
>         String value = arg0.getAtomField(0).strval();
>         if (arg0.getAtomField(1) != null) {
>             String splitRegex = arg0.getAtomField(1).strval();
>             if (splitRegex != null) {
>                 String[] values = value.split(splitRegex);
>                 for (String valueString : values) {
>                     if (valueString != null && !"".equals(valueString)) {
>                         arg1.appendTuple(new Tuple(new DataAtom(valueString)));
>                     }
>                 }
>                 return;
>             }
>         }
>         arg1.appendField(new DataAtom(value));
>     }
> }
>
> The script line :
>
> rawGroupedCatIndsMsa = GROUP raw BY (category,
> com.gl.analysis.SplitGroupFunc(industries, ';'),
> com.gl.analysis.SplitGroupFunc(msas, ';'));
>
> The result :
>
> ((19, (11, 55, 102), (16700)), {(19, 4000, 16700;, 11;55;102;, 17,
> 3.0)})
> ((19, (11, 55, 28), (16700)), {(19, 3000, 16700;, 11;55;28;, 9, 1.0)})
> ((19, (11, 55, 64), (16700)), {(19, 4000, 16700;, 11;55;64;, 8, 2.0)})
>
> For this data I would actually want the result :
>
> ((19, 11, 16700), {(19, 4000, 16700;, 11;55;102;, 17, 3.0), (19,
> 3000, 16700;, 11;55;28;, 9, 1.0), (19, 3000, 16700;, 11;55;28;, 9,
> 1.0)})
> ((19, 55, 16700), {(19, 4000, 16700;, 11;55;102;, 17, 3.0), (19,
> 3000, 16700;, 11;55;28;, 9, 1.0), (19, 3000, 16700;, 11;55;28;, 9,
> 1.0)})
> ((19, 102, 16700), {(19, 4000, 16700;, 11;55;102;, 17, 3.0)})
> ((19, 28, 16700), {(19, 3000, 16700;, 11;55;28;, 9, 1.0)})
> ((19, 64, 16700), {(19, 4000, 16700;, 11;55;64;, 8, 2.0)})
>
> Such that I get a grouping of each permutation.
>
> I see documentation of group functions in several places, but I
> don't see a working example of one on the wiki...
>
> Any help would be greatly appreciated!
>
> -Michael Harris
>
>
>