Posted to user@pig.apache.org by Tamir Kamara <ta...@gmail.com> on 2009/07/19 09:46:40 UTC

Nested Split

Hi,

The following script gives an error because split cannot be used in nested
statements:
x1 = load 'file' as (a, b, c);
x2 = group x1 by a;
x3 = foreach x2 {
    split x1 into y1 if b==1 and c==1, y2 if b==2 and c==2, y3 if b==3 and c==3;
    generate group, COUNT(y1), COUNT(y2), COUNT(y3);
}

This forces y1, y2 and y3 to be defined in separate filter statements.

Does this mean that x1 will be scanned 3 times?
Shouldn't split work in the nested case also?

Thanks,
Tamir

RE: Nested Split

Posted by Santhosh Srinivasan <sm...@yahoo-inc.com>.
The only operators that are supported inside foreach (for now) are:

Filter
Distinct
Sort (ORDER BY)
Limit

Currently, there will be four pipelines inside the foreach to execute
your statement with filters: one for projecting the group and one for
each of the three COUNTs. As a result, x1 will be read three times.
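
For reference, the filter-based version of the script would look roughly
like the sketch below. It reuses the aliases and conditions from the
original post and only swaps the nested split for three nested filters;
'file' is the placeholder path from the question, and the exact form the
poster ended up running is an assumption.

```pig
-- Sketch only: same aliases and conditions as the original script,
-- with the unsupported nested split replaced by nested filters.
x1 = load 'file' as (a, b, c);
x2 = group x1 by a;
x3 = foreach x2 {
    -- each filter is evaluated over the grouped bag x1 on its own
    y1 = filter x1 by b == 1 and c == 1;
    y2 = filter x1 by b == 2 and c == 2;
    y3 = filter x1 by b == 3 and c == 3;
    generate group, COUNT(y1), COUNT(y2), COUNT(y3);
}
```

Each of the three nested filters feeds its own COUNT, which is where the
three reads of x1 come from.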

Thanks,
Santhosh
