Posted to user@pig.apache.org by 勇胡 <yo...@gmail.com> on 2011/07/19 15:00:47 UTC

why the foreach nested form can't work?

Hello,

I want to use a nested FOREACH statement to filter the tuples in a bag, but
it doesn't work. My Pig code is as follows:

A = LOAD '/home/test/student.txt' AS (name:chararray, no:int, score: int);
B = GROUP A BY no;
C =  FOREACH B {
    D = FILTER A BY A.score > 80;
    GENERATE D.name, D.score;}
DUMP C;

It always returns:
2011-07-19 14:50:20,329 [main] ERROR org.apache.pig.impl.plan.OperatorPlan - Attempt to connect operator D: Filter 1-87 which is not in the plan.
2011-07-19 14:50:20,332 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2219: Unable to process scalar in the plan

How can I fix it?

Thanks

Yong

Re: why the foreach nested form can't work?

Posted by Gianmarco <gi...@gmail.com>.

2011/7/19 勇胡 <yo...@gmail.com>

> How can I tell that 'A.score' is a bag? If I issue a 'describe B' command,
> I get B: {group:int, A: {name:chararray, no:int,score:int}}. From this I
> can't see that 'A.score' is a bag, but I can see that A.score is an
> element of the bag.
>

Because A is a bag and A.score is a projection of A onto the score field,
which is of course still a bag.
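
To make the projection concrete, here is a minimal sketch in plain Python (not Pig) that models a bag as a list of tuples. The helper names `group_by` and `project` are hypothetical and exist only for this illustration; the data is the sample input from this thread.

```python
from collections import defaultdict

# Field positions, following the LOAD schema (name, no, score).
NO, SCORE = 1, 2

rows = [("henrietta", 1, 25), ("sally", 1, 82), ("fred", 3, 120), ("elsie", 4, 45)]

def group_by(relation, field):
    """Model of GROUP A BY no: map each key to the bag (list) of its tuples."""
    groups = defaultdict(list)
    for t in relation:
        groups[t[field]].append(t)
    return dict(groups)

def project(bag, field):
    """Model of A.score: keep one field of every tuple, giving another bag."""
    return [(t[field],) for t in bag]

B = group_by(rows, NO)
score_bags = {key: project(bag, SCORE) for key, bag in B.items()}
print(score_bags)  # {1: [(25,), (82,)], 3: [(120,)], 4: [(45,)]}
```

In this model the projection of a list is always another list, even when it holds a single tuple, which mirrors why A.score is still a bag.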


> And why does it work if I delete the qualifier 'A.'?
>
>
Because that is the correct way to write it.
"FILTER relation BY field" is the correct syntax.


> I changed my Pig code to:
>
> A = LOAD '/home/huyong/test/student.txt' AS (name:chararray, no:int, score: int);
> B = GROUP A BY no;
> C =  FOREACH B {
>     D = FILTER A BY score > 80;
>     GENERATE D.name, D.score;}
> DUMP C;
>
> I got an empty bag!
>
> The input is:
> henrietta       1       25
> sally   1       82
> fred    3       120
> elsie   4       45
>
> The output is:
> ({(sally)},{(82)})
> ({(fred)},{(120)})
> ({},{})
>
> As you can see, I got an empty tuple. Why?
>
>
Because you are performing the filter inside a FOREACH over a GROUP BY no,
and no has 3 distinct values (1, 3, 4), so you get one output tuple per group.
For one of the 3 values (namely 4) the filter returns an empty bag (45 is not
greater than 80), so that tuple contains only empty bags.
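
The per-group behaviour can be sketched in plain Python (a model of the semantics, not Pig). The function names `group_by_no` and `nested_filter` are hypothetical, and the groups are emitted in sorted key order purely for readability.

```python
from collections import defaultdict

rows = [("henrietta", 1, 25), ("sally", 1, 82), ("fred", 3, 120), ("elsie", 4, 45)]

def group_by_no(relation):
    """Model of B = GROUP A BY no."""
    groups = defaultdict(list)
    for t in relation:
        groups[t[1]].append(t)
    return dict(groups)

def nested_filter(groups, threshold=80):
    """Model of the nested FOREACH: one output record per group, always."""
    result = []
    for no in sorted(groups):
        # D = FILTER A BY score > 80;  (evaluated inside this one group)
        d = [t for t in groups[no] if t[2] > threshold]
        # GENERATE D.name, D.score;    (two bags, possibly both empty)
        result.append(([(t[0],) for t in d], [(t[2],) for t in d]))
    return result

c = nested_filter(group_by_no(rows))
for record in c:
    print(record)
```

The last record printed is `([], [])`: group 4 still produces a record, it just carries two empty bags.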


> Yong
>
>
Cheers,
--
Gianmarco De Francisci Morales




Re: why the foreach nested form can't work?

Posted by Daniel Dai <da...@hortonworks.com>.

If you refer to a field in the base relation, you only need to use the
column name:
B = FILTER A BY score>80;

Here A is the base relation, so you only need to say "score" instead of
"A.score". Otherwise, Pig will think you are using A as a scalar
(http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#Casting+Relations+to+Scalars).
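
As a rough analogy in plain Python (not Pig), the sketch below contrasts a per-tuple test on the field with a reference to the whole column. The helpers `filter_by_score` and `column` are hypothetical names. The TypeError is Python's own behaviour and is only an analogy: Pig does not raise this error but instead treats the relation as a scalar, as described above.

```python
SCORE = 2
rows = [("henrietta", 1, 25), ("sally", 1, 82), ("fred", 3, 120), ("elsie", 4, 45)]

def filter_by_score(relation, threshold):
    """Model of FILTER A BY score > threshold: the test runs once per tuple."""
    return [t for t in relation if t[SCORE] > threshold]

def column(relation, field):
    """Model of what A.score names: the score values of the whole relation."""
    return [t[field] for t in relation]

b = filter_by_score(rows, 80)
print(b)  # [('sally', 1, 82), ('fred', 3, 120)]

# Comparing the whole column with one number is not a per-tuple test.
# Python rejects it outright; this is an analogy, not Pig's behaviour.
try:
    column(rows, SCORE) > 80
    comparable = True
except TypeError:
    comparable = False
print(comparable)  # False
```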

Daniel

2011/7/20 勇胡 <yo...@gmail.com>

> Thanks for your response. Now I am wondering in which situations I can use
> "." to reference a field. In Pig, if I understand correctly, each relation
> is a bag. If I issue these commands:
>
> A = LOAD '/home/huyong/test/student.txt' AS (name:chararray, no:int, score: int);
> B = FILTER A BY A.score>80;
>
> there is no problem at compile time and the Pig code executes, but in the
> end I don't get the results I expect. As you mentioned, A.score is a bag and
> 80 is a constant, so they are not compatible. This is very different from
> SQL. If I use:
>
> B = FILTER A BY score>80;
>
> there is no problem, and the statement performs the filter as intended.
>
> The same problem occurs with the operators "group, cogroup, join, split,
> order, cross". The inputs of these operators only support fields, not bags
> (if I use "." to reference the field, I get wrong output). If these ordinary
> operators cannot support "bag" operations, I can't see why Pig needs a bag
> type, since the operators can only work on flattened types.
>
> Regards!
>
> Yong

Re: why the foreach nested form can't work?

Posted by 勇胡 <yo...@gmail.com>.

Thanks for your response. Now I am wondering in which situations I can use
"." to reference a field. In Pig, if I understand correctly, each relation
is a bag. If I issue these commands:

A = LOAD '/home/huyong/test/student.txt' AS (name:chararray, no:int, score: int);
B = FILTER A BY A.score>80;

there is no problem at compile time and the Pig code executes, but in the
end I don't get the results I expect. As you mentioned, A.score is a bag and
80 is a constant, so they are not compatible. This is very different from
SQL. If I use:

B = FILTER A BY score>80;

there is no problem, and the statement performs the filter as intended.

The same problem occurs with the operators "group, cogroup, join, split,
order, cross". The inputs of these operators only support fields, not bags
(if I use "." to reference the field, I get wrong output). If these ordinary
operators cannot support "bag" operations, I can't see why Pig needs a bag
type, since the operators can only work on flattened types.

Regards!

Yong

Re: why the foreach nested form can't work?

Posted by Jacob Perkins <ja...@gmail.com>.

On Tue, 2011-07-19 at 16:05 +0200, 勇胡 wrote:
> How can I tell that 'A.score' is a bag? If I issue a 'describe B' command,
> I get B: {group:int, A: {name:chararray, no:int,score:int}}.
The output of describe shows that A is a bag (note the '{' and '}'
characters), yes? So 'A.score' is simply the bag of all the scores in the
group. You can go further and get a bag of both the numbers and the scores
with 'A.(no, score)'. I admit that it _is_ confusing at first.

> From this I can't see that 'A.score' is a bag, but I can see that
> A.score is an element of the bag.
Not true. 'score' is the name of the field. 'A.score' is a bag of just
the scores. Using the dot '.' is a way of pulling out specific fields
from every tuple within a bag to result in another bag. Consider:

A = LOAD 'foo.tsv' AS (name:chararray, no:int, score: int);
B = GROUP A BY no;
DUMP B;

(1,{(henrietta,1,25),(sally,1,82)})
(3,{(fred,3,120)})
(4,{(elsie,4,45)})

C = FOREACH B GENERATE A.score;
DUMP C;

({(25),(82)})
({(120)})
({(45)})

Got it?

> And why does it work if I delete the qualifier 'A.'?
>
> I changed my Pig code to:
>
> A = LOAD '/home/huyong/test/student.txt' AS (name:chararray, no:int, score: int);
> B = GROUP A BY no;
> C =  FOREACH B {
>     D = FILTER A BY score > 80;
>     GENERATE D.name, D.score;}
> DUMP C;
> 
> I got an empty bag!
'D.name' and 'D.score' are bags of tuples. You will need to FLATTEN them
at the end, as in my earlier example.

> 
> The input is:
> henrietta       1       25
> sally   1       82
> fred    3       120
> elsie   4       45
>
> The output is:
> ({(sally)},{(82)})
> ({(fred)},{(120)})
> ({},{})
>
> As you can see, I got an empty tuple. Why?
There are three tuples, one for each group (1, 3, and 4). The filter
condition left the bags for group 4 empty, since its only tuple,
(elsie,4,45), did not have a score > 80. If you FLATTEN the bags, the
empty ones are discarded.
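
A minimal plain-Python sketch of that flattening step (a model of the semantics, not Pig; the `flatten` helper is a hypothetical name): each bag contributes one output row per tuple it holds, so an empty bag contributes nothing.

```python
def flatten(bags):
    """Model of GENERATE FLATTEN(bag): one output row per tuple in each bag."""
    out = []
    for bag in bags:
        for t in bag:  # an empty bag never enters this loop, so it adds no rows
            out.append(t)
    return out

# D.(name, score) after the filter, for groups 1, 3 and 4 respectively.
filtered = [[("sally", 82)], [("fred", 120)], []]

flat = flatten(filtered)
print(flat)  # [('sally', 82), ('fred', 120)]
```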

--jacob
@thedatachef


Re: why the foreach nested form can't work?

Posted by 勇胡 <yo...@gmail.com>.

How can I tell that 'A.score' is a bag? If I issue a 'describe B' command,
I get B: {group:int, A: {name:chararray, no:int,score:int}}. From this I
can't see that 'A.score' is a bag, but I can see that A.score is an element
of the bag.
And why does it work if I delete the qualifier 'A.'?

I changed my Pig code to:

A = LOAD '/home/huyong/test/student.txt' AS (name:chararray, no:int, score: int);
B = GROUP A BY no;
C =  FOREACH B {
    D = FILTER A BY score > 80;
    GENERATE D.name, D.score;}
DUMP C;

I got an empty bag!

The input is:
henrietta       1       25
sally   1       82
fred    3       120
elsie   4       45

The output is:
({(sally)},{(82)})
({(fred)},{(120)})
({},{})

As you can see, I got an empty tuple. Why?

Yong


Re: why the foreach nested form can't work?

Posted by Jacob Perkins <ja...@gmail.com>.

I think it's because 'A.score' is a bag but Pig needs a reference to a
field in the tuples. This worked for me:

A = LOAD 'foo.tsv' AS (name:chararray, no:int, score: int);
B = GROUP A BY no;
C = FOREACH B {
      D = FILTER A BY score > 80;
      GENERATE FLATTEN(D.(name, score));
    };
DUMP C;

on the following data:

$: cat foo.tsv 
henrietta	1	25
sally	1	82
fred	3	120
elsie	4	45

yields:

(sally,82)
(fred,120)

Does that work for you?

--jacob
@thedatachef
