Posted to user@pig.apache.org by Marco Brinkmann <ma...@cope.io> on 2013/05/29 15:55:02 UTC

Count empty relation after filtering

Hi everybody,

I have a rather simple question and scenario, but I still could not find an
answer in the documentation or in other resources:

id, valid
(1, false)
(2, false)

records = LOAD 'test.csv' USING PigStorage(',') AS (id:long, valid:boolean);

test = FILTER records BY valid == true;
test_count = FOREACH (GROUP test ALL) GENERATE COUNT(test);

DUMP test_count;


I would expect 'test_count' to contain '0' now. But the dump is
completely empty (with 'valid == false' I get '(2)' as expected). I use Pig
0.11.1.

Could someone point me in the right direction?

Cheers, Marco

Re: Count empty relation after filtering

Posted by Peter Connolly <pt...@yahoo.com>.
Try the bincond operator.  Something like this might work:

...
test_count = FOREACH (GROUP test ALL) GENERATE COUNT(test) AS total;
-- test_count only carries 'total'; 'id' and 'valid' are not available after the GROUP
test_count_2 = FOREACH test_count GENERATE (total IS NULL ? 0L : total) AS total;
DUMP test_count_2;


-Peter


________________________________
 From: Marco Brinkmann <ma...@cope.io>
To: user@pig.apache.org 
Sent: Wednesday, May 29, 2013 11:43 AM
Subject: Re: Count empty relation after filtering
 

I tried to explain why, in my basic understanding, an operation in a FOREACH
(COUNT, COUNT_STAR or anything else) will not lead to any success. And I
would still appreciate any hints or tricks to achieve the above.



Re: Count empty relation after filtering

Posted by Marco Brinkmann <ma...@cope.io>.
Great analysis. Couldn't agree more.


2013/5/29 mehmet <me...@yahoo.com>

> I tried your code on 0.10 and it gives the same result. I can logically
> explain why it gives you this result, although I am not convinced that
> would be the desired outcome.
>
> If you think of it as there is a synthetic implicit key 'all' in each
> tuple, and you are grouping over that, you can see why there is no output:
> no tuples, nothing to group over (no reducer sees the key 'all', because
> it doesn't exist). Although, I would contend that when there are no
> tuples, it might be ideal to output (all,{}) as the output of the group
> all.

Re: Count empty relation after filtering

Posted by mehmet <me...@yahoo.com>.
I tried your code on 0.10 and it gives the same result. I can logically
explain why it gives you this result, although I am not convinced that
would be the desired outcome.

If you think of it as there is a synthetic implicit key 'all' in each
tuple, and you are grouping over that, you can see why there is no output:
no tuples, nothing to group over (no reducer sees the key 'all', because
it doesn't exist). Although, I would contend that when there are no
tuples, it might be ideal to output (all,{}) as the output of the group
all.
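
The explanation above can be modelled in a few lines of Python. This is only
a sketch of the idea, not Pig's actual implementation; the function names
group_all and count_per_group are made up for illustration. It compares
filtering before the grouping (as in the original script) with filtering
inside the group.

```python
from collections import defaultdict

def group_all(relation):
    """Model GROUP ... ALL: tag every tuple with the synthetic key 'all'
    and collect the tuples per key, as a reducer would."""
    groups = defaultdict(list)
    for row in relation:              # "map" phase: one ('all', row) per input tuple
        groups["all"].append(row)
    # One output tuple per key that was actually seen.
    return [(key, bag) for key, bag in groups.items()]

def count_per_group(grouped):
    """Model FOREACH grouped GENERATE COUNT(bag): one output row per group."""
    return [(len(bag),) for _key, bag in grouped]

records = [(1, False), (2, False)]

# Filter first, then group: the filtered relation is empty, so the key
# 'all' is never emitted and there is no group to count.
valid = [r for r in records if r[1] is True]
filter_then_group = count_per_group(group_all(valid))

# Group first, then filter inside the group: the key 'all' exists,
# so one row is produced even though the filtered bag is empty.
group_then_filter = [
    (len([r for r in bag if r[1] is True]),)
    for _key, bag in group_all(records)
]

print(filter_then_group)   # []
print(group_then_filter)   # [(0,)]
```

In Pig terms, the second variant would correspond to grouping the unfiltered
relation and applying the FILTER inside a nested FOREACH block. It only
yields a row as long as the unfiltered relation itself is non-empty.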


________________________________
 From: Marco Brinkmann <ma...@cope.io>
To: user@pig.apache.org 
Sent: Wednesday, May 29, 2013 8:43 AM
Subject: Re: Count empty relation after filtering
 

I tried to explain why, in my basic understanding, an operation in a FOREACH
(COUNT, COUNT_STAR or anything else) will not lead to any success. And I
would still appreciate any hints or tricks to achieve the above.



Re: Count empty relation after filtering

Posted by Marco Brinkmann <ma...@cope.io>.
I tried to explain why, in my basic understanding, an operation in a FOREACH
(COUNT, COUNT_STAR or anything else) will not lead to any success. And I
would still appreciate any hints or tricks to achieve the above.


2013/5/29 Shahab Yunus <sh...@gmail.com>

> So basically this means that we were trying to look at this from an RDBMS SQL
> perspective, where 'SELECT COUNT(*) FROM TABLE' returns 0 even if there is
> nothing in the result set, and that is why we ignored the possibility that
> FOREACH might not be executed at all (which could be by design)?
>
> -Shahab

Re: Count empty relation after filtering

Posted by Shahab Yunus <sh...@gmail.com>.
So basically this means that we were trying to look at this from an RDBMS SQL
perspective, where 'SELECT COUNT(*) FROM TABLE' returns 0 even if there is
nothing in the result set, and that is why we ignored the possibility that
FOREACH might not be executed at all (which could be by design)?

-Shahab
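
The SQL behaviour referred to above can be checked with Python's built-in
sqlite3 module. This is a small sketch: the table and column names are chosen
to mirror the example data, and SQLite merely stands in for "an RDBMS". It
also suggests why the Pig script behaves differently: a grouped SQL aggregate
returns no rows either when nothing matches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, valid INTEGER)")
conn.executemany("INSERT INTO records VALUES (?, ?)", [(1, 0), (2, 0)])

# Ungrouped aggregate: exactly one result row, even when no row matches.
ungrouped = conn.execute(
    "SELECT COUNT(*) FROM records WHERE valid = 1"
).fetchall()

# Grouped aggregate: one row per group, so no matching rows means no groups
# and therefore no output rows at all.
grouped = conn.execute(
    "SELECT COUNT(*) FROM records WHERE valid = 1 GROUP BY valid"
).fetchall()

print(ungrouped)  # [(0,)]
print(grouped)    # []
conn.close()
```

Seen this way, GROUP ... ALL followed by FOREACH resembles the grouped query
more than the ungrouped one, which would be consistent with the empty dump.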


On Wed, May 29, 2013 at 10:13 AM, Marco Brinkmann
<ma...@cope.io> wrote:

> Thanks, but this does not change anything. My personal guess (and I only
> work for a few days with pig) is that FOREACH will never be executed,
> because the relation 'test' is empty.

Re: Count empty relation after filtering

Posted by Marco Brinkmann <ma...@cope.io>.
Thanks, but this does not change anything. My personal guess (and I only
work for a few days with pig) is that FOREACH will never be executed,
because the relation 'test' is empty.


2013/5/29 Shahab Yunus <sh...@gmail.com>

> Try COUNT_STAR.
>
> -Shahab

Re: Count empty relation after filtering

Posted by Shahab Yunus <sh...@gmail.com>.
Try COUNT_STAR.

-Shahab


On Wed, May 29, 2013 at 9:55 AM, Marco Brinkmann <ma...@cope.io> wrote:

> Hi everybody,
>
> I have a rather simple question and scenario, but I still could not find an
> answer in the documentation or in other resources:
>
> id, valid
> (1, false)
> (2, false)
>
> records = LOAD 'test.csv' USING PigStorage(',') AS (id:long, valid:boolean);
>
> test = FILTER records BY valid == true;
> test_count = FOREACH (GROUP test ALL) GENERATE COUNT(test);
>
> DUMP test_count;
>
>
> I would expect 'test_count' to contain '0' now. But the dump is
> completely empty (with 'valid == false' I get '(2)' as expected). I use Pig
> 0.11.1.
>
> Could someone point me in the right direction?
>
> Cheers, Marco
>