Posted to user@pig.apache.org by William Oberman <ob...@civicscience.com> on 2011/06/03 21:53:57 UTC

trying to count all tuples

Howdy,

I'm coming from Cassandra, and I'm actually trying to count all columns in a
column family.  I believe that is similar to counting the number of tuples in a
bag, in the lingo of the Pig manual.  It was harder than I expected, but I
think this works:
rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING CassandraStorage()
AS (key, columns: bag {T: tuple(name, value)});
counts = FOREACH rows GENERATE COUNT(columns);
counts_in_bag = GROUP counts ALL;
sum_of_bag = FOREACH counts_in_bag  GENERATE SUM($1);
dump sum_of_bag;

My question is: am I right that it works?  I started with 3 keys having a
total of 5 columns and got (5).  Then I added a new key/column, and another
column on an existing key and got (7).  So, it seems like it's working.
But, was there a better way to write it?

Thanks!

will
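
On the "better way" question, one common alternative is to flatten the bag so that every cell becomes its own tuple, and then count once. This is only a sketch against the schema declared in the message above; the aliases `cells`, `all_cells`, and `total` are made up for illustration, and it has not been run against CassandraStorage. Note that COUNT skips tuples whose first field is null, and that rows with an empty bag should contribute nothing after FLATTEN.

```pig
-- Sketch only: same LOAD as in the message above; aliases are illustrative.
rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING CassandraStorage()
       AS (key, columns: bag {T: tuple(name, value)});

-- Emit one output tuple per (name, value) pair, i.e. one per cell.
cells = FOREACH rows GENERATE FLATTEN(columns);

-- Put every cell into a single group and count the tuples in it.
all_cells = GROUP cells ALL;
total = FOREACH all_cells GENERATE COUNT(cells);
DUMP total;
```

Whether this is actually faster than the COUNT-then-SUM version likely depends on the data; it is mainly one statement shorter.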

Re: trying to count all tuples

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Thanks for following through, William!
D

On Wed, Jun 8, 2011 at 1:56 PM, William Oberman
<ob...@civicscience.com> wrote:
> Just in case this ends up as someone else's answer someday, here is the
> working query on real data:
> rows = LOAD 'cassandra://civicscience/observations' USING
> CassandraStorage();
> filter_rows = FILTER rows BY $1 is not null;
> counts = FOREACH filter_rows GENERATE COUNT($1);
> counts_in_bag = GROUP counts ALL;
> sum_of_bag = FOREACH counts_in_bag  GENERATE SUM($1);
> dump sum_of_bag;
>
> For some reason, declaring the bag's type in the AS clause was causing me
> problems.

Re: trying to count all tuples

Posted by William Oberman <ob...@civicscience.com>.

Just in case this ends up as someone else's answer someday, here is the
working query on real data:
rows = LOAD 'cassandra://civicscience/observations' USING
CassandraStorage();
filter_rows = FILTER rows BY $1 is not null;
counts = FOREACH filter_rows GENERATE COUNT($1);
counts_in_bag = GROUP counts ALL;
sum_of_bag = FOREACH counts_in_bag  GENERATE SUM($1);
dump sum_of_bag;

For some reason, declaring the bag's type in the AS clause was causing me
problems.

On Tue, Jun 7, 2011 at 4:58 PM, William Oberman <ob...@civicscience.com> wrote:

> I think FILTER will do the trick?  E.g.
>
>
> rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING
> CassandraStorage() AS (key, columns: bag {T: tuple(name, value)});
> filter_rows = FILTER rows BY columns is not null;
> counts = FOREACH filter_rows GENERATE COUNT(columns);
>
> counts_in_bag = GROUP counts ALL;
> sum_of_bag = FOREACH counts_in_bag  GENERATE SUM($1);
> dump sum_of_bag;



-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) oberman@civicscience.com

Re: trying to count all tuples

Posted by William Oberman <ob...@civicscience.com>.

I think FILTER will do the trick?  E.g.

rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING CassandraStorage()
AS (key, columns: bag {T: tuple(name, value)});
filter_rows = FILTER rows BY columns is not null;
counts = FOREACH filter_rows GENERATE COUNT(columns);
counts_in_bag = GROUP counts ALL;
sum_of_bag = FOREACH counts_in_bag  GENERATE SUM($1);
dump sum_of_bag;


On Tue, Jun 7, 2011 at 4:33 PM, William Oberman <ob...@civicscience.com> wrote:

> I tried this same script on data closer to production, and I'm getting
> errors.  I'm 50% sure it's this:
> https://issues.apache.org/jira/browse/PIG-1283
>
> One of my rows in Cassandra may have no columns, which may produce a
> null bag, which causes COUNT to blow up (at least, that's my theory).  As a
> workaround, can I have COUNT ignore/skip rows with null columns?  I'll start
> digging through the docs as well.
>
> will

Re: trying to count all tuples

Posted by William Oberman <ob...@civicscience.com>.

I tried this same script on data closer to production, and I'm getting
errors.  I'm 50% sure it's this:
https://issues.apache.org/jira/browse/PIG-1283

One of my rows in Cassandra may have no columns, which may produce a
null bag, which causes COUNT to blow up (at least, that's my theory).  As a
workaround, can I have COUNT ignore/skip rows with null columns?  I'll start
digging through the docs as well.

will
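
One way to guard COUNT without dropping any rows might be a bincond that substitutes 0 when the bag is null. The sketch below reuses the schema from the original script; the field name `n` is made up for illustration, and whether this actually sidesteps the failure described in PIG-1283 is an untested assumption.

```pig
-- Sketch only: schema taken from the original script; "n" is an illustrative name.
rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING CassandraStorage()
       AS (key, columns: bag {T: tuple(name, value)});

-- Bincond: emit 0 for a null bag, otherwise the number of tuples in the bag.
counts = FOREACH rows GENERATE ((columns IS NULL) ? 0L : COUNT(columns)) AS n;

-- Sum the per-row counts into a single total.
counts_in_bag = GROUP counts ALL;
sum_of_bag = FOREACH counts_in_bag GENERATE SUM(counts.n);
DUMP sum_of_bag;
```

Unlike a FILTER, this keeps rows that have a null bag in the relation and simply lets them contribute zero to the total.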

On Fri, Jun 3, 2011 at 4:09 PM, William Oberman <ob...@civicscience.com> wrote:

> That is exactly what I wanted. Thanks for confirming!

Re: trying to count all tuples

Posted by William Oberman <ob...@civicscience.com>.

That is exactly what I wanted. Thanks for confirming!

On Fri, Jun 3, 2011 at 4:06 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> I am not sure what you mean by "count all columns". The code you have
> counts all *cells*.
> So:
> id1: col1, col2
> id2: col1, col2, col3
>
> has 3 columns in a conventional sense, but your code will return 5. Is
> that what you want? If so, your code seems correct.
>
> D

Re: trying to count all tuples

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

I am not sure what you mean by "count all columns". The code you have
counts all *cells*.
So:
id1: col1, col2
id2: col1, col2, col3

has 3 columns in a conventional sense, but your code will return 5. Is
that what you want? If so, your code seems correct.

D
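
To make the cells-versus-columns distinction concrete, counting columns "in a conventional sense" would mean counting distinct column names across all rows. A minimal sketch, assuming the schema from the original script (the aliases `names`, `unique_names`, `all_names`, and `column_count` are made up for illustration):

```pig
-- Sketch only: schema taken from the original script; aliases are illustrative.
rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING CassandraStorage()
       AS (key, columns: bag {T: tuple(name, value)});

-- Keep only the column names, one tuple per cell.
names = FOREACH rows GENERATE FLATTEN(columns.name) AS name;

-- Collapse repeated names so that each column name appears once.
unique_names = DISTINCT names;

-- Count the distinct names.
all_names = GROUP unique_names ALL;
column_count = FOREACH all_names GENERATE COUNT(unique_names);
DUMP column_count;
```

For the id1/id2 example above, this should come to 3 (col1, col2, col3), whereas the original script returns 5.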

On Fri, Jun 3, 2011 at 12:53 PM, William Oberman
<ob...@civicscience.com> wrote:
> Howdy,
>
> I'm coming from Cassandra, and I'm actually trying to count all columns in a
> column family.  I believe that is similar to counting the number of tuples in a
> bag, in the lingo of the Pig manual.  It was harder than I expected, but I
> think this works:
> rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING CassandraStorage()
> AS (key, columns: bag {T: tuple(name, value)});
> counts = FOREACH rows GENERATE COUNT(columns);
> counts_in_bag = GROUP counts ALL;
> sum_of_bag = FOREACH counts_in_bag  GENERATE SUM($1);
> dump sum_of_bag;
>
> My question is: am I right that it works?  I started with 3 keys having a
> total of 5 columns and got (5).  Then I added a new key/column, and another
> column on an existing key and got (7).  So, it seems like it's working.
> But, was there a better way to write it?
>
> Thanks!
>
> will
>