Posted to user@pig.apache.org by Jason Alexander <ja...@shopsavvy.com> on 2012/03/26 19:39:09 UTC

Count grouped by title

Hey guys,



Continuing on in my Pig education, I'm trying to pivot my previous script to give me a breakdown of counts by title.

The script I have so far is:

/* scans grouped by title */

scans 			= LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
productscans 	= FILTER scans BY (title MATCHES 'battery');
groupedscans	= GROUP productscans BY title;
scancount		= FOREACH groupedscans GENERATE title, COUNT(productscans);
--DUMP scancount;
STORE scancount INTO '/output/scans/groupedscans.out';



I'm sure it's something goofy and easy, but any help would be much appreciated!


Thanks,
-Jason

Re: Count grouped by title

Posted by Jason Alexander <ja...@shopsavvy.com>.
Ugh, disregard - this was an error in my regex. User error here.


Thanks again,
-Jason




Re: Count grouped by title

Posted by Norbert Burger <no...@gmail.com>.
Pig's MATCHES operator takes Java regular expressions and requires the pattern
to match the entire string, as if the regex were anchored at both the beginning
and the end of the string being searched.  This means that the predicate
...matches 'battery' only returns strings that are exactly "battery", not
strings that merely contain "battery".

Try using ...matches '.*battery.*' instead.
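
As a rough sketch of the difference, assuming the same LOAD statement and field names as the original script (the aliases exact_only and anycase are made up for illustration, and the case-insensitive variant is an extra that nobody in the thread asked for):

```pig
scans        = LOAD '/hive/scans/*' USING PigStorage(',')
               AS (thetime:long, product_id:long, lat:double, lon:double,
                   user:chararray, category:chararray, title:chararray);

-- Whole-string match: keeps only rows whose title is exactly "battery".
exact_only   = FILTER scans BY (title MATCHES 'battery');

-- Substring match: the .* on each side lets the pattern cover the whole title.
productscans = FILTER scans BY (title MATCHES '.*battery.*');

-- Hypothetical variant: (?i) is Java's inline flag for case-insensitive matching.
anycase      = FILTER scans BY (title MATCHES '(?i).*battery.*');
```

With the second filter, every distinct title containing "battery" survives, so a later GROUP ... BY title should produce one row per title rather than a single "battery" row.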

Norbert


Re: Count grouped by title

Posted by Jason Alexander <ja...@shopsavvy.com>.
Thanks Prashant,


Well, before I wasn't getting any specific error, I was just getting nothing written out. 

Updating the script based on your feedback, the output I get is:

battery	303

Which I assume is the total number of records that have the word "battery" in the title.

Ultimately, what I would like to see is:

battery title 1			15
battery title 2			304
battery title 3			573
.
.
.


How can I accomplish that? 


Thanks again for all your help,
-Jason



Re: Count grouped by title

Posted by Prashant Kommireddi <pr...@gmail.com>.
After a GROUP, the grouping key is no longer named title; Pig exposes it as
the implicit field 'group', so that is what you need to reference. The error
was pretty clear in this case.

grunt> scancount = FOREACH groupedscans GENERATE title, COUNT(productscans);
2012-03-26 10:41:43,497 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
<line 5, column 56> Invalid field projection. Projected field [title] does not exist in schema:
group:chararray,productscans:bag{:tuple(thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray)}.


Instead use 'group'

grunt> scancount = FOREACH groupedscans GENERATE group, COUNT(productscans);
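
For reference, here is a sketch of how the whole script might look with that change applied. It assumes the same paths and schema as the original post. The AS title and AS num_scans aliases are additions for readability only, and the '.*battery.*' pattern is the one suggested elsewhere in this thread rather than part of this fix:

```pig
/* scans grouped by title */
scans        = LOAD '/hive/scans/*' USING PigStorage(',')
               AS (thetime:long, product_id:long, lat:double, lon:double,
                   user:chararray, category:chararray, title:chararray);

-- MATCHES has to cover the whole title, hence the .* on both sides.
productscans = FILTER scans BY (title MATCHES '.*battery.*');

groupedscans = GROUP productscans BY title;

-- Prints the schema of the grouped relation: a 'group' key plus a bag named productscans.
DESCRIBE groupedscans;

-- Project the key as 'group' (renamed back to title) and count the tuples in each bag.
scancount    = FOREACH groupedscans GENERATE group AS title, COUNT(productscans) AS num_scans;

--DUMP scancount;
STORE scancount INTO '/output/scans/groupedscans.out';
```

Running DESCRIBE before the FOREACH is a quick way to check which field names are actually available to project.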

Thanks,
Prashant
