Posted to user@pig.apache.org by Jason Alexander <ja...@shopsavvy.com> on 2012/03/26 19:39:09 UTC
Count grouped by title
Hey guys,
Continuing on in my Pig education, I'm trying to pivot my previous script to give me a breakdown of counts by title.
The script I have so far is:
/* scans grouped by title */
scans = LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
productscans = FILTER scans BY (title MATCHES 'battery');
groupedscans = GROUP productscans BY title;
scancount = FOREACH groupedscans GENERATE title, COUNT(productscans);
--DUMP scancount;
STORE scancount INTO '/output/scans/groupedscans.out';
I'm sure it's something goofy and easy, but any help would be much appreciated!
Thanks,
-Jason
Re: Count grouped by title
Posted by Jason Alexander <ja...@shopsavvy.com>.
Ugh, disregard - this was an error in my regex. User error here.
Thanks again,
-Jason
Re: Count grouped by title
Posted by Norbert Burger <no...@gmail.com>.
Pig's MATCHES uses Java regular expressions and has to match the whole
string, as if the regex were anchored at the beginning and end of your
string-to-be-searched. This means that the predicate ...matches 'battery'
only returns strings that are exactly "battery", instead of strings that
contain "battery".
Try using ...matches '.*battery.*' instead.
Norbert
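The whole-string behaviour described in this reply can be checked outside Pig. The sketch below is plain Java, not Pig code, and it assumes (as the reply states) that MATCHES follows the same semantics as Java's String.matches. The class name MatchesDemo, the helper fullMatch, and the sample titles are made up for illustration.

```java
import java.util.List;

class MatchesDemo {
    // String.matches succeeds only when the regex covers the ENTIRE input,
    // which is the behaviour assumed here for Pig's MATCHES.
    static boolean fullMatch(String title, String regex) {
        return title.matches(regex);
    }

    public static void main(String[] args) {
        // Hypothetical titles, not taken from the original data set.
        List<String> titles = List.of("battery", "AA battery 4-pack", "phone charger");

        for (String title : titles) {
            // 'battery' matches only the exact string "battery";
            // '.*battery.*' matches any title containing "battery".
            System.out.println(title
                + " | battery: " + fullMatch(title, "battery")
                + " | .*battery.*: " + fullMatch(title, ".*battery.*"));
        }
    }
}
```

Under that assumption, filtering with 'battery' keeps only titles that are exactly "battery", which would explain the single output row, while '.*battery.*' lets every title containing the word through so that each one becomes its own group.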
Re: Count grouped by title
Posted by Jason Alexander <ja...@shopsavvy.com>.
Thanks Prashant,
Well, before I wasn't getting any specific error; I was just getting nothing written out.
Updating the script based on your feedback, the output I get is:
battery 303
Which I assume is the total number of records that have the word "battery" in the title.
Ultimately, what I would like to see is:
battery title 1 15
battery title 2 304
battery title 3 573
.
.
.
How can I accomplish that?
Thanks again for all your help,
-Jason
Re: Count grouped by title
Posted by Prashant Kommireddi <pr...@gmail.com>.
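You need to use the implicit 'group' field to reference title: after
GROUP ... BY title, the key is exposed as 'group', and title only exists
inside the productscans bag. The error was pretty clear in this case.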
grunt> scancount = FOREACH groupedscans GENERATE title, COUNT(productscans);
2012-03-26 10:41:43,497 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
<line 5, column 56> Invalid field projection. Projected field [title] does not exist in schema: group:chararray,productscans:bag{:tuple(thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray)}.
Instead, use 'group':
grunt> scancount = FOREACH groupedscans GENERATE group, COUNT(productscans);
Thanks,
Prashant