Posted to user@pig.apache.org by Peter Maas <pf...@gmail.com> on 2012/01/26 10:04:35 UTC

getting the top N entries per hour

Hi,

I'm trying to write a Pig script to create a list of the top N IP entries per hour.


Currently I have something like this:


PER_IP = GROUP CFP_LOGS_CLICKS_WITHOUT_0 BY (dayNumber, hourNumber, ip);
IP_COUNT = FOREACH PER_IP {
    numEntries = COUNT(CFP_LOGS_CLICKS_WITHOUT_0.timestamp);
    GENERATE group.dayNumber, group.hourNumber, group.ip, numEntries;
};

IP_COUNT_GROUPED = GROUP IP_COUNT BY ($0, $1);
IP_COUNT_PER_HOUR = FOREACH IP_COUNT_GROUPED GENERATE group.dayNumber, group.hourNumber, MAX(IP_COUNT.$3), AVG(IP_COUNT.$3);

DUMP IP_COUNT_PER_HOUR;



This gives me the highest number of hits per hour from a single IP and the average number of hits per IP. What I would like to get is:

- The first N entries with the highest visit count, preferably with count AND value

I've been looking at LIMIT and ORDER BY but don't really get how to wire them in so they operate on each group instead of on all the data.

Any help and pointers appreciated!


-P

Re: getting the top N entries per hour

Posted by Peter Maas <pf...@gmail.com>.
Cool, that works like a charm.


On Jan 26, 2012, at 10:57 AM, Prashant Kommireddi wrote:

> Hi Peter,
> 
> You can use TOP
> http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/builtin/TOP.html
> 
> Thanks,
> Prashant


Re: getting the top N entries per hour

Posted by Prashant Kommireddi <pr...@gmail.com>.
Hi Peter,

You can use TOP
http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/builtin/TOP.html
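
As a rough sketch of how TOP might be wired into the script from the original post (untested against the poster's data): it assumes IP_COUNT keeps the column order generated there, so the count sits at position 3, and it uses N = 10 purely as a placeholder. The aliases TOP_IPS_PER_HOUR, TOP_IPS_VIA_ORDER, top_ips, by_count and top_n are made up for illustration. The second variant shows the nested ORDER BY / LIMIT form the question asked about.

```pig
-- Assumed columns of IP_COUNT: $0 dayNumber, $1 hourNumber, $2 ip, $3 numEntries
IP_COUNT_GROUPED = GROUP IP_COUNT BY ($0, $1);

-- Variant 1: TOP(n, column, bag) keeps the n tuples with the largest
-- value in the given (0-based) column of the bag.
TOP_IPS_PER_HOUR = FOREACH IP_COUNT_GROUPED {
    top_ips = TOP(10, 3, IP_COUNT);      -- 10 is a placeholder for N
    GENERATE FLATTEN(top_ips);           -- one row per (day, hour, ip, count)
};

-- Variant 2: ORDER BY and LIMIT inside the nested block, so they
-- operate on each group's bag rather than on the whole relation.
TOP_IPS_VIA_ORDER = FOREACH IP_COUNT_GROUPED {
    by_count = ORDER IP_COUNT BY $3 DESC;
    top_n = LIMIT by_count 10;           -- same placeholder N
    GENERATE FLATTEN(top_n);
};

DUMP TOP_IPS_PER_HOUR;
```

Because each flattened tuple carries dayNumber, hourNumber, ip and the count, both the count and the value are available in the output. Note that TOP does not promise the returned tuples in sorted order, so the ORDER BY variant may be preferable if ranking within the hour matters.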

Thanks,
Prashant
