Posted to user@pig.apache.org by Cam Bazz <ca...@gmail.com> on 2011/01/17 02:02:47 UTC

counting unique visits per item

Hello,

I have rigged my web application so that it generates a custom
access log. Each line in my access log has the ipnumber,
sessionCookie, and idOfPage.

How can I count unique visits per idOfPage?

I followed the tutorial to write a script for calculating number of
visits per idOfPage:

raw = load '/home/cambazz/my.log' using PigStorage('\t');
rawprod = filter raw by $2=='PROD';
prod = foreach rawprod generate $0 as time, $3 as ip, $4 as session, $9 as sid;
prod_grouped = group prod by sid;
prod_hits = foreach prod_grouped generate group, COUNT($1);
dump prod_hits;

which was easy.

I now want to calculate the number of unique visits, where all visits
from the same (ip, sessionCookie) pair count as 1 per sid.

I tried various schemes, but could not quite come up with it.

Any ideas / suggestions / help greatly appreciated.


Best Regards,
C.B.

Re: counting unique visits per item

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Looks OK.
You can also use DISTINCT inside a nested foreach..generate block; I think
that's cheaper than what you are doing right now:

v = group prod by sid;
-- iterate over the grouped relation v (not prod); each group holds a bag named prod
uv = foreach v {
  visits = prod.(session, ip);
  dist = distinct visits;
  generate group as sid, COUNT(prod) as prod_cnt, COUNT(dist) as dist_cnt;
}

This should be a bit nicer if you are working with big collections, though
I've seen a sequence of group-bys wind up better because Pig got confused
and didn't push the distinct into the mappers... Try both ways.

caveat: I haven't actually run this, might have minor bugs :)
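
For reference, here is an end-to-end sketch of the nested DISTINCT approach.
It reuses the path and column positions from the original script; the
chararray casts and the alias names (prod_grouped, pairs, visitors, counts)
are assumptions added for illustration. Like the snippet above, it has not
been run against the real log.

```pig
-- Assumed layout, taken from the original script: $2 = record type,
-- $3 = ip, $4 = session cookie, $9 = page id (sid).
raw = load '/home/cambazz/my.log' using PigStorage('\t');
rawprod = filter raw by $2 == 'PROD';
prod = foreach rawprod generate (chararray)$3 as ip,
                                (chararray)$4 as session,
                                (chararray)$9 as sid;

prod_grouped = group prod by sid;
counts = foreach prod_grouped {
    pairs = prod.(session, ip);   -- keep only the fields that identify a visitor
    visitors = distinct pairs;    -- one tuple per (session, ip) within this sid
    generate group as sid,
             COUNT(prod) as hits,
             COUNT(visitors) as unique_visits;
};
dump counts;
```

If prod held (1.1.1.1, s1, A) twice, (2.2.2.2, s2, A) once and
(1.1.1.1, s1, B) once, counts should hold (A,3,2) and (B,1,1), in no
guaranteed order. Note that COUNT skips tuples whose first field is null, so
requests without a session cookie would not be counted in unique_visits;
COUNT_STAR counts them as well.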

D


Re: counting unique visits per item

Posted by Cam Bazz <ca...@gmail.com>.
Hello Dmitriy;

I did not know the flatten command, so while waiting for a reply from
the mailing list I have done:

RAW = load '/home/can/my.log' using PigStorage('\t');
PROD = foreach RAW generate $3 as ip, $4 as session, $9 as sid;
--
V = group PROD by sid;
UV = group PROD by (sid,session,ip);
--
HITS = foreach V generate group, COUNT($1);
UNQ = group UV by group.sid;
UNQVISITS = foreach UNQ generate group, COUNT($1);

which does seem to work. I was wondering if I did something very wrong
in my ignorance.
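
The same count can also be sketched with a top-level DISTINCT followed by a
single GROUP, which avoids grouping twice. The aliases UNIQ, BYSID and
UNQVISITS2 are illustrative names, not part of the script above, and this
variant is untested against the real log.

```pig
RAW = load '/home/can/my.log' using PigStorage('\t');
PROD = foreach RAW generate $3 as ip, $4 as session, $9 as sid;

-- PROD carries only (ip, session, sid), so DISTINCT leaves one tuple
-- per visitor per page.
UNIQ = distinct PROD;

-- Each remaining tuple is one unique visit; count them per page id.
BYSID = group UNIQ by sid;
UNQVISITS2 = foreach BYSID generate group as sid, COUNT(UNIQ) as unique_visits;
dump UNQVISITS2;
```

Which of the variants runs faster likely depends on the data and the Pig
version, so it is worth timing them on a real log.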

best regards,
-C.B.


Re: counting unique visits per item

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
You can group by multiple keys, so perhaps

prod_grouped = group prod by (sid, ip);
prod_hits = foreach prod_grouped generate FLATTEN(group) as (sid, ip),
COUNT($1) as prod_hit_count;
