Posted to user@pig.apache.org by Marco Cadetg <ma...@zattoo.com> on 2012/08/30 10:00:50 UTC

reduce continuous sessions

Hi there,

I have some user sessions which look something like the following:

id:chararray, start:long(unix timestamp), end:long(unix timestamp)
xxx,1,3
xxx,4,7
yyy,1,2
yyy,5,7
zzz,6,7
zzz,7,10

I would like to combine the rows which belong to a continuous session,
e.g. in my example the result should be the following:
xxx,1,7
yyy,1,2
yyy,5,7
zzz,6,10

I guess there is no way to do this directly in Pig, only by using a
UDF. Can someone give me a pointer on how you would achieve this?

Thanks,
-Marco

RE: reduce continuous sessions

Posted by Steve Bernstein <St...@deem.com>.
You might want to check out LinkedIn's DataFu contribution, particularly the "sessionize" UDF:
http://sna-projects.com/datafu/javadoc/0.0.4/datafu/pig/sessions/Sessionize.html


_____________
Steve Bernstein
VP, Analytics
Rearden Commerce, Inc.

+1.408.499.0961 Mobile

deem.com | reardencommerce.com


Re: reduce continuous sessions

Posted by Marco Cadetg <ma...@zattoo.com>.
Unfortunately it's not that simple.

A = LOAD 'comb.txt' USING PigStorage(',') AS (id:chararray, start:long, end:long);
B = FOREACH (GROUP A BY id) {
  GENERATE FLATTEN(group), MIN(A.start), MAX(A.end);
}
DUMP B;
(xxx,1,7)
(yyy,1,7)
(zzz,6,10)

This is not what I want. I only want to merge the rows / sessions if they
are continuous, i.e. the end of one session is the start of another. In my
example that is:
xxx,1,3
xxx,4,7

This is continuous because the next row starts one second (+1s) after the
first row ends.

Unlike this one, where the end of the first row is NOT followed directly by
the start of the next row:
yyy,1,2
yyy,5,7

Therefore I have to keep track of sessions somehow.

Cheers,
-Marco
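
The merge rule described above can be sketched outside of Pig as follows.
This is only a minimal illustration, not a tested Pig UDF: the names
merge_sessions and max_gap are hypothetical, and the rule "the next start is
at most max_gap seconds after the current end" is an assumption chosen so
that both 3 -> 4 (xxx) and 7 -> 7 (zzz) count as continuous while 2 -> 5
(yyy) does not. The same per-id logic could presumably be wrapped in a UDF
and applied to the bag produced by GROUP A BY id.

```python
from itertools import groupby
from operator import itemgetter


def merge_sessions(rows, max_gap=1):
    """Merge (id, start, end) rows whose sessions are continuous.

    Two sessions of the same id are treated as continuous when the next
    start is at most `max_gap` seconds after the current end. `max_gap`
    is an assumed parameter, not something defined in the thread.
    """
    merged = []
    # Sort by id, then start, so each id's sessions are seen in time order.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for session_id, group in groupby(ordered, key=itemgetter(0)):
        current_start = current_end = None
        for _, start, end in group:
            if current_start is None:
                # First session for this id.
                current_start, current_end = start, end
            elif start <= current_end + max_gap:
                # Continuous: extend the running session.
                current_end = max(current_end, end)
            else:
                # Gap too large: emit the running session, start a new one.
                merged.append((session_id, current_start, current_end))
                current_start, current_end = start, end
        merged.append((session_id, current_start, current_end))
    return merged


rows = [
    ("xxx", 1, 3), ("xxx", 4, 7),
    ("yyy", 1, 2), ("yyy", 5, 7),
    ("zzz", 6, 7), ("zzz", 7, 10),
]
for row in merge_sessions(rows):
    print(row)
```

The key point is that the rows of each id have to be ordered by start time
before they are scanned, which is the "keeping track of sessions" that a
plain MIN/MAX over the group cannot do.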


On Thu, Aug 30, 2012 at 10:07 AM, Prashant Kommireddi
<pr...@gmail.com> wrote:

> Seems like you are looking to group by "id" and get the MIN and MAX
> timestamp for each group?

Re: reduce continuous sessions

Posted by Prashant Kommireddi <pr...@gmail.com>.
Seems like you are looking to group by "id" and get the MIN and MAX
timestamp for each group?

