Posted to user@pig.apache.org by Marco Cadetg <ma...@zattoo.com> on 2011/10/27 11:56:28 UTC

creating a graph over time

I have a problem and I don't know how to solve it in Pig, or whether Pig is
even suitable for it.

I have a schema like this:

student-id,student-name,start-time,duration,course
1,marco,1319708213,500,math
2,ralf,1319708111,112,english
3,greg,1319708321,333,french
4,diva,1319708444,80,english
5,susanne,1319708123,2000,math
1,marco,1319708564,500,french
2,ralf,1319708789,123,french
7,fred,1319708213,5675,french
8,laura,1319708233,123,math
10,sab,1319708999,777,math
11,fibo,1319708789,565,math
6,dan,1319708456,50,english
9,marco,1319708123,60,english
12,bo,1319708456,345,math
1,marco,1319708789,673,math
...
...

I would like to retrieve a graph (interpolation) over time grouped by
course, meaning how many students are studying each course in every
30-second interval.
The grouping by course is easy, but from there I have no clue how I would
achieve the rest. I guess the rest needs to be achieved via some UDF, or
is there another way to do this in Pig? I often think that I need a "for
loop" or something similar in Pig.

Thanks for your help!
-Marco

Re: creating a graph over time

Posted by Stan Rosenberg <sr...@proclivitysystems.com>.
Hi Guys,

Sorry for joining this discussion so late.  I would suggest using
interval trees for dealing with overlapping time intervals.
There is a fairly nice treatment of interval trees in CLR, sect. 14.3.
The data structure is essentially a red-black tree, and I surmise that one
could extend java.util.TreeMap to implement it.

Cheers,

stan
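
A full interval tree is a fair amount of code, but the overlap-counting goal
can be sketched with a plain java.util.TreeMap: record a +1 delta at each
start time and a -1 delta at each end time, and a running sum over the sorted
keys gives the number of concurrent sessions at every boundary point. This is
only a sketch (not an interval tree proper), and the class and method names
are invented for illustration:

import java.util.Map;
import java.util.TreeMap;

// Sketch only: counts overlapping sessions with a TreeMap of +1/-1 deltas.
// Not a true interval tree, but it answers "how many sessions are active?"
// at every point where that count changes.
public class OverlapCounter {

    private final TreeMap<Long, Integer> deltas = new TreeMap<>();

    public void addInterval(long start, long end) {   // end = start + duration
        deltas.merge(start, 1, Integer::sum);
        deltas.merge(end, -1, Integer::sum);
    }

    public void dumpCounts() {
        int active = 0;
        for (Map.Entry<Long, Integer> e : deltas.entrySet()) {
            active += e.getValue();                    // running sum = active sessions
            System.out.println(e.getKey() + "\t" + active);
        }
    }

    public static void main(String[] args) {
        OverlapCounter math = new OverlapCounter();
        math.addInterval(1319708213L, 1319708213L + 500);   // marco, math
        math.addInterval(1319708123L, 1319708123L + 2000);  // susanne, math
        math.addInterval(1319708233L, 1319708233L + 123);   // laura, math
        math.dumpCounts();
    }
}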


On Tue, Nov 1, 2011 at 1:44 PM, Jonathan Coveney <jc...@gmail.com> wrote:
> Okie dokie. So first, let's clarify and simplify the problem a little,
> especially to ensure that I know what is going on.
>
> Let's first just focus on a particular class. This is ok since presumably
> each class is independent. Now, we have user_id, start_time, and end_time
> (start_time+duration). If I understand correctly, a user_id should be
> included up to end_time+30s, since this is a 30s moving window. As such,
> we'll just ignore that side of things for now, because you can just
> transform people's start times accordingly. Further, the assumption is that
> for a given user_id, you will not have overlapping start and end times...you
> can have multiple entries, ie "user 1, start 1, end 3; user 1, start 5, end
> 7;" but you can't have them in this form: "user 1, start 1, end 3; user 1,
> start 2, end 4."
>
> So we have simplified the question to this: given: user_id, start_time, and
> end_time (which never overlap), how can I get a count of unique users for
> every second? So now we will design a UDF to generate that output as a bag
> of (time, # of people) pairs, for every second from min(start_time) to
> max(end_time). The UDF will accept a bag sorted on the start time. Now, as I
> write it it's going to be a simple evalfunc, but it should be an
> accumulator. It's easy to make the transition.
>
> Here is what you do. Initialize a PriorityQueue. The natural ordering for
> int and long is fine, as it will ensure that when we poll it, we'll get the
> earliest end time, which is what we want.
>
> So step one is to pull the first tuple, and get the start_time and end_time.
> The start time will set our time to start_time (which is min(start_time)
> since it was sorted on start_time), and we add the end_time to the priority
> queue. We have a counter "uniques" which we increment.
>
> Now, before we actually increment, we grab the next tuple. Why do we do this
> instead of going to the next end time? Because we don't know if someone
> starts in between now and the next end time. So we grab the tuple and get
> its start and end time. Now there are two cases.
>
> Case 1: the start time is less than the head of the priority queue, via a
> peek. If this is the case, then we can safely increment up to the start_time
> we just got, and then go from there. This is because it's impossible for
> there to be a new end_time less than the start_time we just got, because
> they are ordered by start_time and end_time>start_time. So we add the new
> end_time, and then we increment our timer until we get to the new start_time
> we just got, and add (timer,unique) at each step. When we get to start_time,
> we unique++. Now we get the next tuple and repeat.
>
> Case 2: the start time comes after the head of the priority queue, via a
> peek. If this is the case, then we need to increment up to the current head,
> emitting (timer,unique). Then when we get to the time_value equal to that
> end_time, we unique--, and check again if the start_time comes before
> the head of the priority queue. Until it does, we repeat step 2. Once it
> does, we do step 1.
>
> I've attached a crude, untested UDF that does this. Buyer beware. But it
> shows the general flow, and should be better than exploding the data (I
> really hate exploding data like that unless it's absolutely necessary).
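
As a rough sketch of the sweep just described (plain Java rather than the
attached Pig UDF; the class name and the half-open [start, end) convention
are assumptions, not part of the original post):

import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the sweep above: input is [start, end] pairs sorted by start,
// output is one (second, active-count) pair for every second from the first
// start up to the last end. Assumes half-open [start, end) sessions.
public class ConcurrencySweep {

    public static List<long[]> countPerSecond(List<long[]> sortedByStart) {
        List<long[]> out = new ArrayList<>();
        if (sortedByStart.isEmpty()) return out;

        PriorityQueue<Long> endTimes = new PriorityQueue<>(); // earliest end on top
        long timer = sortedByStart.get(0)[0];                 // min(start_time)
        int uniques = 0;

        for (long[] iv : sortedByStart) {
            long start = iv[0], end = iv[1];
            // Case 2: the next start is after the earliest pending end, so emit
            // counts up to that end and retire that session before moving on.
            while (!endTimes.isEmpty() && endTimes.peek() < start) {
                long nextEnd = endTimes.poll();
                for (; timer < nextEnd; timer++) out.add(new long[]{timer, uniques});
                uniques--;
            }
            // Case 1: emit counts up to the new start, then count the new session.
            for (; timer < start; timer++) out.add(new long[]{timer, uniques});
            uniques++;
            endTimes.add(end);
        }
        // Drain: emit counts until the remaining sessions have all ended.
        while (!endTimes.isEmpty()) {
            long nextEnd = endTimes.poll();
            for (; timer < nextEnd; timer++) out.add(new long[]{timer, uniques});
            uniques--;
        }
        return out;
    }
}

Wrapping this in an EvalFunc (or, as noted above, an Accumulator) that reads
the sorted bag and emits a bag of (time, count) tuples is then mostly plumbing.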
>
> To use, generate some data, then...
>
> register window.jar;
> define window com.jcoveney.Window('30');
> a = load 'data' using PigStorage(',') as (uid:long,start:long,end:long);
> b = foreach (group a all) {
>   ord = order a by start asc;
>   generate flatten(window(ord));
> }
> dump b;
>
> To generate data, I first did just a small subsample to think about it,
> then I did (in Python):
>
> import random
> f=open("data","w")
> for i in range(0,1000000):
>   v1=random.randint(1,10000000)
>   v2=random.randint(1,10000000)
>   start=min(v1,v2)
>   stop=max(v1,v2)
>   print >>f,"%i,%i,%i" % (i,start,stop)
>
> If this function is at all useful, I can clean it up and put it in the
> piggybank. Let me know if the logic doesn't make sense, or if it isn't quite
> what you had in mind.
>
> Jon
>
> 2011/11/1 Marco Cadetg <ma...@zattoo.com>
>>
>> Thanks again for all your comments.
>> Jonathan, would you mind enlightening me on the way you would keep track
>> of the people you need to "eject"? I don't get the min-heap-based tuple...
>> Cheers
>> -Marco
>>
>> On Mon, Oct 31, 2011 at 6:15 PM, Jonathan Coveney <jc...@gmail.com>
>> wrote:
>>>
>>> Perhaps I'm misunderstanding your use case, and this depends on the
>>> amount
>>> of data, but you could consider something like this (to avoid exploding
>>> the
>>> data, which could perhaps be unavoidable but I hate resorting to that if
>>> I
>>> don't have to).
>>>
>>> a = foreach yourdata generate student_id, start_time, start_time+duration
>>> as end_time, course;
>>> b = group a by course;
>>> c = foreach b {
>>>  ord = order a by start_time;
>>>  generate yourudf.process(ord);
>>> }
>>>
>>> Here is generally what process could do. It would be an accumulator UDF
>>> that expected tuples sorted on start_time. Now you basically need a way
>>> to
>>> know who the distinct users are. Now, since you want 30s windows, your
>>> first window will presumably be 30s after the first start_time in your
>>> data, and you would just tick ahead in 1s and write to a bag which would
>>> have second, # of distinct student_ids. To know when to eject people, you
>>> could have any number of data structures... perhaps a min heap based on
>>> end_time, and of course instead of "ticking" ahead, you would grab a new
>>> tuple (since this is the only thing that would change the state of the #
>>> of
>>> distinct ids), and then do all of the ticking ahead as you adjust the
>>> heap
>>> and write the seconds in between the current time pointer and the
>>> start_time of the new tuple, making sure in each step to check against
>>> the
>>> min heap to eject any users that expired.
>>>
>>> That was a little rambly, I could quickly put together some more
>>> reasonable
>>> pseudocode if that would help. I think the general idea is clear
>>> though...
>>>
>>> 2011/10/31 Guy Bayes <fa...@gmail.com>
>>>
>>> > ahh TV that explains it
>>> >
>>> > 12G data file is a bit too big for R unless you sample, not sure if the
>>> > use
>>> > case is conducive to sampling?
>>> >
>>> > If it is, could sample it down and structure in pig/hadoop and then
>>> > load it
>>> > into the analytical/visualization tool of choice...
>>> >
>>> > Guy
>>> >
>>> > On Mon, Oct 31, 2011 at 8:55 AM, Marco Cadetg <ma...@zattoo.com> wrote:
>>> >
>>> > > The data is not about students but about television ;) Regarding the
>>> > > size: the raw input data is about 150 MB, although when I 'explode' the
>>> > > timeseries it will be around 80x bigger. I guess the average user
>>> > > duration will be around 40 minutes, which means that sampling it at a
>>> > > 30s interval will increase the size to ~12 GB.
>>> > >
>>> > > I think that is a size which my hadoop cluster, with five 8-core x 8GB
>>> > > x 2TB-HD nodes, should be able to cope with.
>>> > >
>>> > > I don't know about R. Are you able to handle 12 GB files well in R
>>> > > (of course it depends on your computer, so assume an average business
>>> > > computer, e.g. 2-core 2GHz with 4GB RAM)?
>>> > >
>>> > > Cheers
>>> > > -Marco
>>> > >
>>> > > On Fri, Oct 28, 2011 at 5:02 PM, Guy Bayes <fa...@gmail.com>
>>> > wrote:
>>> > >
>>> > > > If it fits in R, it's trivial: draw a density plot or a histogram,
>>> > > > about three lines of R code.
>>> > > >
>>> > > > That's why I was wondering about the data volume.
>>> > > >
>>> > > > His example is students attending classes; if that is really the
>>> > > > data, it's hard to believe it's super huge?
>>> > > >
>>> > > > Guy
>>> > > >
>>> > > > On Fri, Oct 28, 2011 at 6:12 AM, Norbert Burger <
>>> > > norbert.burger@gmail.com
>>> > > > >wrote:
>>> > > >
>>> > > > > Perhaps another way to approach this problem is to visualize it
>>> > > > > geometrically.  You have a long series of class session instances,
>>> > > > > where each class session is like a 1D line segment,
>>> > > > > beginning/stopping at some start/end time.
>>> > > > >
>>> > > > > These segments naturally overlap, and I think the question you're
>>> > > asking
>>> > > > is
>>> > > > > equivalent to finding the number of overlaps at every subsegment.
>>> > > > >
>>> > > > > To answer this, you want to first break every class session into
>>> > > > > a
>>> > full
>>> > > > > list
>>> > > > > of subsegments, where a subsegment is created by "breaking" each
>>> > class
>>> > > > > session/segment into multiple parts at the start/end point of any
>>> > other
>>> > > > > class session.  You can create this full set of subsegments in
>>> > > > > one
>>> > pass
>>> > > > by
>>> > > > > comparing pairwise (CROSS) each start/end point with your
>>> > > > > original
>>> > list
>>> > > > of
>>> > > > > class sessions.
>>> > > > >
>>> > > > > Once you have the full list of "broken" segments, then a final
>>> > > > > GROUP BY/COUNT(*) will give you the number of overlaps.  Seems like
>>> > > > > this approach would be faster than the previous approach if your
>>> > > > > class sessions are very long, or there are many overlaps.
>>> > > > >
>>> > > > > Norbert
>>> > > > >
>>> > > > > On Thu, Oct 27, 2011 at 4:05 PM, Guy Bayes
>>> > > > > <fa...@gmail.com>
>>> > > > wrote:
>>> > > > >
>>> > > > > > how big is your dataset?
>>> > > > > >
>>> > > > > > On Thu, Oct 27, 2011 at 9:23 AM, Marco Cadetg
>>> > > > > > <ma...@zattoo.com>
>>> > > > wrote:
>>> > > > > >
>>> > > > > > > Thanks Bill and Norbert that seems like what I was looking
>>> > > > > > > for.
>>> > > I'm a
>>> > > > > bit
>>> > > > > > > worried about
>>> > > > > > > how much data/io this could create. But I'll see ;)
>>> > > > > > >
>>> > > > > > > Cheers
>>> > > > > > > -Marco
>>> > > > > > >
>>> > > > > > > On Thu, Oct 27, 2011 at 6:03 PM, Norbert Burger <
>>> > > > > > norbert.burger@gmail.com
>>> > > > > > > >wrote:
>>> > > > > > >
>>> > > > > > > > In case what you're looking for is an analysis over the
>>> > > > > > > > full
>>> > > > learning
>>> > > > > > > > duration, and not just the start interval, then one further
>>> > > insight
>>> > > > > is
>>> > > > > > > > that each original record can be transformed into a
>>> > > > > > > > sequence of
>>> > > > > > > > records, where the size of the sequence corresponds to the
>>> > > session
>>> > > > > > > > duration.  In other words, you can use a UDF to "explode"
>>> > > > > > > > the
>>> > > > > original
>>> > > > > > > > record:
>>> > > > > > > >
>>> > > > > > > > 1,marco,1319708213,500,math
>>> > > > > > > >
>>> > > > > > > > into:
>>> > > > > > > >
>>> > > > > > > > 1,marco,1319708190,500,math
>>> > > > > > > > 1,marco,1319708220,500,math
>>> > > > > > > > 1,marco,1319708250,500,math
>>> > > > > > > > 1,marco,1319708280,500,math
>>> > > > > > > > 1,marco,1319708310,500,math
>>> > > > > > > > 1,marco,1319708340,500,math
>>> > > > > > > > 1,marco,1319708370,500,math
>>> > > > > > > > 1,marco,1319708400,500,math
>>> > > > > > > > 1,marco,1319708430,500,math
>>> > > > > > > > 1,marco,1319708460,500,math
>>> > > > > > > > 1,marco,1319708490,500,math
>>> > > > > > > > 1,marco,1319708520,500,math
>>> > > > > > > > 1,marco,1319708550,500,math
>>> > > > > > > > 1,marco,1319708580,500,math
>>> > > > > > > > 1,marco,1319708610,500,math
>>> > > > > > > > 1,marco,1319708640,500,math
>>> > > > > > > > 1,marco,1319708670,500,math
>>> > > > > > > > 1,marco,1319708700,500,math
>>> > > > > > > >
>>> > > > > > > > and then use Bill's suggestion to group by course,
>>> > > > > > > > interval.
>>> > > > > > > >
>>> > > > > > > > Norbert
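
In plain Java, the "explode" step above is just a loop over 30-second buckets.
A rough sketch (hypothetical helper, assuming Unix-second timestamps and the
same inclusive bucket boundaries as the expansion shown above):

import java.util.ArrayList;
import java.util.List;

// Sketch: expand one (start, duration) record into the 30s buckets it touches,
// matching the expansion shown above (first bucket rounded down to a multiple
// of 30, last bucket <= start + duration). A Pig UDF would emit one tuple per
// bucket, and the script would then GROUP BY (course, bucket) and COUNT.
public final class ExplodeTo30s {

    public static List<Long> buckets(long start, long duration) {
        List<Long> out = new ArrayList<>();
        long end = start + duration;
        for (long t = (start / 30) * 30; t <= end; t += 30) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        // 1,marco,1319708213,500,math -> 1319708190 ... 1319708700
        System.out.println(buckets(1319708213L, 500));
    }
}

Bill's rounding suggestion quoted just below is the single-bucket version of
the same arithmetic: (time / 30) * 30.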
>>> > > > > > > >
>>> > > > > > > > On Thu, Oct 27, 2011 at 11:05 AM, Bill Graham <
>>> > > > billgraham@gmail.com>
>>> > > > > > > > wrote:
>>> > > > > > > > > You can pass your time to a udf that rounds it down to
>>> > > > > > > > > the
>>> > > > nearest
>>> > > > > 30
>>> > > > > > > > second
>>> > > > > > > > > interval and then group by course, interval to get counts
>>> > > > > > > > > for
>>> > > > each
>>> > > > > > > > course,
>>> > > > > > > > > interval.
>>> > > > > > > > >
>>> > > > > > > > > On Thursday, October 27, 2011, Marco Cadetg <
>>> > marco@zattoo.com>
>>> > > > > > wrote:
>>> > > > > > > > >> I have a problem where I don't know how or if pig is
>>> > > > > > > > >> even
>>> > > > suitable
>>> > > > > > to
>>> > > > > > > > > solve
>>> > > > > > > > >> it.
>>> > > > > > > > >>
>>> > > > > > > > >> I have a schema like this:
>>> > > > > > > > >>
>>> > > > > > > > >> student-id,student-name,start-time,duration,course
>>> > > > > > > > >> 1,marco,1319708213,500,math
>>> > > > > > > > >> 2,ralf,1319708111,112,english
>>> > > > > > > > >> 3,greg,1319708321,333,french
>>> > > > > > > > >> 4,diva,1319708444,80,english
>>> > > > > > > > >> 5,susanne,1319708123,2000,math
>>> > > > > > > > >> 1,marco,1319708564,500,french
>>> > > > > > > > >> 2,ralf,1319708789,123,french
>>> > > > > > > > >> 7,fred,1319708213,5675,french
>>> > > > > > > > >> 8,laura,1319708233,123,math
>>> > > > > > > > >> 10,sab,1319708999,777,math
>>> > > > > > > > >> 11,fibo,1319708789,565,math
>>> > > > > > > > >> 6,dan,1319708456,50,english
>>> > > > > > > > >> 9,marco,1319708123,60,english
>>> > > > > > > > >> 12,bo,1319708456,345,math
>>> > > > > > > > >> 1,marco,1319708789,673,math
>>> > > > > > > > >> ...
>>> > > > > > > > >> ...
>>> > > > > > > > >>
>>> > > > > > > > >> I would like to retrieve a graph (interpolation) over
>>> > > > > > > > >> time
>>> > > > grouped
>>> > > > > > by
>>> > > > > > > > >> course. Meaning how many students are learning for a
>>> > > > > > > > >> course
>>> > > > based
>>> > > > > on
>>> > > > > > a
>>> > > > > > > > 30
>>> > > > > > > > >> sec interval.
>>> > > > > > > > >> The grouping by course is easy but from there I've no
>>> > > > > > > > >> clue
>>> > > how I
>>> > > > > > would
>>> > > > > > > > >> achieve the rest. I guess the rest needs to be achieved
>>> > > > > > > > >> via
>>> > > some
>>> > > > > UDF
>>> > > > > > > > >> or is there any way to do this in pig? I often think
>>> > > > > > > > >> that I
>>> > > > need
>>> > > > > a
>>> > > > > > > "for
>>> > > > > > > > >> loop" or something similar in pig.
>>> > > > > > > > >>
>>> > > > > > > > >> Thanks for your help!
>>> > > > > > > > >> -Marco
>>> > > > > > > > >>
>>> > > > > > > > >
>>> > > > > > > >
>>> > > > > > >
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>
>
>

Re: creating a graph over time

Posted by Jonathan Coveney <jc...@gmail.com>.
Just a heads up: I have a cleaner version (with tests!) here:
https://issues.apache.org/jira/browse/PIG-2364

If you're still using this, I strongly suggest switching to the new version.

2011/11/4 Marco Cadetg <ma...@zattoo.com>

> Yeha, that is awesome. Thank you very much Jonathan.
> -Marco
>
> On Wed, Nov 2, 2011 at 7:52 PM, Jonathan Coveney <jc...@gmail.com>
> wrote:
>
> > I'll make it less hideous and submit a patch this weekend, then :)
> >
> > 2011/11/2 Ashutosh Chauhan <ha...@apache.org>
> >
> > > Hey Jon,
> > >
> > > Your windowing udf will be very useful outside of this particular
> > usecase.
> > > It will be great if you can contribute it to PiggyBank.
> > >
> > > Thanks,
> > > Ashutosh
> > >
> > > On Tue, Nov 1, 2011 at 10:44, Jonathan Coveney <jc...@gmail.com>
> > wrote:
> > >
> > > > Okie dokie. So first, let's clarify and simplify the problem a
> little,
> > > > especially to ensure that I know what is going on.
> > > >
> > > > Let's first just focus on a particular class. This is ok since
> > presumably
> > > > each class is independent. Now, we have user_id, start_time, and
> > end_time
> > > > (start_time+duration). If I understand correctly, a user_id should be
> > > > included up to end_time+30s, since this is a 30s moving window. As
> > such,
> > > > we'll just ignore that side of things for now, because you can just
> > > > transform people's start times accordingly. Further, the assumption
> is
> > > that
> > > > for a given user_id, you will not have overlapping start and end
> > > > times...you can have multiple entries, ie "user 1, start 1, end 3;
> user
> > > 1,
> > > > start 5, end 7;" but you can't have them in this form: "user 1, start
> > 1,
> > > > end 3; user 1, start 2, end 4."
> > > >
> > > > So we have simplified the question to this: given: user_id,
> start_time,
> > > > and end_time (which never overlap), how can I get a count of unique
> > users
> > > > for every second? So now we will design a UDF to generate that output
> > as
> > > a
> > > > bag of (time, # of people) pairs, for every second from
> min(start_time)
> > > to
> > > > max(end_time). The UDF will accept a bag sorted on the start time.
> Now,
> > > as
> > > > I write it it's going to be a simple evalfunc, but it should be an
> > > > accumulator. It's easy to make the transition.
> > > >
> > > > Here is what you do. Initialize a PriorityQueue. The natural ordering
> > for
> > > > int and long is fine, as it will ensure that when we poll it, we'll
> get
> > > the
> > > > earliest end time, which is what we want.
> > > >
> > > > So step one is to pull the first tuple, and get the start_time and
> > > > end_time. The start time will set our time to start_time (which is
> > > > min(start_time) since it was sorted on start_time), and we add the
> > > end_time
> > > > to the priority queue. We have a counter "uniques" which we
> increment.
> > > >
> > > > Now, before we actually do increment, we grab the next tuple. Why do
> > you
> > > > do this instead of go to the next end time? Because we don't know if
> > > > someone starts in between now and the next end time. So we grab the
> > tuple
> > > > and get its start and end time. Now there are two cases.
> > > >
> > > > Case 1: the start time is less than the head of the priority queue,
> > via a
> > > > peek. If this is the case, then we can safely increment up to the
> > > > start_time we just got, and then go from there. This is because it's
> > > > impossible for there to be a new end_time less than the start_time we
> > > just
> > > > got, because they are ordered by start_time and end_time>start_time.
> So
> > > we
> > > > add the new end_time, and then we increment our timer until we get to
> > the
> > > > new start_time we just got, and add (timer,unique) at each step. When
> > we
> > > > get to start_time, we unique++. Now we get the next tuple and repeat.
> > > >
> > > > Case 2: the start time comes after the head of the priority queue,
> via
> > a
> > > > peek. If this is the case, then we need to increment up to the
> current
> > > > head, emitting (timer,unique). Then when we get to the time_value
> equal
> > > to
> > > > that end_time, we unique--, and check again if the start_time comes
> > > before
> > > > the head of the priority queue. Until it does, we repeat step 2.
> > > Once
> > > > it does, we do step 1.
> > > >
> > > > I've attached a crude, untested UDF that does this. Buyer beware. But
> > it
> > > > shows the general flow, and should be better than exploding the data
> (I
> > > > really hate exploding data like that unless it's absolutely
> necessary).
> > > >
> > > > To use, generate some data, then...
> > > >
> > > > register window.jar;
> > > > define window com.jcoveney.Window('30');
> > > > a = load 'data' using PigStorage(',') as
> > (uid:long,start:long,end:long);
> > > > b = foreach (group a all) {
> > > >   ord = order a by start asc;
> > > >   generate flatten(window(ord));
> > > > }
> > > > dump b;
> > > >
> > > > to generate data, I first did just  a small subsample just to think
> > about
> > > > it, then I did (in python)
> > > >
> > > > import random
> > > > f=open("data","w")
> > > > for i in range(0,1000000):
> > > >   v1=random.randint(1,10000000)
> > > >   v2=random.randint(1,10000000)
> > > >   start=min(v1,v2)
> > > >   stop=max(v1,v2)
> > > >   print >>f,"%i,%i,%i" % (i,start,stop)
> > > >
> > > > If this function is at all useful, I can clean it up and put it in
> the
> > > > piggybank. Let me know if the logic doesn't make sense, or if it
> isn't
> > > > quite what you had in mind.
> > > >
> > > > Jon
> > > >
> > > >
> > > > 2011/11/1 Marco Cadetg <ma...@zattoo.com>
> > > >
> > > >> Thanks again for all your comments.
> > > >>
> > > >> Jonathan, would you mind enlightening me on the way you would keep
> > track
> > > >> of the
> > > >> people you need to "eject". I don't get the min heap based tuple...
> > > >>
> > > >> Cheers
> > > >> -Marco
> > > >>
> > > >>
> > > >> On Mon, Oct 31, 2011 at 6:15 PM, Jonathan Coveney <
> jcoveney@gmail.com
> > > >wrote:
> > > >>
> > > >>> Perhaps I'm misunderstanding your use case, and this depends on the
> > > >>> amount
> > > >>> of data, but you could consider something like this (to avoid
> > exploding
> > > >>> the
> > > >>> data, which could perhaps be unavoidable but I hate resorting to
> that
> > > if
> > > >>> I
> > > >>> don't have to).
> > > >>>
> > > >>> a = foreach yourdata generate student_id, start_time,
> > > start_time+duration
> > > >>> as end_time, course;
> > > >>> b = group a by course;
> > > >>> c = foreach b {
> > > >>>  ord = order a by start_time;
> > > >>>  generate yourudf.process(ord);
> > > >>> }
> > > >>>
> > > >>> Here is generally what process could do. It would be an accumulator
> > UDF
> > > >>> that expected tuples sorted on start_time. Now you basically need a
> > way
> > > >>> to
> > > >>> know who the distinct users are. Now, since you want 30s windows,
> > your
> > > >>> first window will presumably be 30s after the first start_time in
> > your
> > > >>> data, and you would just tick ahead in 1s and write to a bag which
> > > would
> > > >>> have second, # of distinct student_ids. To know when to eject
> people,
> > > you
> > > >>> could have any number of data structures... perhaps a min heap
> based
> > on
> > > >>> end_time, and of course instead of "ticking" ahead, you would grab
> a
> > > new
> > > >>> tuple (since this is the only thing that would change the state of
> > the
> > > #
> > > >>> of
> > > >>> distinct ids), and then do all of the ticking ahead as you adjust
> the
> > > >>> heap
> > > >>> and write the seconds in between the current time pointer and the
> > > >>> start_time of the new tuple, making sure in each step to check
> > against
> > > >>> the
> > > >>> min heap to eject any users that expired.
> > > >>>
> > > >>> That was a little rambly, I could quickly put together some more
> > > >>> reasonable
> > > >>> pseudocode if that would help. I think the general idea is clear
> > > >>> though...
> > > >>>
> > > >>> 2011/10/31 Guy Bayes <fa...@gmail.com>
> > > >>>
> > > >>> > ahh TV that explains it
> > > >>> >
> > > >>> > 12G data file is a bit too big for R unless you sample, not sure
> if
> > > >>> the use
> > > >>> > case is conducive to sampling?
> > > >>> >
> > > >>> > If it is, could sample it down and structure in pig/hadoop and
> then
> > > >>> load it
> > > >>> > into the analytical/visualization tool of choice...
> > > >>> >
> > > >>> > Guy
> > > >>> >
> > > >>> > On Mon, Oct 31, 2011 at 8:55 AM, Marco Cadetg <ma...@zattoo.com>
> > > >>> wrote:
> > > >>> >
> > > >>> > > The data is not about students but about television ;)
> Regarding
> > > the
> > > >>> > size.
> > > >>> > > The raw input data size is about 150m although when I 'explode'
> > the
> > > >>> > > timeseries
> > > >>> > > it will be around 80x bigger. I guess the average user duration
> > > will
> > > >>> be
> > > >>> > > around
> > > >>> > > 40 Minutes which means when sampling it at a 30s interval will
> > > >>> increase
> > > >>> > the
> > > >>> > > size by ~12GB.
> > > >>> > >
> > > >>> > > I think that is a size which my hadoop cluster with five
> 8-core x
> > > >>> 8GB x
> > > >>> > 2TB
> > > >>> > > HD
> > > >>> > > should be able to cope with.
> > > >>> > >
> > > >>> > > I don't know about R. Are you able to handle 12Gb
> > > >>> > > files well in R (of course it depends on your computer so
> assume
> > > an
> > > >>> > > average business computer e.g. 2-core 2GHz 4GB ram)?
> > > >>> > >
> > > >>> > > Cheers
> > > >>> > > -Marco
> > > >>> > >
> > > >>> > > On Fri, Oct 28, 2011 at 5:02 PM, Guy Bayes <
> > fatal.error@gmail.com>
> > > >>> > wrote:
> > > >>> > >
> > > >>> > > > if it fits in R, it's trivial, draw a density plot or a
> > > histogram,
> > > >>> > about
> > > >>> > > > three lines of R code
> > > >>> > > >
> > > >>> > > > why I was wondering about the data volume.
> > > >>> > > >
> > > >>> > > > His example is students attending classes, if  that is really
> > the
> > > >>> data
> > > >>> > > hard
> > > >>> > > > to believe it's super huge?
> > > >>> > > >
> > > >>> > > > Guy
> > > >>> > > >
> > > >>> > > > On Fri, Oct 28, 2011 at 6:12 AM, Norbert Burger <
> > > >>> > > norbert.burger@gmail.com
> > > >>> > > > >wrote:
> > > >>> > > >
> > > >>> > > > > Perhaps another way to approach this problem is to
> visualize
> > it
> > > >>> > > > > geometrically.  You have a long series of class session
> > > >>> instances,
> > > >>> > > where
> > > >>> > > > > each class session is like 1D line segment,
> > beginning/stopping
> > > at
> > > >>> > some
> > > >>> > > > > start/end time.
> > > >>> > > > >
> > > >>> > > > > These segments naturally overlap, and I think the question
> > > you're
> > > >>> > > asking
> > > >>> > > > is
> > > >>> > > > > equivalent to finding the number of overlaps at every
> > > subsegment.
> > > >>> > > > >
> > > >>> > > > > To answer this, you want to first break every class session
> > > into
> > > >>> a
> > > >>> > full
> > > >>> > > > > list
> > > >>> > > > > of subsegments, where a subsegment is created by "breaking"
> > > each
> > > >>> > class
> > > >>> > > > > session/segment into multiple parts at the start/end point
> of
> > > any
> > > >>> > other
> > > >>> > > > > class session.  You can create this full set of subsegments
> > in
> > > >>> one
> > > >>> > pass
> > > >>> > > > by
> > > >>> > > > > comparing pairwise (CROSS) each start/end point with your
> > > >>> original
> > > >>> > list
> > > >>> > > > of
> > > >>> > > > > class sessions.
> > > >>> > > > >
> > > >>> > > > > Once you have the full list of "broken" segments, then a
> > final
> > > >>> GROUP
> > > >>> > > > > BY/COUNT(*) will you give you the number of overlaps.
>  Seems
> > > like
> > > >>> > > > approach
> > > >>> > > > > would be faster than the previous approach if your class
> > > >>> sessions are
> > > >>> > > > very
> > > >>> > > > > long, or there are many overlaps.
> > > >>> > > > >
> > > >>> > > > > Norbert
> > > >>> > > > >
> > > >>> > > > > On Thu, Oct 27, 2011 at 4:05 PM, Guy Bayes <
> > > >>> fatal.error@gmail.com>
> > > >>> > > > wrote:
> > > >>> > > > >
> > > >>> > > > > > how big is your dataset?
> > > >>> > > > > >
> > > >>> > > > > > On Thu, Oct 27, 2011 at 9:23 AM, Marco Cadetg <
> > > >>> marco@zattoo.com>
> > > >>> > > > wrote:
> > > >>> > > > > >
> > > >>> > > > > > > Thanks Bill and Norbert that seems like what I was
> > looking
> > > >>> for.
> > > >>> > > I'm a
> > > >>> > > > > bit
> > > >>> > > > > > > worried about
> > > >>> > > > > > > how much data/io this could create. But I'll see ;)
> > > >>> > > > > > >
> > > >>> > > > > > > Cheers
> > > >>> > > > > > > -Marco
> > > >>> > > > > > >
> > > >>> > > > > > > On Thu, Oct 27, 2011 at 6:03 PM, Norbert Burger <
> > > >>> > > > > > norbert.burger@gmail.com
> > > >>> > > > > > > >wrote:
> > > >>> > > > > > >
> > > >>> > > > > > > > In case what you're looking for is an analysis over
> the
> > > >>> full
> > > >>> > > > learning
> > > >>> > > > > > > > duration, and not just the start interval, then one
> > > further
> > > >>> > > insight
> > > >>> > > > > is
> > > >>> > > > > > > > that each original record can be transformed into a
> > > >>> sequence of
> > > >>> > > > > > > > records, where the size of the sequence corresponds
> to
> > > the
> > > >>> > > session
> > > >>> > > > > > > > duration.  In other words, you can use a UDF to
> > "explode"
> > > >>> the
> > > >>> > > > > original
> > > >>> > > > > > > > record:
> > > >>> > > > > > > >
> > > >>> > > > > > > > 1,marco,1319708213,500,math
> > > >>> > > > > > > >
> > > >>> > > > > > > > into:
> > > >>> > > > > > > >
> > > >>> > > > > > > > 1,marco,1319708190,500,math
> > > >>> > > > > > > > 1,marco,1319708220,500,math
> > > >>> > > > > > > > 1,marco,1319708250,500,math
> > > >>> > > > > > > > 1,marco,1319708280,500,math
> > > >>> > > > > > > > 1,marco,1319708310,500,math
> > > >>> > > > > > > > 1,marco,1319708340,500,math
> > > >>> > > > > > > > 1,marco,1319708370,500,math
> > > >>> > > > > > > > 1,marco,1319708400,500,math
> > > >>> > > > > > > > 1,marco,1319708430,500,math
> > > >>> > > > > > > > 1,marco,1319708460,500,math
> > > >>> > > > > > > > 1,marco,1319708490,500,math
> > > >>> > > > > > > > 1,marco,1319708520,500,math
> > > >>> > > > > > > > 1,marco,1319708550,500,math
> > > >>> > > > > > > > 1,marco,1319708580,500,math
> > > >>> > > > > > > > 1,marco,1319708610,500,math
> > > >>> > > > > > > > 1,marco,1319708640,500,math
> > > >>> > > > > > > > 1,marco,1319708670,500,math
> > > >>> > > > > > > > 1,marco,1319708700,500,math
> > > >>> > > > > > > >
> > > >>> > > > > > > > and then use Bill's suggestion to group by course,
> > > >>> interval.
> > > >>> > > > > > > >
> > > >>> > > > > > > > Norbert
> > > >>> > > > > > > >
> > > >>> > > > > > > > On Thu, Oct 27, 2011 at 11:05 AM, Bill Graham <
> > > >>> > > > billgraham@gmail.com>
> > > >>> > > > > > > > wrote:
> > > >>> > > > > > > > > You can pass your time to a udf that rounds it down
> > to
> > > >>> the
> > > >>> > > > nearest
> > > >>> > > > > 30
> > > >>> > > > > > > > second
> > > >>> > > > > > > > > interval and then group by course, interval to get
> > > >>> counts for
> > > >>> > > > each
> > > >>> > > > > > > > course,
> > > >>> > > > > > > > > interval.
> > > >>> > > > > > > > >
> > > >>> > > > > > > > > On Thursday, October 27, 2011, Marco Cadetg <
> > > >>> > marco@zattoo.com>
> > > >>> > > > > > wrote:
> > > >>> > > > > > > > >> I have a problem where I don't know how or if pig
> is
> > > >>> even
> > > >>> > > > suitable
> > > >>> > > > > > to
> > > >>> > > > > > > > > solve
> > > >>> > > > > > > > >> it.
> > > >>> > > > > > > > >>
> > > >>> > > > > > > > >> I have a schema like this:
> > > >>> > > > > > > > >>
> > > >>> > > > > > > > >> student-id,student-name,start-time,duration,course
> > > >>> > > > > > > > >> 1,marco,1319708213,500,math
> > > >>> > > > > > > > >> 2,ralf,1319708111,112,english
> > > >>> > > > > > > > >> 3,greg,1319708321,333,french
> > > >>> > > > > > > > >> 4,diva,1319708444,80,english
> > > >>> > > > > > > > >> 5,susanne,1319708123,2000,math
> > > >>> > > > > > > > >> 1,marco,1319708564,500,french
> > > >>> > > > > > > > >> 2,ralf,1319708789,123,french
> > > >>> > > > > > > > >> 7,fred,1319708213,5675,french
> > > >>> > > > > > > > >> 8,laura,1319708233,123,math
> > > >>> > > > > > > > >> 10,sab,1319708999,777,math
> > > >>> > > > > > > > >> 11,fibo,1319708789,565,math
> > > >>> > > > > > > > >> 6,dan,1319708456,50,english
> > > >>> > > > > > > > >> 9,marco,1319708123,60,english
> > > >>> > > > > > > > >> 12,bo,1319708456,345,math
> > > >>> > > > > > > > >> 1,marco,1319708789,673,math
> > > >>> > > > > > > > >> ...
> > > >>> > > > > > > > >> ...
> > > >>> > > > > > > > >>
> > > >>> > > > > > > > >> I would like to retrieve a graph (interpolation)
> > over
> > > >>> time
> > > >>> > > > grouped
> > > >>> > > > > > by
> > > >>> > > > > > > > >> course. Meaning how many students are learning
> for a
> > > >>> course
> > > >>> > > > based
> > > >>> > > > > on
> > > >>> > > > > > a
> > > >>> > > > > > > > 30
> > > >>> > > > > > > > >> sec interval.
> > > >>> > > > > > > > >> The grouping by course is easy but from there I've
> > no
> > > >>> clue
> > > >>> > > how I
> > > >>> > > > > > would
> > > >>> > > > > > > > >> achieve the rest. I guess the rest needs to be
> > > achieved
> > > >>> via
> > > >>> > > some
> > > >>> > > > > UDF
> > >>> > > > > > > > >> or is there any way to do this in pig? I often
> > think
> > > >>> that I
> > > >>> > > > need
> > > >>> > > > > a
> > > >>> > > > > > > "for
> > > >>> > > > > > > > >> loop" or something similar in pig.
> > > >>> > > > > > > > >>
> > > >>> > > > > > > > >> Thanks for your help!
> > > >>> > > > > > > > >> -Marco
> > > >>> > > > > > > > >>
> > > >>> > > > > > > > >
> > > >>> > > > > > > >
> > > >>> > > > > > >
> > > >>> > > > > >
> > > >>> > > > >
> > > >>> > > >
> > > >>> > >
> > > >>> >
> > > >>>
> > > >>
> > > >>
> > > >
> > >
> >
>

Re: creating a graph over time

Posted by Marco Cadetg <ma...@zattoo.com>.
Yeha, that is awesome. Thank you very much Jonathan.
-Marco

On Wed, Nov 2, 2011 at 7:52 PM, Jonathan Coveney <jc...@gmail.com> wrote:

> I'll make it less hideous and submit a patch this weekend, then :)
>
> 2011/11/2 Ashutosh Chauhan <ha...@apache.org>
>
> > Hey Jon,
> >
> > Your windowing udf will be very useful outside of this particular
> usecase.
> > It will be great if you can contribute it to PiggyBank.
> >
> > Thanks,
> > Ashutosh
> >
> > On Tue, Nov 1, 2011 at 10:44, Jonathan Coveney <jc...@gmail.com>
> wrote:
> >
> > > Okie dokie. So first, let's clarify and simplify the problem a little,
> > > especially to ensure that I know what is going on.
> > >
> > > Let's first just focus on a particular class. This is ok since
> presumably
> > > each class is independent. Now, we have user_id, start_time, and
> end_time
> > > (start_time+duration). If I understand correctly, a user_id should be
> > > included up to end_time+30s, since this is a 30s moving window. As
> such,
> > > we'll just ignore that side of things for now, because you can just
> > > transform people's start times accordingly. Further, the assumption is
> > that
> > > for a given user_id, you will not have overlapping start and end
> > > times...you can have multiple entries, ie "user 1, start 1, end 3; user
> > 1,
> > > start 5, end 7;" but you can't have them in this form: "user 1, start
> 1,
> > > end 3; user 1, start 2, end 4."
> > >
> > > So we have simplified the question to this: given: user_id, start_time,
> > > and end_time (which never overlap), how can I get a count of unique
> users
> > > for every second? So now we will design a UDF to generate that output
> as
> > a
> > > bag of (time, # of people) pairs, for every second from min(start_time)
> > to
> > > max(end_time). The UDF will accept a bag sorted on the start time. Now,
> > as
> > > I write it it's going to be a simple evalfunc, but it should be an
> > > accumulator. It's easy to make the transition.
> > >
> > > Here is what you do. Initialize a PriorityQueue. The natural ordering
> for
> > > int and long is fine, as it will ensure that when we poll it, we'll get
> > the
> > > earliest end time, which is what we want.
> > >
> > > So step one is to pull the first tuple, and get the start_time and
> > > end_time. The start time will set our time to start_time (which is
> > > min(start_time) since it was sorted on start_time), and we add the
> > end_time
> > > to the priority queue. We have a counter "uniques" which we increment.
> > >
> > > Now, before we actually do increment, we grab the next tuple. Why do
> you
> > > do this instead of go to the next end time? Because we don't know if
> > > someone starts in between now and the next end time. So we grab the
> tuple
> > > and get its start and end time. Now there are two cases.
> > >
> > > Case 1: the start time is less than the head of the priority queue,
> via a
> > > peek. If this is the case, then we can safely increment up to the
> > > start_time we just got, and then go from there. This is because it's
> > > impossible for there to be a new end_time less than the start_time we
> > just
> > > got, because they are ordered by start_time and end_time>start_time. So
> > we
> > > add the new end_time, and then we increment our timer until we get to
> the
> > > new start_time we just got, and add (timer,unique) at each step. When
> we
> > > get to start_time, we unique++. Now we get the next tuple and repeat.
> > >
> > > Case 2: the start time comes after the head of the priority queue, via
> a
> > > peek. If this is the case, then we need to increment up to the current
> > > head, emitting (timer,unique). Then when we get to the time_value equal
> > to
> > > that end_time, we unique--, and check again if the start_time comes
> > before
> > > the head of the priority queue. Until it does, we repeat step 2.
> > Once
> > > it does, we do step 1.
> > >
> > > I've attached a crude, untested UDF that does this. Buyer beware. But
> it
> > > shows the general flow, and should be better than exploding the data (I
> > > really hate exploding data like that unless it's absolutely necessary).
> > >
> > > To use, generate some data, then...
> > >
> > > register window.jar;
> > > define window com.jcoveney.Window('30');
> > > a = load 'data' using PigStorage(',') as
> (uid:long,start:long,end:long);
> > > b = foreach (group a all) {
> > >   ord = order a by start asc;
> > >   generate flatten(window(ord));
> > > }
> > > dump b;
> > >
> > > to generate data, I first did just  a small subsample just to think
> about
> > > it, then I did (in python)
> > >
> > > import random
> > > f=open("data","w")
> > > for i in range(0,1000000):
> > >   v1=random.randint(1,10000000)
> > >   v2=random.randint(1,10000000)
> > >   start=min(v1,v2)
> > >   stop=max(v1,v2)
> > >   print >>f,"%i,%i,%i" % (i,start,stop)
> > >
> > > If this function is at all useful, I can clean it up and put it in the
> > > piggybank. Let me know if the logic doesn't make sense, or if it isn't
> > > quite what you had in mind.
> > >
> > > Jon
> > >
> > >
> > > 2011/11/1 Marco Cadetg <ma...@zattoo.com>
> > >
> > >> Thanks again for all your comments.
> > >>
> > >> Jonathan, would you mind enlightening me on the way you would keep
> track
> > >> of the
> > >> people you need to "eject". I don't get the min heap based tuple...
> > >>
> > >> Cheers
> > >> -Marco
> > >>
> > >>
> > >> On Mon, Oct 31, 2011 at 6:15 PM, Jonathan Coveney <jcoveney@gmail.com
> > >wrote:
> > >>
> > >>> Perhaps I'm misunderstanding your use case, and this depends on the
> > >>> amount
> > >>> of data, but you could consider something like this (to avoid
> exploding
> > >>> the
> > >>> data, which could perhaps be unavoidable but I hate resorting to that
> > if
> > >>> I
> > >>> don't have to).
> > >>>
> > >>> a = foreach yourdata generate student_id, start_time,
> > start_time+duration
> > >>> as end_time, course;
> > >>> b = group a by course;
> > >>> c = foreach b {
> > >>>  ord = order a by start_time;
> > >>>  generate yourudf.process(ord);
> > >>> }
> > >>>
> > >>> Here is generally what process could do. It would be an accumulator
> UDF
> > >>> that expected tuples sorted on start_time. Now you basically need a
> way
> > >>> to
> > >>> know who the distinct users are. Now, since you want 30s windows,
> your
> > >>> first window will presumably be 30s after the first start_time in
> your
> > >>> data, and you would just tick ahead in 1s and write to a bag which
> > would
> > >>> have second, # of distinct student_ids. To know when to eject people,
> > you
> > >>> could have any number of data structures... perhaps a min heap based
> on
> > >>> end_time, and of course instead of "ticking" ahead, you would grab a
> > new
> > >>> tuple (since this is the only thing that would change the state of
> the
> > #
> > >>> of
> > >>> distinct ids), and then do all of the ticking ahead as you adjust the
> > >>> heap
> > >>> and write the seconds in between the current time pointer and the
> > >>> start_time of the new tuple, making sure in each step to check
> against
> > >>> the
> > >>> min heap to eject any users that expired.
> > >>>
> > >>> That was a little rambly, I could quickly put together some more
> > >>> reasonable
> > >>> pseudocode if that would help. I think the general idea is clear
> > >>> though...
> > >>>
> > >>> 2011/10/31 Guy Bayes <fa...@gmail.com>
> > >>>
> > >>> > ahh TV that explains it
> > >>> >
> > >>> > 12G data file is a bit too big for R unless you sample, not sure if
> > >>> the use
> > >>> > case is conducive to sampling?
> > >>> >
> > >>> > If it is, could sample it down and structure in pig/hadoop and then
> > >>> load it
> > >>> > into the analytical/visualization tool of choice...
> > >>> >
> > >>> > Guy
> > >>> >
> > >>> > On Mon, Oct 31, 2011 at 8:55 AM, Marco Cadetg <ma...@zattoo.com>
> > >>> wrote:
> > >>> >
> > >>> > > The data is not about students but about television ;) Regarding
> > the
> > >>> > size.
> > >>> > > The raw input data size is about 150m although when I 'explode'
> the
> > >>> > > timeseries
> > >>> > > it will be around 80x bigger. I guess the average user duration
> > will
> > >>> be
> > >>> > > around
> > >>> > > 40 Minutes which means when sampling it at a 30s interval will
> > >>> increase
> > >>> > the
> > >>> > > size by ~12GB.
> > >>> > >
> > >>> > > I think that is a size which my hadoop cluster with five 8-core x
> > >>> 8GB x
> > >>> > 2TB
> > >>> > > HD
> > >>> > > should be able to cope with.
> > >>> > >
> > >>> > > I don't know about R. Are you able to handle 12Gb
> > >>> > > files well in R (of course it depends on your computer so assume
> > an
> > >>> > > average business computer e.g. 2-core 2GHz 4GB ram)?
> > >>> > >
> > >>> > > Cheers
> > >>> > > -Marco
> > >>> > >
> > >>> > > On Fri, Oct 28, 2011 at 5:02 PM, Guy Bayes <
> fatal.error@gmail.com>
> > >>> > wrote:
> > >>> > >
> > >>> > > > if it fits in R, it's trivial, draw a density plot or a
> > histogram,
> > >>> > about
> > >>> > > > three lines of R code
> > >>> > > >
> > >>> > > > why I was wondering about the data volume.
> > >>> > > >
> > >>> > > > His example is students attending classes, if  that is really
> the
> > >>> data
> > >>> > > hard
> > >>> > > > to believe it's super huge?
> > >>> > > >
> > >>> > > > Guy
> > >>> > > >
> > >>> > > > On Fri, Oct 28, 2011 at 6:12 AM, Norbert Burger <
> > >>> > > norbert.burger@gmail.com
> > >>> > > > >wrote:
> > >>> > > >
> > >>> > > > > Perhaps another way to approach this problem is to visualize
> it
> > >>> > > > > geometrically.  You have a long series of class session
> > >>> instances,
> > >>> > > where
> > >>> > > > > each class session is like 1D line segment,
> beginning/stopping
> > at
> > >>> > some
> > >>> > > > > start/end time.
> > >>> > > > >
> > >>> > > > > These segments naturally overlap, and I think the question
> > you're
> > >>> > > asking
> > >>> > > > is
> > >>> > > > > equivalent to finding the number of overlaps at every
> > subsegment.
> > >>> > > > >
> > >>> > > > > To answer this, you want to first break every class session
> > into
> > >>> a
> > >>> > full
> > >>> > > > > list
> > >>> > > > > of subsegments, where a subsegment is created by "breaking"
> > each
> > >>> > class
> > >>> > > > > session/segment into multiple parts at the start/end point of
> > any
> > >>> > other
> > >>> > > > > class session.  You can create this full set of subsegments
> in
> > >>> one
> > >>> > pass
> > >>> > > > by
> > >>> > > > > comparing pairwise (CROSS) each start/end point with your
> > >>> original
> > >>> > list
> > >>> > > > of
> > >>> > > > > class sessions.
> > >>> > > > >
> > >>> > > > > Once you have the full list of "broken" segments, then a
> final
> > >>> GROUP
> > >>> > > > > BY/COUNT(*) will you give you the number of overlaps.  Seems
> > like
> > >>> > > > approach
> > >>> > > > > would be faster than the previous approach if your class
> > >>> sessions are
> > >>> > > > very
> > >>> > > > > long, or there are many overlaps.
> > >>> > > > >
> > >>> > > > > Norbert
> > >>> > > > >
> > >>> > > > > On Thu, Oct 27, 2011 at 4:05 PM, Guy Bayes <
> > >>> fatal.error@gmail.com>
> > >>> > > > wrote:
> > >>> > > > >
> > >>> > > > > > how big is your dataset?
> > >>> > > > > >
> > >>> > > > > > On Thu, Oct 27, 2011 at 9:23 AM, Marco Cadetg <
> > >>> marco@zattoo.com>
> > >>> > > > wrote:
> > >>> > > > > >
> > >>> > > > > > > Thanks Bill and Norbert that seems like what I was
> looking
> > >>> for.
> > >>> > > I'm a
> > >>> > > > > bit
> > >>> > > > > > > worried about
> > >>> > > > > > > how much data/io this could create. But I'll see ;)
> > >>> > > > > > >
> > >>> > > > > > > Cheers
> > >>> > > > > > > -Marco
> > >>> > > > > > >
> > >>> > > > > > > On Thu, Oct 27, 2011 at 6:03 PM, Norbert Burger <
> > >>> > > > > > norbert.burger@gmail.com
> > >>> > > > > > > >wrote:
> > >>> > > > > > >
> > >>> > > > > > > > In case what you're looking for is an analysis over the
> > >>> full
> > >>> > > > learning
> > >>> > > > > > > > duration, and not just the start interval, then one
> > further
> > >>> > > insight
> > >>> > > > > is
> > >>> > > > > > > > that each original record can be transformed into a
> > >>> sequence of
> > >>> > > > > > > > records, where the size of the sequence corresponds to
> > the
> > >>> > > session
> > >>> > > > > > > > duration.  In other words, you can use a UDF to
> "explode"
> > >>> the
> > >>> > > > > original
> > >>> > > > > > > > record:
> > >>> > > > > > > >
> > >>> > > > > > > > 1,marco,1319708213,500,math
> > >>> > > > > > > >
> > >>> > > > > > > > into:
> > >>> > > > > > > >
> > >>> > > > > > > > 1,marco,1319708190,500,math
> > >>> > > > > > > > 1,marco,1319708220,500,math
> > >>> > > > > > > > 1,marco,1319708250,500,math
> > >>> > > > > > > > 1,marco,1319708280,500,math
> > >>> > > > > > > > 1,marco,1319708310,500,math
> > >>> > > > > > > > 1,marco,1319708340,500,math
> > >>> > > > > > > > 1,marco,1319708370,500,math
> > >>> > > > > > > > 1,marco,1319708400,500,math
> > >>> > > > > > > > 1,marco,1319708430,500,math
> > >>> > > > > > > > 1,marco,1319708460,500,math
> > >>> > > > > > > > 1,marco,1319708490,500,math
> > >>> > > > > > > > 1,marco,1319708520,500,math
> > >>> > > > > > > > 1,marco,1319708550,500,math
> > >>> > > > > > > > 1,marco,1319708580,500,math
> > >>> > > > > > > > 1,marco,1319708610,500,math
> > >>> > > > > > > > 1,marco,1319708640,500,math
> > >>> > > > > > > > 1,marco,1319708670,500,math
> > >>> > > > > > > > 1,marco,1319708700,500,math
> > >>> > > > > > > >
> > >>> > > > > > > > and then use Bill's suggestion to group by course,
> > >>> interval.
> > >>> > > > > > > >
> > >>> > > > > > > > Norbert
> > >>> > > > > > > >
> > >>> > > > > > > > On Thu, Oct 27, 2011 at 11:05 AM, Bill Graham <
> > >>> > > > billgraham@gmail.com>
> > >>> > > > > > > > wrote:
> > >>> > > > > > > > > You can pass your time to a udf that rounds it down
> to
> > >>> the
> > >>> > > > nearest
> > >>> > > > > 30
> > >>> > > > > > > > second
> > >>> > > > > > > > > interval and then group by course, interval to get
> > >>> counts for
> > >>> > > > each
> > >>> > > > > > > > course,
> > >>> > > > > > > > > interval.
> > >>> > > > > > > > >
> > >>> > > > > > > > > On Thursday, October 27, 2011, Marco Cadetg <
> > >>> > marco@zattoo.com>
> > >>> > > > > > wrote:
> > >>> > > > > > > > >> I have a problem where I don't know how or if pig is
> > >>> even
> > >>> > > > suitable
> > >>> > > > > > to
> > >>> > > > > > > > > solve
> > >>> > > > > > > > >> it.
> > >>> > > > > > > > >>
> > >>> > > > > > > > >> I have a schema like this:
> > >>> > > > > > > > >>
> > >>> > > > > > > > >> student-id,student-name,start-time,duration,course
> > >>> > > > > > > > >> 1,marco,1319708213,500,math
> > >>> > > > > > > > >> 2,ralf,1319708111,112,english
> > >>> > > > > > > > >> 3,greg,1319708321,333,french
> > >>> > > > > > > > >> 4,diva,1319708444,80,english
> > >>> > > > > > > > >> 5,susanne,1319708123,2000,math
> > >>> > > > > > > > >> 1,marco,1319708564,500,french
> > >>> > > > > > > > >> 2,ralf,1319708789,123,french
> > >>> > > > > > > > >> 7,fred,1319708213,5675,french
> > >>> > > > > > > > >> 8,laura,1319708233,123,math
> > >>> > > > > > > > >> 10,sab,1319708999,777,math
> > >>> > > > > > > > >> 11,fibo,1319708789,565,math
> > >>> > > > > > > > >> 6,dan,1319708456,50,english
> > >>> > > > > > > > >> 9,marco,1319708123,60,english
> > >>> > > > > > > > >> 12,bo,1319708456,345,math
> > >>> > > > > > > > >> 1,marco,1319708789,673,math
> > >>> > > > > > > > >> ...
> > >>> > > > > > > > >> ...
> > >>> > > > > > > > >>
> > >>> > > > > > > > >> I would like to retrieve a graph (interpolation)
> over
> > >>> time
> > >>> > > > grouped
> > >>> > > > > > by
> > >>> > > > > > > > >> course. Meaning how many students are learning for a
> > >>> course
> > >>> > > > based
> > >>> > > > > on
> > >>> > > > > > a
> > >>> > > > > > > > 30
> > >>> > > > > > > > >> sec interval.
> > >>> > > > > > > > >> The grouping by course is easy but from there I've
> no
> > >>> clue
> > >>> > > how I
> > >>> > > > > > would
> > >>> > > > > > > > >> achieve the rest. I guess the rest needs to be
> > achieved
> > >>> via
> > >>> > > some
> > >>> > > > > UDF
> > >>> > > > > > > > >> or is there any way to do this in pig? I often
> think
> > >>> that I
> > >>> > > > need
> > >>> > > > > a
> > >>> > > > > > > "for
> > >>> > > > > > > > >> loop" or something similar in pig.
> > >>> > > > > > > > >>
> > >>> > > > > > > > >> Thanks for your help!
> > >>> > > > > > > > >> -Marco
> > >>> > > > > > > > >>
> > >>> > > > > > > > >
> > >>> > > > > > > >
> > >>> > > > > > >
> > >>> > > > > >
> > >>> > > > >
> > >>> > > >
> > >>> > >
> > >>> >
> > >>>
> > >>
> > >>
> > >
> >
>

Re: creating a graph over time

Posted by Jonathan Coveney <jc...@gmail.com>.
I'll make it less hideous and submit a patch this weekend, then :)

2011/11/2 Ashutosh Chauhan <ha...@apache.org>

> Hey Jon,
>
> Your windowing udf will be very useful outside of this particular usecase.
> It will be great if you can contribute it to PiggyBank.
>
> Thanks,
> Ashutosh
>
> On Tue, Nov 1, 2011 at 10:44, Jonathan Coveney <jc...@gmail.com> wrote:
>
> > Okie dokie. So first, let's clarify and simplify the problem a little,
> > especially to ensure that I know what is going on.
> >
> > Let's first just focus on a particular class. This is ok since presumably
> > each class is independent. Now, we have user_id, start_time, and end_time
> > (start_time+duration). If I understand correctly, a user_id should be
> > included up to end_time+30s, since this is a 30s moving window. As such,
> > we'll just ignore that side of things for now, because you can just
> > transform people's start times accordingly. Further, the assumption is
> that
> > for a given user_id, you will not have overlapping start and end
> > times...you can have multiple entries, ie "user 1, start 1, end 3; user
> 1,
> > start 5, end 7;" but you can't have them in this form: "user 1, start 1,
> > end 3; user 1, start 2, end 4."
> >
> > So we have simplified the question to this: given: user_id, start_time,
> > and end_time (which never overlap), how can I get a count of unique users
> > for every second? So now we will design a UDF to generate that output as
> a
> > bag of (time, # of people) pairs, for every second from min(start_time)
> to
> > max(end_time). The UDF will accept a bag sorted on the start time. Now,
> as
> > I write it it's going to be a simple evalfunc, but it should be an
> > accumulator. It's easy to make the transition.
> >
> > Here is what you do. Initialize a PriorityQueue. The natural ordering for
> > int and long is fine, as it will ensure that when we poll it, we'll get
> the
> > earliest end time, which is what we want.
> >
> > So step one is to pull the first tuple, and get the start_time and
> > end_time. The start time will set our time to start_time (which is
> > min(start_time) since it was sorted on start_time), and we add the
> end_time
> > to the priority queue. We have a counter "uniques" which we increment.
> >
> > Now, before we actually do increment, we grab the next tuple. Why do you
> > do this instead of go to the next end time? Because we don't know if
> > someone starts in between now and the next end time. So we grab the tuple
> > and get its start and end time. Now there are two cases.
> >
> > Case 1: the start time is less than the head of the priority queue, via a
> > peek. If this is the case, then we can safely increment up to the
> > start_time we just got, and then go from there. This is because it's
> > impossible for there to be a new end_time less than the start_time we
> just
> > got, because they are ordered by start_time and end_time>start_time. So
> we
> > add the new end_time, and then we increment our timer until we get to the
> > new start_time we just got, and add (timer,unique) at each step. When we
> > get to start_time, we unique++. Now we get the next tuple and repeat.
> >
> > Case 2: the start time comes after the head of the priority queue, via a
> > peek. If this is the case, then we need to increment up to the current
> > head, emitting (timer,unique). Then when we get to the time_value equal
> to
> > that end_time, we unique--, and check again if the start_time comes
> before
> > the head of the priority queue. Until it does, we repeat step 2.
> Once
> > it does, we do step 1.
> >
> > I've attached a crude, untested UDF that does this. Buyer beware. But it
> > shows the general flow, and should be better than exploding the data (I
> > really hate exploding data like that unless it's absolutely necessary).
> >
> > To use, generate some data, then...
> >
> > register window.jar;
> > define window com.jcoveney.Window('30');
> > a = load 'data' using PigStorage(',') as (uid:long,start:long,end:long);
> > b = foreach (group a all) {
> >   ord = order a by start asc;
> >   generate flatten(window(ord));
> > }
> > dump b;
> >
> > to generate data, I first did just  a small subsample just to think about
> > it, then I did (in python)
> >
> > import random
> > f=open("data","w")
> > for i in range(0,1000000):
> >   v1=random.randint(1,10000000)
> >   v2=random.randint(1,10000000)
> >   start=min(v1,v2)
> >   stop=max(v1,v2)
> >   print >>f,"%i,%i,%i" % (i,start,stop)
> >
> > If this function is at all useful, I can clean it up and put it in the
> > piggybank. Let me know if the logic doesn't make sense, or if it isn't
> > quite what you had in mind.
> >
> > Jon
> >
> >
> > 2011/11/1 Marco Cadetg <ma...@zattoo.com>
> >
> >> Thanks again for all your comments.
> >>
> >> Jonathan, would you mind enlightening me on the way you would keep track
> >> of the
> >> people you need to "eject". I don't get the min heap based tuple...
> >>
> >> Cheers
> >> -Marco
> >>
> >>
> >> On Mon, Oct 31, 2011 at 6:15 PM, Jonathan Coveney <jcoveney@gmail.com
> >wrote:
> >>
> >>> Perhaps I'm misunderstanding your use case, and this depends on the
> >>> amount
> >>> of data, but you could consider something like this (to avoid exploding
> >>> the
> >>> data, which could perhaps be unavoidable but I hate resorting to that
> if
> >>> I
> >>> don't have to).
> >>>
> >>> a = foreach yourdata generate student_id, start_time,
> start_time+duration
> >>> as end_time, course;
> >>> b = group a by course;
> >>> c = foreach b {
> >>>  ord = order a by start_time;
> >>>  generate yourudf.process(ord);
> >>> }
> >>>
> >>> Here is generally what process could do. It would be an accumulator UDF
> >>> that expected tuples sorted on start_time. Now you basically need a way
> >>> to
> >>> know who the distinct users are. Now, since you want 30s windows, your
> >>> first window will presumably be 30s after the first start_time in your
> >>> data, and you would just tick ahead in 1s and write to a bag which
> would
> >>> have second, # of distinct student_ids. To know when to eject people,
> you
> >>> could have any number of data structures... perhaps a min heap based on
> >>> end_time, and of course instead of "ticking" ahead, you would grab a
> new
> >>> tuple (since this is the only thing that would change the state of the
> #
> >>> of
> >>> distinct ids), and then do all of the ticking ahead as you adjust the
> >>> heap
> >>> and write the seconds in between the current time pointer and the
> >>> start_time of the new tuple, making sure in each step to check against
> >>> the
> >>> min heap to eject any users that expired.
> >>>
> >>> That was a little rambly, I could quickly put together some more
> >>> reasonable
> >>> pseudocode if that would help. I think the general idea is clear
> >>> though...
> >>>
> >>> 2011/10/31 Guy Bayes <fa...@gmail.com>
> >>>
> >>> > ahh TV that explains it
> >>> >
> >>> > 12G data file is a bit too big for R unless you sample, not sure if
> >>> the use
> >>> > case is conducive to sampling?
> >>> >
> >>> > If it is, could sample it down and structure in pig/hadoop and then
> >>> load it
> >>> > into the analytical/visualization tool of choice...
> >>> >
> >>> > Guy
> >>> >
> >>> > On Mon, Oct 31, 2011 at 8:55 AM, Marco Cadetg <ma...@zattoo.com>
> >>> wrote:
> >>> >
> >>> > > The data is not about students but about television ;) Regarding
> the
> >>> > size.
> >>> > > The raw input data size is about 150m although when I 'explode' the
> >>> > > timeseries
> >>> > > it will be around 80x bigger. I guess the average user duration
> will
> >>> be
> >>> > > around
> >>> > > 40 Minutes which means when sampling it at a 30s interval will
> >>> increase
> >>> > the
> >>> > > size by ~12GB.
> >>> > >
> >>> > > I think that is a size which my hadoop cluster with five 8-core x
> >>> 8GB x
> >>> > 2TB
> >>> > > HD
> >>> > > should be able to cope with.
> >>> > >
> >>> > > I don't know about R. Are you able to handle 12Gb
> >>> > > files well in R (off course it depends on your computer so assume
> an
> >>> > > average business computer e.g. 2-core 2GHz 4GB ram)?
> >>> > >
> >>> > > Cheers
> >>> > > -Marco
> >>> > >
> >>> > > On Fri, Oct 28, 2011 at 5:02 PM, Guy Bayes <fa...@gmail.com>
> >>> > wrote:
> >>> > >
> >>> > > > if it fits in R, it's trivial, draw a density plot or a
> histogram,
> >>> > about
> >>> > > > three lines of R code
> >>> > > >
> >>> > > > why I was wondering about the data volume.
> >>> > > >
> >>> > > > His example is students attending classes, if  that is really the
> >>> data
> >>> > > hard
> >>> > > > to believe it's super huge?
> >>> > > >
> >>> > > > Guy
> >>> > > >
> >>> > > > On Fri, Oct 28, 2011 at 6:12 AM, Norbert Burger <
> >>> > > norbert.burger@gmail.com
> >>> > > > >wrote:
> >>> > > >
> >>> > > > > Perhaps another way to approach this problem is to visualize it
> >>> > > > > geometrically.  You have a long series of class session
> >>> instances,
> >>> > > where
> >>> > > > > each class session is like 1D line segment, beginning/stopping
> at
> >>> > some
> >>> > > > > start/end time.
> >>> > > > >
> >>> > > > > These segments naturally overlap, and I think the question
> you're
> >>> > > asking
> >>> > > > is
> >>> > > > > equivalent to finding the number of overlaps at every
> subsegment.
> >>> > > > >
> >>> > > > > To answer this, you want to first break every class session
> into
> >>> a
> >>> > full
> >>> > > > > list
> >>> > > > > of subsegments, where a subsegment is created by "breaking"
> each
> >>> > class
> >>> > > > > session/segment into multiple parts at the start/end point of
> any
> >>> > other
> >>> > > > > class session.  You can create this full set of subsegments in
> >>> one
> >>> > pass
> >>> > > > by
> >>> > > > > comparing pairwise (CROSS) each start/end point with your
> >>> original
> >>> > list
> >>> > > > of
> >>> > > > > class sessions.
> >>> > > > >
> >>> > > > > Once you have the full list of "broken" segments, then a final
> >>> GROUP
> >>> > > > > BY/COUNT(*) will you give you the number of overlaps.  Seems
> like
> >>> > > > approach
> >>> > > > > would be faster than the previous approach if your class
> >>> sessions are
> >>> > > > very
> >>> > > > > long, or there are many overlaps.
> >>> > > > >
> >>> > > > > Norbert
> >>> > > > >
> >>> > > > > On Thu, Oct 27, 2011 at 4:05 PM, Guy Bayes <
> >>> fatal.error@gmail.com>
> >>> > > > wrote:
> >>> > > > >
> >>> > > > > > how big is your dataset?
> >>> > > > > >
> >>> > > > > > On Thu, Oct 27, 2011 at 9:23 AM, Marco Cadetg <
> >>> marco@zattoo.com>
> >>> > > > wrote:
> >>> > > > > >
> >>> > > > > > > Thanks Bill and Norbert that seems like what I was looking
> >>> for.
> >>> > > I'm a
> >>> > > > > bit
> >>> > > > > > > worried about
> >>> > > > > > > how much data/io this could create. But I'll see ;)
> >>> > > > > > >
> >>> > > > > > > Cheers
> >>> > > > > > > -Marco
> >>> > > > > > >
> >>> > > > > > > On Thu, Oct 27, 2011 at 6:03 PM, Norbert Burger <
> >>> > > > > > norbert.burger@gmail.com
> >>> > > > > > > >wrote:
> >>> > > > > > >
> >>> > > > > > > > In case what you're looking for is an analysis over the
> >>> full
> >>> > > > learning
> >>> > > > > > > > duration, and not just the start interval, then one
> further
> >>> > > insight
> >>> > > > > is
> >>> > > > > > > > that each original record can be transformed into a
> >>> sequence of
> >>> > > > > > > > records, where the size of the sequence corresponds to
> the
> >>> > > session
> >>> > > > > > > > duration.  In other words, you can use a UDF to "explode"
> >>> the
> >>> > > > > original
> >>> > > > > > > > record:
> >>> > > > > > > >
> >>> > > > > > > > 1,marco,1319708213,500,math
> >>> > > > > > > >
> >>> > > > > > > > into:
> >>> > > > > > > >
> >>> > > > > > > > 1,marco,1319708190,500,math
> >>> > > > > > > > 1,marco,1319708220,500,math
> >>> > > > > > > > 1,marco,1319708250,500,math
> >>> > > > > > > > 1,marco,1319708280,500,math
> >>> > > > > > > > 1,marco,1319708310,500,math
> >>> > > > > > > > 1,marco,1319708340,500,math
> >>> > > > > > > > 1,marco,1319708370,500,math
> >>> > > > > > > > 1,marco,1319708400,500,math
> >>> > > > > > > > 1,marco,1319708430,500,math
> >>> > > > > > > > 1,marco,1319708460,500,math
> >>> > > > > > > > 1,marco,1319708490,500,math
> >>> > > > > > > > 1,marco,1319708520,500,math
> >>> > > > > > > > 1,marco,1319708550,500,math
> >>> > > > > > > > 1,marco,1319708580,500,math
> >>> > > > > > > > 1,marco,1319708610,500,math
> >>> > > > > > > > 1,marco,1319708640,500,math
> >>> > > > > > > > 1,marco,1319708670,500,math
> >>> > > > > > > > 1,marco,1319708700,500,math
> >>> > > > > > > >
> >>> > > > > > > > and then use Bill's suggestion to group by course,
> >>> interval.
> >>> > > > > > > >
> >>> > > > > > > > Norbert
> >>> > > > > > > >
> >>> > > > > > > > On Thu, Oct 27, 2011 at 11:05 AM, Bill Graham <
> >>> > > > billgraham@gmail.com>
> >>> > > > > > > > wrote:
> >>> > > > > > > > > You can pass your time to a udf that rounds it down to
> >>> the
> >>> > > > nearest
> >>> > > > > 30
> >>> > > > > > > > second
> >>> > > > > > > > > interval and then group by course, interval to get
> >>> counts for
> >>> > > > each
> >>> > > > > > > > course,
> >>> > > > > > > > > interval.
> >>> > > > > > > > >
> >>> > > > > > > > > On Thursday, October 27, 2011, Marco Cadetg <
> >>> > marco@zattoo.com>
> >>> > > > > > wrote:
> >>> > > > > > > > >> I have a problem where I don't know how or if pig is
> >>> even
> >>> > > > suitable
> >>> > > > > > to
> >>> > > > > > > > > solve
> >>> > > > > > > > >> it.
> >>> > > > > > > > >>
> >>> > > > > > > > >> I have a schema like this:
> >>> > > > > > > > >>
> >>> > > > > > > > >> student-id,student-name,start-time,duration,course
> >>> > > > > > > > >> 1,marco,1319708213,500,math
> >>> > > > > > > > >> 2,ralf,1319708111,112,english
> >>> > > > > > > > >> 3,greg,1319708321,333,french
> >>> > > > > > > > >> 4,diva,1319708444,80,english
> >>> > > > > > > > >> 5,susanne,1319708123,2000,math
> >>> > > > > > > > >> 1,marco,1319708564,500,french
> >>> > > > > > > > >> 2,ralf,1319708789,123,french
> >>> > > > > > > > >> 7,fred,1319708213,5675,french
> >>> > > > > > > > >> 8,laura,1319708233,123,math
> >>> > > > > > > > >> 10,sab,1319708999,777,math
> >>> > > > > > > > >> 11,fibo,1319708789,565,math
> >>> > > > > > > > >> 6,dan,1319708456,50,english
> >>> > > > > > > > >> 9,marco,1319708123,60,english
> >>> > > > > > > > >> 12,bo,1319708456,345,math
> >>> > > > > > > > >> 1,marco,1319708789,673,math
> >>> > > > > > > > >> ...
> >>> > > > > > > > >> ...
> >>> > > > > > > > >>
> >>> > > > > > > > >> I would like to retrieve a graph (interpolation) over
> >>> time
> >>> > > > grouped
> >>> > > > > > by
> >>> > > > > > > > >> course. Meaning how many students are learning for a
> >>> course
> >>> > > > based
> >>> > > > > on
> >>> > > > > > a
> >>> > > > > > > > 30
> >>> > > > > > > > >> sec interval.
> >>> > > > > > > > >> The grouping by course is easy but from there I've no
> >>> clue
> >>> > > how I
> >>> > > > > > would
> >>> > > > > > > > >> achieve the rest. I guess the rest needs to be
> achieved
> >>> via
> >>> > > some
> >>> > > > > UDF
> >>> > > > > > > > >> or is there any way how to this in pig? I often think
> >>> that I
> >>> > > > need
> >>> > > > > a
> >>> > > > > > > "for
> >>> > > > > > > > >> loop" or something similar in pig.
> >>> > > > > > > > >>
> >>> > > > > > > > >> Thanks for your help!
> >>> > > > > > > > >> -Marco
> >>> > > > > > > > >>
> >>> > > > > > > > >
> >>> > > > > > > >
> >>> > > > > > >
> >>> > > > > >
> >>> > > > >
> >>> > > >
> >>> > >
> >>> >
> >>>
> >>
> >>
> >
>

Re: creating a graph over time

Posted by Ashutosh Chauhan <ha...@apache.org>.
Hey Jon,

Your windowing udf will be very useful outside of this particular usecase.
It will be great if you can contribute it to PiggyBank.

Thanks,
Ashutosh

On Tue, Nov 1, 2011 at 10:44, Jonathan Coveney <jc...@gmail.com> wrote:

> Okie dokie. So first, let's clarify and simplify the problem a little,
> especially to ensure that I know what is going on.
>
> Let's first just focus on a particular class. This is ok since presumably
> each class is independent. Now, we have user_id, start_time, and end_time
> (start_time+duration). If I understand correctly, a user_id should be
> included up to end_time+30s, since this is a 30s moving window. As such,
> we'll just ignore that side of things for now, because you can just
> transform people's start times accordingly. Further, the assumption is that
> for a given user_id, you will not have overlapping start and end
> times...you can have multiple entries, ie "user 1, start 1, end 3; user 1,
> start 5, end 7;" but you can't have them in this form: "user 1, start 1,
> end 3; user 1, start 2, end 4."
>
> So we have simplified the question to this: given: user_id, start_time,
> and end_time (which never overlap), how can I get a count of unique users
> for every second? So now we will design a UDF to generate that output as a
> bag of (time, # of people) pairs, for every second from min(start_time) to
> max(end_time). The UDF will accept a bag sorted on the start time. Now, as
> I write it it's going to be a simple evalfunc, but it should be an
> accumulator. It's easy to make the transition.
>
> Here is what you do. Initialize a PriorityQueue. The natural ordering for
> int and long is fine, as it will ensure that when we poll it, we'll get the
> earliest end time, which is what we want.
>
> So step one is to pull the first tuple, and get the start_time and
> end_time. The start time will set our time to start_time (which is
> min(start_time) since it was sorted on start_time), and we add the end_time
> to the priority queue. We have a counter "uniques" which we increment.
>
> Now, before we actually do increment, we grab the next tuple. Why do you
> do this instead of go to the next end time? Because we don't know if
> someone starts in between now and the next end time. So we grab the tuple
> and get its start and end time. Now there are two cases.
>
> Case 1: the start time is less than the head of the priority queue, via a
> peek. If this is the case, then we can safely increment up to the
> start_time we just got, and then go from there. This is because it's
> impossible for there to be a new end_time less than the start_time we just
> got, because they are ordered by start_time and end_time>start_time. So we
> add the new end_time, and then we increment our timer until we get to the
> new start_time we just got, and add (timer,unique) at each step. When we
> get to start_time, we unique++. Now we get the next tuple and repeat.
>
> Case 2: the start time comes after the head of the priority queue, via a
> peek. If this is the case, then we need to increment up to the current
> head, emitting (timer,unique). Then when we get to the time_value equal to
> that end_time, we unique--, and check again if the start_time comes before
> than the head of the priority queue. Until it does, we repeat step 2. Once
> it does, we do step 1.
>
> I've attached a crude, untested UDF that does this. Buyer beware. But it
> shows the general flow, and should be better than exploding the data (I
> really hate exploding data like that unless it's absolutely necessary).
>
> To use, generate some data, then...
>
> register window.jar;
> define window com.jcoveney.Window('30');
> a = load 'data' using PigStorage(',') as (uid:long,start:long,end:long);
> b = foreach (group a all) {
>   ord = order a by start asc;
>   generate flatten(window(ord));
> }
> dump b;
>
> to generate data, I first did just  a small subsample just to think about
> it, then I did (in python)
>
> import random
> f=open("data","w")
> for i in range(0,1000000):
>   v1=random.randint(1,10000000)
>   v2=random.randint(1,10000000)
>   start=min(v1,v2)
>   stop=max(v1,v2)
>   print >>f,"%i,%i,%i" % (i,start,stop)
>
> If this function is at all useful, I can clean it up and put in in the
> piggybank. Let me know if the logic doesn't make sense, or if it isn't
> quite what you had in mind.
>
> Jon
>
>
> 2011/11/1 Marco Cadetg <ma...@zattoo.com>
>
>> Thanks again for all your comments.
>>
>> Jonathan, would you mind to enlighten me on the way you would keep track
>> of the
>> people you need to "eject". I don't get the min heap based tuple...
>>
>> Cheers
>> -Marco
>>
>>
>> On Mon, Oct 31, 2011 at 6:15 PM, Jonathan Coveney <jc...@gmail.com>wrote:
>>
>>> Perhaps I'm misunderstanding your use case, and this depends on the
>>> amount
>>> of data, but you could consider something like this (to avoid exploding
>>> the
>>> data, which could perhaps be inavoidable but I hate resorting to that if
>>> I
>>> don't have to).
>>>
>>> a = foreach yourdata generate student_id, start_time, start_time+duration
>>> as end_time, course;
>>> b = group a by course;
>>> c = foreach b {
>>>  ord = order a by start_time;
>>>  generate yourudf.process(ord);
>>> }
>>>
>>> Here is generally what process could do. It would be an accumulator UDF
>>> that expected tuples sorted on start_time. Now you basically need a way
>>> to
>>> know who the distinct users are. Now, since you want 30s windows, your
>>> first window will presumably be 30s after the first start_time in your
>>> data, and you would just tick ahead in 1s and write to a bag which would
>>> have second, # of distinct student_ids. To know when to eject people, you
>>> could have any number of data structures... perhaps a min heap based on
>>> end_time, and of course instead of "ticking" ahead, you would grab a new
>>> tuple (since this is the only thing that would change the state of the #
>>> of
>>> distinct ids), and then do all of the ticking ahead as you adjust the
>>> heap
>>> and write the seconds in between the current time pointer and the
>>> start_time of the new tuple, making sure in each step to check against
>>> the
>>> min heap to eject any users that expired.
>>>
>>> That was a little rambly, I could quickly put together some more
>>> reasonable
>>> pseudocode if that would help. I think the general idea is clear
>>> though...
>>>
>>> 2011/10/31 Guy Bayes <fa...@gmail.com>
>>>
>>> > ahh TV that explains it
>>> >
>>> > 12G data file is a bit too big for R unless you sample, not sure if
>>> the use
>>> > case is conducive to sampling?
>>> >
>>> > If it is, could sample it down and structure in pig/hadoop and then
>>> load it
>>> > into the analytical/visualization tool of choice...
>>> >
>>> > Guy
>>> >
>>> > On Mon, Oct 31, 2011 at 8:55 AM, Marco Cadetg <ma...@zattoo.com>
>>> wrote:
>>> >
>>> > > The data is not about students but about television ;) Regarding the
>>> > size.
>>> > > The raw input data size is about 150m although when I 'explode' the
>>> > > timeseries
>>> > > it will be around 80x bigger. I guess the average user duration will
>>> be
>>> > > around
>>> > > 40 Minutes which means when sampling it at a 30s interval will
>>> increase
>>> > the
>>> > > size by ~12GB.
>>> > >
>>> > > I think that is a size which my hadoop cluster with five 8-core x
>>> 8GB x
>>> > 2TB
>>> > > HD
>>> > > should be able to cope with.
>>> > >
>>> > > I don't know about R. Are you able to handle 12Gb
>>> > > files well in R (off course it depends on your computer so assume an
>>> > > average business computer e.g. 2-core 2GHz 4GB ram)?
>>> > >
>>> > > Cheers
>>> > > -Marco
>>> > >
>>> > > On Fri, Oct 28, 2011 at 5:02 PM, Guy Bayes <fa...@gmail.com>
>>> > wrote:
>>> > >
>>> > > > if it fits in R, it's trivial, draw a density plot or a histogram,
>>> > about
>>> > > > three lines of R code
>>> > > >
>>> > > > why I was wondering about the data volume.
>>> > > >
>>> > > > His example is students attending classes, if  that is really the
>>> data
>>> > > hard
>>> > > > to believe it's super huge?
>>> > > >
>>> > > > Guy
>>> > > >
>>> > > > On Fri, Oct 28, 2011 at 6:12 AM, Norbert Burger <
>>> > > norbert.burger@gmail.com
>>> > > > >wrote:
>>> > > >
>>> > > > > Perhaps another way to approach this problem is to visualize it
>>> > > > > geometrically.  You have a long series of class session
>>> instances,
>>> > > where
>>> > > > > each class session is like 1D line segment, beginning/stopping at
>>> > some
>>> > > > > start/end time.
>>> > > > >
>>> > > > > These segments naturally overlap, and I think the question you're
>>> > > asking
>>> > > > is
>>> > > > > equivalent to finding the number of overlaps at every subsegment.
>>> > > > >
>>> > > > > To answer this, you want to first break every class session into
>>> a
>>> > full
>>> > > > > list
>>> > > > > of subsegments, where a subsegment is created by "breaking" each
>>> > class
>>> > > > > session/segment into multiple parts at the start/end point of any
>>> > other
>>> > > > > class session.  You can create this full set of subsegments in
>>> one
>>> > pass
>>> > > > by
>>> > > > > comparing pairwise (CROSS) each start/end point with your
>>> original
>>> > list
>>> > > > of
>>> > > > > class sessions.
>>> > > > >
>>> > > > > Once you have the full list of "broken" segments, then a final
>>> GROUP
>>> > > > > BY/COUNT(*) will you give you the number of overlaps.  Seems like
>>> > > > approach
>>> > > > > would be faster than the previous approach if your class
>>> sessions are
>>> > > > very
>>> > > > > long, or there are many overlaps.
>>> > > > >
>>> > > > > Norbert
>>> > > > >
>>> > > > > On Thu, Oct 27, 2011 at 4:05 PM, Guy Bayes <
>>> fatal.error@gmail.com>
>>> > > > wrote:
>>> > > > >
>>> > > > > > how big is your dataset?
>>> > > > > >
>>> > > > > > On Thu, Oct 27, 2011 at 9:23 AM, Marco Cadetg <
>>> marco@zattoo.com>
>>> > > > wrote:
>>> > > > > >
>>> > > > > > > Thanks Bill and Norbert that seems like what I was looking
>>> for.
>>> > > I'm a
>>> > > > > bit
>>> > > > > > > worried about
>>> > > > > > > how much data/io this could create. But I'll see ;)
>>> > > > > > >
>>> > > > > > > Cheers
>>> > > > > > > -Marco
>>> > > > > > >
>>> > > > > > > On Thu, Oct 27, 2011 at 6:03 PM, Norbert Burger <
>>> > > > > > norbert.burger@gmail.com
>>> > > > > > > >wrote:
>>> > > > > > >
>>> > > > > > > > In case what you're looking for is an analysis over the
>>> full
>>> > > > learning
>>> > > > > > > > duration, and not just the start interval, then one further
>>> > > insight
>>> > > > > is
>>> > > > > > > > that each original record can be transformed into a
>>> sequence of
>>> > > > > > > > records, where the size of the sequence corresponds to the
>>> > > session
>>> > > > > > > > duration.  In other words, you can use a UDF to "explode"
>>> the
>>> > > > > original
>>> > > > > > > > record:
>>> > > > > > > >
>>> > > > > > > > 1,marco,1319708213,500,math
>>> > > > > > > >
>>> > > > > > > > into:
>>> > > > > > > >
>>> > > > > > > > 1,marco,1319708190,500,math
>>> > > > > > > > 1,marco,1319708220,500,math
>>> > > > > > > > 1,marco,1319708250,500,math
>>> > > > > > > > 1,marco,1319708280,500,math
>>> > > > > > > > 1,marco,1319708310,500,math
>>> > > > > > > > 1,marco,1319708340,500,math
>>> > > > > > > > 1,marco,1319708370,500,math
>>> > > > > > > > 1,marco,1319708400,500,math
>>> > > > > > > > 1,marco,1319708430,500,math
>>> > > > > > > > 1,marco,1319708460,500,math
>>> > > > > > > > 1,marco,1319708490,500,math
>>> > > > > > > > 1,marco,1319708520,500,math
>>> > > > > > > > 1,marco,1319708550,500,math
>>> > > > > > > > 1,marco,1319708580,500,math
>>> > > > > > > > 1,marco,1319708610,500,math
>>> > > > > > > > 1,marco,1319708640,500,math
>>> > > > > > > > 1,marco,1319708670,500,math
>>> > > > > > > > 1,marco,1319708700,500,math
>>> > > > > > > >
>>> > > > > > > > and then use Bill's suggestion to group by course,
>>> interval.
>>> > > > > > > >
>>> > > > > > > > Norbert
>>> > > > > > > >
>>> > > > > > > > On Thu, Oct 27, 2011 at 11:05 AM, Bill Graham <
>>> > > > billgraham@gmail.com>
>>> > > > > > > > wrote:
>>> > > > > > > > > You can pass your time to a udf that rounds it down to
>>> the
>>> > > > nearest
>>> > > > > 30
>>> > > > > > > > second
>>> > > > > > > > > interval and then group by course, interval to get
>>> counts for
>>> > > > each
>>> > > > > > > > course,
>>> > > > > > > > > interval.
>>> > > > > > > > >
>>> > > > > > > > > On Thursday, October 27, 2011, Marco Cadetg <
>>> > marco@zattoo.com>
>>> > > > > > wrote:
>>> > > > > > > > >> I have a problem where I don't know how or if pig is
>>> even
>>> > > > suitable
>>> > > > > > to
>>> > > > > > > > > solve
>>> > > > > > > > >> it.
>>> > > > > > > > >>
>>> > > > > > > > >> I have a schema like this:
>>> > > > > > > > >>
>>> > > > > > > > >> student-id,student-name,start-time,duration,course
>>> > > > > > > > >> 1,marco,1319708213,500,math
>>> > > > > > > > >> 2,ralf,1319708111,112,english
>>> > > > > > > > >> 3,greg,1319708321,333,french
>>> > > > > > > > >> 4,diva,1319708444,80,english
>>> > > > > > > > >> 5,susanne,1319708123,2000,math
>>> > > > > > > > >> 1,marco,1319708564,500,french
>>> > > > > > > > >> 2,ralf,1319708789,123,french
>>> > > > > > > > >> 7,fred,1319708213,5675,french
>>> > > > > > > > >> 8,laura,1319708233,123,math
>>> > > > > > > > >> 10,sab,1319708999,777,math
>>> > > > > > > > >> 11,fibo,1319708789,565,math
>>> > > > > > > > >> 6,dan,1319708456,50,english
>>> > > > > > > > >> 9,marco,1319708123,60,english
>>> > > > > > > > >> 12,bo,1319708456,345,math
>>> > > > > > > > >> 1,marco,1319708789,673,math
>>> > > > > > > > >> ...
>>> > > > > > > > >> ...
>>> > > > > > > > >>
>>> > > > > > > > >> I would like to retrieve a graph (interpolation) over
>>> time
>>> > > > grouped
>>> > > > > > by
>>> > > > > > > > >> course. Meaning how many students are learning for a
>>> course
>>> > > > based
>>> > > > > on
>>> > > > > > a
>>> > > > > > > > 30
>>> > > > > > > > >> sec interval.
>>> > > > > > > > >> The grouping by course is easy but from there I've no
>>> clue
>>> > > how I
>>> > > > > > would
>>> > > > > > > > >> achieve the rest. I guess the rest needs to be achieved
>>> via
>>> > > some
>>> > > > > UDF
>>> > > > > > > > >> or is there any way how to this in pig? I often think
>>> that I
>>> > > > need
>>> > > > > a
>>> > > > > > > "for
>>> > > > > > > > >> loop" or something similar in pig.
>>> > > > > > > > >>
>>> > > > > > > > >> Thanks for your help!
>>> > > > > > > > >> -Marco
>>> > > > > > > > >>
>>> > > > > > > > >
>>> > > > > > > >
>>> > > > > > >
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>

Re: creating a graph over time

Posted by Jonathan Coveney <jc...@gmail.com>.
Okie dokie. So first, let's clarify and simplify the problem a little,
especially to ensure that I know what is going on.

Let's first just focus on a particular class. This is ok since presumably
each class is independent. Now, we have user_id, start_time, and end_time
(start_time+duration). If I understand correctly, a user_id should be
included up to end_time+30s, since this is a 30s moving window. As such,
we'll just ignore that side of things for now, because you can just
transform people's start times accordingly. Further, the assumption is that
for a given user_id, you will not have overlapping start and end
times...you can have multiple entries, ie "user 1, start 1, end 3; user 1,
start 5, end 7;" but you can't have them in this form: "user 1, start 1,
end 3; user 1, start 2, end 4."

So we have simplified the question to this: given: user_id, start_time, and
end_time (which never overlap), how can I get a count of unique users for
every second? So now we will design a UDF to generate that output as a bag
of (time, # of people) pairs, for every second from min(start_time) to
max(end_time). The UDF will accept a bag sorted on the start time. Now, as
I write it it's going to be a simple evalfunc, but it should be an
accumulator. It's easy to make the transition.

Here is what you do. Initialize a PriorityQueue. The natural ordering for
int and long is fine, as it will ensure that when we poll it, we'll get the
earliest end time, which is what we want.

So step one is to pull the first tuple, and get the start_time and
end_time. The start time will set our time to start_time (which is
min(start_time) since it was sorted on start_time), and we add the end_time
to the priority queue. We have a counter "uniques" which we increment.

Now, before we actually increment, we grab the next tuple. Why do this
instead of going straight to the next end time? Because we don't know whether
someone starts in between now and the next end time. So we grab the tuple and
get its start and end time. Now there are two cases.

Case 1: the start time is less than the head of the priority queue, via a
peek. If this is the case, then we can safely increment up to the
start_time we just got, and then go from there. This is because it's
impossible for there to be a new end_time less than the start_time we just
got, because they are ordered by start_time and end_time>start_time. So we
add the new end_time, and then we increment our timer until we get to the
new start_time we just got, and add (timer,unique) at each step. When we
get to start_time, we unique++. Now we get the next tuple and repeat.

Case 2: the start time comes after the head of the priority queue, via a
peek. If this is the case, then we need to increment up to the current
head, emitting (timer,unique). Then when we get to the time_value equal to
that end_time, we unique--, and check again whether the start_time comes
before the head of the priority queue. Until it does, we repeat step 2. Once
it does, we do step 1.

I've attached a crude, untested UDF that does this. Buyer beware. But it
shows the general flow, and should be better than exploding the data (I
really hate exploding data like that unless it's absolutely necessary).
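
For reference, a minimal standalone sketch of that sweep in plain Java could
look roughly like the following. This is not the attached UDF; the Interval
class and countPerSecond method are just placeholder names, and treating a
session's end second as inclusive is only one choice of boundary convention.

import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class SweepSketch {

  // one viewing interval; the uid itself doesn't matter for the count,
  // since we assume a given uid never overlaps itself
  static class Interval {
    final long start, end;
    Interval(long start, long end) { this.start = start; this.end = end; }
  }

  // emits one {second, count} pair per second from min(start) to max(end);
  // expects the intervals sorted by start, with end > start
  static List<long[]> countPerSecond(List<Interval> sorted) {
    List<long[]> out = new ArrayList<long[]>();
    if (sorted.isEmpty()) return out;

    PriorityQueue<Long> ends = new PriorityQueue<Long>(); // min-heap of end times
    long timer = sorted.get(0).start;                     // current second
    long count = 0;                                       // concurrent viewers

    for (Interval iv : sorted) {
      // case 2: the next start comes after the earliest end in the heap, so
      // emit counts up to that end and retire that viewer before admitting anyone
      while (!ends.isEmpty() && ends.peek() < iv.start) {
        long end = ends.poll();
        for (; timer <= end; timer++) out.add(new long[] { timer, count });
        count--;
      }
      // case 1: safe to emit straight up to the new start, then admit the viewer
      for (; timer < iv.start; timer++) out.add(new long[] { timer, count });
      ends.add(iv.end);
      count++;
    }
    // no more starts are coming, so drain the remaining end times
    while (!ends.isEmpty()) {
      long end = ends.poll();
      for (; timer <= end; timer++) out.add(new long[] { timer, count });
      count--;
    }
    return out;
  }
}

Wrapping something like this in a Pig EvalFunc or Accumulator is then mostly
a matter of pulling (start, end) out of the sorted input bag and emitting each
{second, count} pair as a tuple of the output bag.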

To use, generate some data, then...

register window.jar;
define window com.jcoveney.Window('30');
a = load 'data' using PigStorage(',') as (uid:long,start:long,end:long);
b = foreach (group a all) {
  ord = order a by start asc;
  generate flatten(window(ord));
}
dump b;

To generate data, I first did a small subsample just to think about it, then
I did (in Python)

import random
f=open("data","w")
for i in range(0,1000000):
  v1=random.randint(1,10000000)
  v2=random.randint(1,10000000)
  start=min(v1,v2)
  stop=max(v1,v2)
  print >>f,"%i,%i,%i" % (i,start,stop)

If this function is at all useful, I can clean it up and put it in the
piggybank. Let me know if the logic doesn't make sense, or if it isn't
quite what you had in mind.

Jon

2011/11/1 Marco Cadetg <ma...@zattoo.com>

> Thanks again for all your comments.
>
> Jonathan, would you mind to enlighten me on the way you would keep track
> of the
> people you need to "eject". I don't get the min heap based tuple...
>
> Cheers
> -Marco
>
>
> On Mon, Oct 31, 2011 at 6:15 PM, Jonathan Coveney <jc...@gmail.com>wrote:
>
>> Perhaps I'm misunderstanding your use case, and this depends on the amount
>> of data, but you could consider something like this (to avoid exploding
>> the
>> data, which could perhaps be inavoidable but I hate resorting to that if I
>> don't have to).
>>
>> a = foreach yourdata generate student_id, start_time, start_time+duration
>> as end_time, course;
>> b = group a by course;
>> c = foreach b {
>>  ord = order a by start_time;
>>  generate yourudf.process(ord);
>> }
>>
>> Here is generally what process could do. It would be an accumulator UDF
>> that expected tuples sorted on start_time. Now you basically need a way to
>> know who the distinct users are. Now, since you want 30s windows, your
>> first window will presumably be 30s after the first start_time in your
>> data, and you would just tick ahead in 1s and write to a bag which would
>> have second, # of distinct student_ids. To know when to eject people, you
>> could have any number of data structures... perhaps a min heap based on
>> end_time, and of course instead of "ticking" ahead, you would grab a new
>> tuple (since this is the only thing that would change the state of the #
>> of
>> distinct ids), and then do all of the ticking ahead as you adjust the heap
>> and write the seconds in between the current time pointer and the
>> start_time of the new tuple, making sure in each step to check against the
>> min heap to eject any users that expired.
>>
>> That was a little rambly, I could quickly put together some more
>> reasonable
>> pseudocode if that would help. I think the general idea is clear though...
>>
>> 2011/10/31 Guy Bayes <fa...@gmail.com>
>>
>> > ahh TV that explains it
>> >
>> > 12G data file is a bit too big for R unless you sample, not sure if the
>> use
>> > case is conducive to sampling?
>> >
>> > If it is, could sample it down and structure in pig/hadoop and then
>> load it
>> > into the analytical/visualization tool of choice...
>> >
>> > Guy
>> >
>> > On Mon, Oct 31, 2011 at 8:55 AM, Marco Cadetg <ma...@zattoo.com> wrote:
>> >
>> > > The data is not about students but about television ;) Regarding the
>> > size.
>> > > The raw input data size is about 150m although when I 'explode' the
>> > > timeseries
>> > > it will be around 80x bigger. I guess the average user duration will
>> be
>> > > around
>> > > 40 Minutes which means when sampling it at a 30s interval will
>> increase
>> > the
>> > > size by ~12GB.
>> > >
>> > > I think that is a size which my hadoop cluster with five 8-core x 8GB
>> x
>> > 2TB
>> > > HD
>> > > should be able to cope with.
>> > >
>> > > I don't know about R. Are you able to handle 12Gb
>> > > files well in R (off course it depends on your computer so assume an
>> > > average business computer e.g. 2-core 2GHz 4GB ram)?
>> > >
>> > > Cheers
>> > > -Marco
>> > >
>> > > On Fri, Oct 28, 2011 at 5:02 PM, Guy Bayes <fa...@gmail.com>
>> > wrote:
>> > >
>> > > > if it fits in R, it's trivial, draw a density plot or a histogram,
>> > about
>> > > > three lines of R code
>> > > >
>> > > > why I was wondering about the data volume.
>> > > >
>> > > > His example is students attending classes, if  that is really the
>> data
>> > > hard
>> > > > to believe it's super huge?
>> > > >
>> > > > Guy
>> > > >
>> > > > On Fri, Oct 28, 2011 at 6:12 AM, Norbert Burger <
>> > > norbert.burger@gmail.com
>> > > > >wrote:
>> > > >
>> > > > > Perhaps another way to approach this problem is to visualize it
>> > > > > geometrically.  You have a long series of class session instances,
>> > > where
>> > > > > each class session is like 1D line segment, beginning/stopping at
>> > some
>> > > > > start/end time.
>> > > > >
>> > > > > These segments naturally overlap, and I think the question you're
>> > > asking
>> > > > is
>> > > > > equivalent to finding the number of overlaps at every subsegment.
>> > > > >
>> > > > > To answer this, you want to first break every class session into a
>> > full
>> > > > > list
>> > > > > of subsegments, where a subsegment is created by "breaking" each
>> > class
>> > > > > session/segment into multiple parts at the start/end point of any
>> > other
>> > > > > class session.  You can create this full set of subsegments in one
>> > pass
>> > > > by
>> > > > > comparing pairwise (CROSS) each start/end point with your original
>> > list
>> > > > of
>> > > > > class sessions.
>> > > > >
>> > > > > Once you have the full list of "broken" segments, then a final
>> GROUP
>> > > > > BY/COUNT(*) will you give you the number of overlaps.  Seems like
>> > > > approach
>> > > > > would be faster than the previous approach if your class sessions
>> are
>> > > > very
>> > > > > long, or there are many overlaps.
>> > > > >
>> > > > > Norbert
>> > > > >
>> > > > > On Thu, Oct 27, 2011 at 4:05 PM, Guy Bayes <fatal.error@gmail.com
>> >
>> > > > wrote:
>> > > > >
>> > > > > > how big is your dataset?
>> > > > > >
>> > > > > > On Thu, Oct 27, 2011 at 9:23 AM, Marco Cadetg <marco@zattoo.com
>> >
>> > > > wrote:
>> > > > > >
>> > > > > > > Thanks Bill and Norbert that seems like what I was looking
>> for.
>> > > I'm a
>> > > > > bit
>> > > > > > > worried about
>> > > > > > > how much data/io this could create. But I'll see ;)
>> > > > > > >
>> > > > > > > Cheers
>> > > > > > > -Marco
>> > > > > > >
>> > > > > > > On Thu, Oct 27, 2011 at 6:03 PM, Norbert Burger <
>> > > > > > norbert.burger@gmail.com
>> > > > > > > >wrote:
>> > > > > > >
>> > > > > > > > In case what you're looking for is an analysis over the full
>> > > > learning
>> > > > > > > > duration, and not just the start interval, then one further
>> > > insight
>> > > > > is
>> > > > > > > > that each original record can be transformed into a
>> sequence of
>> > > > > > > > records, where the size of the sequence corresponds to the
>> > > session
>> > > > > > > > duration.  In other words, you can use a UDF to "explode"
>> the
>> > > > > original
>> > > > > > > > record:
>> > > > > > > >
>> > > > > > > > 1,marco,1319708213,500,math
>> > > > > > > >
>> > > > > > > > into:
>> > > > > > > >
>> > > > > > > > 1,marco,1319708190,500,math
>> > > > > > > > 1,marco,1319708220,500,math
>> > > > > > > > 1,marco,1319708250,500,math
>> > > > > > > > 1,marco,1319708280,500,math
>> > > > > > > > 1,marco,1319708310,500,math
>> > > > > > > > 1,marco,1319708340,500,math
>> > > > > > > > 1,marco,1319708370,500,math
>> > > > > > > > 1,marco,1319708400,500,math
>> > > > > > > > 1,marco,1319708430,500,math
>> > > > > > > > 1,marco,1319708460,500,math
>> > > > > > > > 1,marco,1319708490,500,math
>> > > > > > > > 1,marco,1319708520,500,math
>> > > > > > > > 1,marco,1319708550,500,math
>> > > > > > > > 1,marco,1319708580,500,math
>> > > > > > > > 1,marco,1319708610,500,math
>> > > > > > > > 1,marco,1319708640,500,math
>> > > > > > > > 1,marco,1319708670,500,math
>> > > > > > > > 1,marco,1319708700,500,math
>> > > > > > > >
>> > > > > > > > and then use Bill's suggestion to group by course, interval.
>> > > > > > > >
>> > > > > > > > Norbert
>> > > > > > > >
>> > > > > > > > On Thu, Oct 27, 2011 at 11:05 AM, Bill Graham <
>> > > > billgraham@gmail.com>
>> > > > > > > > wrote:
>> > > > > > > > > You can pass your time to a udf that rounds it down to the
>> > > > nearest
>> > > > > 30
>> > > > > > > > second
>> > > > > > > > > interval and then group by course, interval to get counts
>> for
>> > > > each
>> > > > > > > > course,
>> > > > > > > > > interval.
>> > > > > > > > >
>> > > > > > > > > On Thursday, October 27, 2011, Marco Cadetg <
>> > marco@zattoo.com>
>> > > > > > wrote:
>> > > > > > > > >> I have a problem where I don't know how or if pig is even
>> > > > suitable
>> > > > > > to
>> > > > > > > > > solve
>> > > > > > > > >> it.
>> > > > > > > > >>
>> > > > > > > > >> I have a schema like this:
>> > > > > > > > >>
>> > > > > > > > >> student-id,student-name,start-time,duration,course
>> > > > > > > > >> 1,marco,1319708213,500,math
>> > > > > > > > >> 2,ralf,1319708111,112,english
>> > > > > > > > >> 3,greg,1319708321,333,french
>> > > > > > > > >> 4,diva,1319708444,80,english
>> > > > > > > > >> 5,susanne,1319708123,2000,math
>> > > > > > > > >> 1,marco,1319708564,500,french
>> > > > > > > > >> 2,ralf,1319708789,123,french
>> > > > > > > > >> 7,fred,1319708213,5675,french
>> > > > > > > > >> 8,laura,1319708233,123,math
>> > > > > > > > >> 10,sab,1319708999,777,math
>> > > > > > > > >> 11,fibo,1319708789,565,math
>> > > > > > > > >> 6,dan,1319708456,50,english
>> > > > > > > > >> 9,marco,1319708123,60,english
>> > > > > > > > >> 12,bo,1319708456,345,math
>> > > > > > > > >> 1,marco,1319708789,673,math
>> > > > > > > > >> ...
>> > > > > > > > >> ...
>> > > > > > > > >>
>> > > > > > > > >> I would like to retrieve a graph (interpolation) over
>> time
>> > > > grouped
>> > > > > > by
>> > > > > > > > >> course. Meaning how many students are learning for a
>> course
>> > > > based
>> > > > > on
>> > > > > > a
>> > > > > > > > 30
>> > > > > > > > >> sec interval.
>> > > > > > > > >> The grouping by course is easy but from there I've no
>> clue
>> > > how I
>> > > > > > would
>> > > > > > > > >> achieve the rest. I guess the rest needs to be achieved
>> via
>> > > some
>> > > > > UDF
>> > > > > > > > >> or is there any way how to this in pig? I often think
>> that I
>> > > > need
>> > > > > a
>> > > > > > > "for
>> > > > > > > > >> loop" or something similar in pig.
>> > > > > > > > >>
>> > > > > > > > >> Thanks for your help!
>> > > > > > > > >> -Marco
>> > > > > > > > >>
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: creating a graph over time

Posted by Marco Cadetg <ma...@zattoo.com>.
Thanks again for all your comments.

Jonathan, would you mind enlightening me on how you would keep track of the
people you need to "eject"? I don't get the min heap based tuple...

Cheers
-Marco

On Mon, Oct 31, 2011 at 6:15 PM, Jonathan Coveney <jc...@gmail.com>wrote:

> Perhaps I'm misunderstanding your use case, and this depends on the amount
> of data, but you could consider something like this (to avoid exploding the
> data, which could perhaps be inavoidable but I hate resorting to that if I
> don't have to).
>
> a = foreach yourdata generate student_id, start_time, start_time+duration
> as end_time, course;
> b = group a by course;
> c = foreach b {
>  ord = order a by start_time;
>  generate yourudf.process(ord);
> }
>
> Here is generally what process could do. It would be an accumulator UDF
> that expected tuples sorted on start_time. Now you basically need a way to
> know who the distinct users are. Now, since you want 30s windows, your
> first window will presumably be 30s after the first start_time in your
> data, and you would just tick ahead in 1s and write to a bag which would
> have second, # of distinct student_ids. To know when to eject people, you
> could have any number of data structures... perhaps a min heap based on
> end_time, and of course instead of "ticking" ahead, you would grab a new
> tuple (since this is the only thing that would change the state of the # of
> distinct ids), and then do all of the ticking ahead as you adjust the heap
> and write the seconds in between the current time pointer and the
> start_time of the new tuple, making sure in each step to check against the
> min heap to eject any users that expired.
>
> That was a little rambly, I could quickly put together some more reasonable
> pseudocode if that would help. I think the general idea is clear though...
>
> 2011/10/31 Guy Bayes <fa...@gmail.com>
>
> > ahh TV that explains it
> >
> > 12G data file is a bit too big for R unless you sample, not sure if the
> use
> > case is conducive to sampling?
> >
> > If it is, could sample it down and structure in pig/hadoop and then load
> it
> > into the analytical/visualization tool of choice...
> >
> > Guy
> >
> > On Mon, Oct 31, 2011 at 8:55 AM, Marco Cadetg <ma...@zattoo.com> wrote:
> >
> > > The data is not about students but about television ;) Regarding the
> > size.
> > > The raw input data size is about 150m although when I 'explode' the
> > > timeseries
> > > it will be around 80x bigger. I guess the average user duration will be
> > > around
> > > 40 Minutes which means when sampling it at a 30s interval will increase
> > the
> > > size by ~12GB.
> > >
> > > I think that is a size which my hadoop cluster with five 8-core x 8GB x
> > 2TB
> > > HD
> > > should be able to cope with.
> > >
> > > I don't know about R. Are you able to handle 12Gb
> > > files well in R (off course it depends on your computer so assume an
> > > average business computer e.g. 2-core 2GHz 4GB ram)?
> > >
> > > Cheers
> > > -Marco
> > >
> > > On Fri, Oct 28, 2011 at 5:02 PM, Guy Bayes <fa...@gmail.com>
> > wrote:
> > >
> > > > if it fits in R, it's trivial, draw a density plot or a histogram,
> > about
> > > > three lines of R code
> > > >
> > > > why I was wondering about the data volume.
> > > >
> > > > His example is students attending classes, if  that is really the
> data
> > > hard
> > > > to believe it's super huge?
> > > >
> > > > Guy
> > > >
> > > > On Fri, Oct 28, 2011 at 6:12 AM, Norbert Burger <
> > > norbert.burger@gmail.com
> > > > >wrote:
> > > >
> > > > > Perhaps another way to approach this problem is to visualize it
> > > > > geometrically.  You have a long series of class session instances,
> > > where
> > > > > each class session is like 1D line segment, beginning/stopping at
> > some
> > > > > start/end time.
> > > > >
> > > > > These segments naturally overlap, and I think the question you're
> > > asking
> > > > is
> > > > > equivalent to finding the number of overlaps at every subsegment.
> > > > >
> > > > > To answer this, you want to first break every class session into a
> > full
> > > > > list
> > > > > of subsegments, where a subsegment is created by "breaking" each
> > class
> > > > > session/segment into multiple parts at the start/end point of any
> > other
> > > > > class session.  You can create this full set of subsegments in one
> > pass
> > > > by
> > > > > comparing pairwise (CROSS) each start/end point with your original
> > list
> > > > of
> > > > > class sessions.
> > > > >
> > > > > Once you have the full list of "broken" segments, then a final
> GROUP
> > > > > BY/COUNT(*) will you give you the number of overlaps.  Seems like
> > > > approach
> > > > > would be faster than the previous approach if your class sessions
> are
> > > > very
> > > > > long, or there are many overlaps.
> > > > >
> > > > > Norbert
> > > > >
> > > > > On Thu, Oct 27, 2011 at 4:05 PM, Guy Bayes <fa...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > how big is your dataset?
> > > > > >
> > > > > > On Thu, Oct 27, 2011 at 9:23 AM, Marco Cadetg <ma...@zattoo.com>
> > > > wrote:
> > > > > >
> > > > > > > Thanks Bill and Norbert that seems like what I was looking for.
> > > I'm a
> > > > > bit
> > > > > > > worried about
> > > > > > > how much data/io this could create. But I'll see ;)
> > > > > > >
> > > > > > > Cheers
> > > > > > > -Marco
> > > > > > >
> > > > > > > On Thu, Oct 27, 2011 at 6:03 PM, Norbert Burger <
> > > > > > norbert.burger@gmail.com
> > > > > > > >wrote:
> > > > > > >
> > > > > > > > In case what you're looking for is an analysis over the full
> > > > learning
> > > > > > > > duration, and not just the start interval, then one further
> > > insight
> > > > > is
> > > > > > > > that each original record can be transformed into a sequence
> of
> > > > > > > > records, where the size of the sequence corresponds to the
> > > session
> > > > > > > > duration.  In other words, you can use a UDF to "explode" the
> > > > > original
> > > > > > > > record:
> > > > > > > >
> > > > > > > > 1,marco,1319708213,500,math
> > > > > > > >
> > > > > > > > into:
> > > > > > > >
> > > > > > > > 1,marco,1319708190,500,math
> > > > > > > > 1,marco,1319708220,500,math
> > > > > > > > 1,marco,1319708250,500,math
> > > > > > > > 1,marco,1319708280,500,math
> > > > > > > > 1,marco,1319708310,500,math
> > > > > > > > 1,marco,1319708340,500,math
> > > > > > > > 1,marco,1319708370,500,math
> > > > > > > > 1,marco,1319708400,500,math
> > > > > > > > 1,marco,1319708430,500,math
> > > > > > > > 1,marco,1319708460,500,math
> > > > > > > > 1,marco,1319708490,500,math
> > > > > > > > 1,marco,1319708520,500,math
> > > > > > > > 1,marco,1319708550,500,math
> > > > > > > > 1,marco,1319708580,500,math
> > > > > > > > 1,marco,1319708610,500,math
> > > > > > > > 1,marco,1319708640,500,math
> > > > > > > > 1,marco,1319708670,500,math
> > > > > > > > 1,marco,1319708700,500,math
> > > > > > > >
> > > > > > > > and then use Bill's suggestion to group by course, interval.
> > > > > > > >
> > > > > > > > Norbert
> > > > > > > >
> > > > > > > > On Thu, Oct 27, 2011 at 11:05 AM, Bill Graham <
> > > > billgraham@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > You can pass your time to a udf that rounds it down to the
> > > > nearest
> > > > > 30
> > > > > > > > second
> > > > > > > > > interval and then group by course, interval to get counts
> for
> > > > each
> > > > > > > > course,
> > > > > > > > > interval.
> > > > > > > > >
> > > > > > > > > On Thursday, October 27, 2011, Marco Cadetg <
> > marco@zattoo.com>
> > > > > > wrote:
> > > > > > > > >> I have a problem where I don't know how or if pig is even
> > > > suitable
> > > > > > to
> > > > > > > > > solve
> > > > > > > > >> it.
> > > > > > > > >>
> > > > > > > > >> I have a schema like this:
> > > > > > > > >>
> > > > > > > > >> student-id,student-name,start-time,duration,course
> > > > > > > > >> 1,marco,1319708213,500,math
> > > > > > > > >> 2,ralf,1319708111,112,english
> > > > > > > > >> 3,greg,1319708321,333,french
> > > > > > > > >> 4,diva,1319708444,80,english
> > > > > > > > >> 5,susanne,1319708123,2000,math
> > > > > > > > >> 1,marco,1319708564,500,french
> > > > > > > > >> 2,ralf,1319708789,123,french
> > > > > > > > >> 7,fred,1319708213,5675,french
> > > > > > > > >> 8,laura,1319708233,123,math
> > > > > > > > >> 10,sab,1319708999,777,math
> > > > > > > > >> 11,fibo,1319708789,565,math
> > > > > > > > >> 6,dan,1319708456,50,english
> > > > > > > > >> 9,marco,1319708123,60,english
> > > > > > > > >> 12,bo,1319708456,345,math
> > > > > > > > >> 1,marco,1319708789,673,math
> > > > > > > > >> ...
> > > > > > > > >> ...
> > > > > > > > >>
> > > > > > > > >> I would like to retrieve a graph (interpolation) over time
> > > > grouped
> > > > > > by
> > > > > > > > >> course. Meaning how many students are learning for a
> course
> > > > based
> > > > > on
> > > > > > a
> > > > > > > > 30
> > > > > > > > >> sec interval.
> > > > > > > > >> The grouping by course is easy but from there I've no clue
> > > how I
> > > > > > would
> > > > > > > > >> achieve the rest. I guess the rest needs to be achieved
> via
> > > some
> > > > > UDF
> > > > > > > > >> or is there any way how to this in pig? I often think
> that I
> > > > need
> > > > > a
> > > > > > > "for
> > > > > > > > >> loop" or something similar in pig.
> > > > > > > > >>
> > > > > > > > >> Thanks for your help!
> > > > > > > > >> -Marco
> > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: creating a graph over time

Posted by Jonathan Coveney <jc...@gmail.com>.
Perhaps I'm misunderstanding your use case, and this depends on the amount
of data, but you could consider something like this (to avoid exploding the
data, which could perhaps be unavoidable, but I hate resorting to that if I
don't have to).

a = foreach yourdata generate student_id, start_time, start_time+duration
as end_time, course;
b = group a by course;
c = foreach b {
  ord = order a by start_time;
  generate yourudf.process(ord);
}

Here is generally what process could do. It would be an accumulator UDF
that expects tuples sorted on start_time. You basically need a way to
know who the distinct users are. Since you want 30s windows, your
first window will presumably be 30s after the first start_time in your
data, and you would just tick ahead in 1s steps and write to a bag holding
(second, # of distinct student_ids) pairs. To know when to eject people, you
could use any number of data structures... perhaps a min heap keyed on
end_time. And instead of "ticking" ahead blindly, you would grab a new
tuple (since this is the only thing that changes the number of distinct
ids), then do all of the ticking ahead as you adjust the heap, writing out
the seconds between the current time pointer and the start_time of the new
tuple and checking the min heap at each step to eject any users that
expired.

That was a little rambly, I could quickly put together some more reasonable
pseudocode if that would help. I think the general idea is clear though...
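
In rough Java, that eject step alone looks something like this (a sketch
only; endTimes, uniques and currentTime stand in for the heap and counters
described above):

import java.util.PriorityQueue;

class EjectSketch {
  // drop everyone whose end_time has already passed the current second,
  // returning the adjusted distinct count
  static long eject(PriorityQueue<Long> endTimes, long uniques, long currentTime) {
    while (!endTimes.isEmpty() && endTimes.peek() < currentTime) {
      endTimes.poll(); // the cheapest-to-expire viewer leaves the window
      uniques--;
    }
    return uniques;
  }
}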

2011/10/31 Guy Bayes <fa...@gmail.com>

> ahh TV that explains it
>
> 12G data file is a bit too big for R unless you sample, not sure if the use
> case is conducive to sampling?
>
> If it is, could sample it down and structure in pig/hadoop and then load it
> into the analytical/visualization tool of choice...
>
> Guy
>
> On Mon, Oct 31, 2011 at 8:55 AM, Marco Cadetg <ma...@zattoo.com> wrote:
>
> > The data is not about students but about television ;) Regarding the
> size.
> > The raw input data size is about 150m although when I 'explode' the
> > timeseries
> > it will be around 80x bigger. I guess the average user duration will be
> > around
> > 40 Minutes which means when sampling it at a 30s interval will increase
> the
> > size by ~12GB.
> >
> > I think that is a size which my hadoop cluster with five 8-core x 8GB x
> 2TB
> > HD
> > should be able to cope with.
> >
> > I don't know about R. Are you able to handle 12Gb
> > files well in R (off course it depends on your computer so assume an
> > average business computer e.g. 2-core 2GHz 4GB ram)?
> >
> > Cheers
> > -Marco
> >
> > On Fri, Oct 28, 2011 at 5:02 PM, Guy Bayes <fa...@gmail.com>
> wrote:
> >
> > > if it fits in R, it's trivial, draw a density plot or a histogram,
> about
> > > three lines of R code
> > >
> > > why I was wondering about the data volume.
> > >
> > > His example is students attending classes, if  that is really the data
> > hard
> > > to believe it's super huge?
> > >
> > > Guy
> > >
> > > On Fri, Oct 28, 2011 at 6:12 AM, Norbert Burger <
> > norbert.burger@gmail.com
> > > >wrote:
> > >
> > > > Perhaps another way to approach this problem is to visualize it
> > > > geometrically.  You have a long series of class session instances,
> > where
> > > > each class session is like 1D line segment, beginning/stopping at
> some
> > > > start/end time.
> > > >
> > > > These segments naturally overlap, and I think the question you're
> > asking
> > > is
> > > > equivalent to finding the number of overlaps at every subsegment.
> > > >
> > > > To answer this, you want to first break every class session into a
> full
> > > > list
> > > > of subsegments, where a subsegment is created by "breaking" each
> class
> > > > session/segment into multiple parts at the start/end point of any
> other
> > > > class session.  You can create this full set of subsegments in one
> pass
> > > by
> > > > comparing pairwise (CROSS) each start/end point with your original
> list
> > > of
> > > > class sessions.
> > > >
> > > > Once you have the full list of "broken" segments, then a final GROUP
> > > > BY/COUNT(*) will you give you the number of overlaps.  Seems like
> > > approach
> > > > would be faster than the previous approach if your class sessions are
> > > very
> > > > long, or there are many overlaps.
> > > >
> > > > Norbert
> > > >
> > > > On Thu, Oct 27, 2011 at 4:05 PM, Guy Bayes <fa...@gmail.com>
> > > wrote:
> > > >
> > > > > how big is your dataset?
> > > > >
> > > > > On Thu, Oct 27, 2011 at 9:23 AM, Marco Cadetg <ma...@zattoo.com>
> > > wrote:
> > > > >
> > > > > > Thanks Bill and Norbert that seems like what I was looking for.
> > I'm a
> > > > bit
> > > > > > worried about
> > > > > > how much data/io this could create. But I'll see ;)
> > > > > >
> > > > > > Cheers
> > > > > > -Marco
> > > > > >

Re: creating a graph over time

Posted by Guy Bayes <fa...@gmail.com>.
Ahh, TV, that explains it.

A 12 GB data file is a bit too big for R unless you sample; not sure if the
use case is conducive to sampling?

If it is, you could sample it down and structure it in Pig/Hadoop, then load
it into the analytical/visualization tool of your choice...

Guy


Re: creating a graph over time

Posted by Marco Cadetg <ma...@zattoo.com>.
The data is not about students but about television ;) Regarding the size:
the raw input data is about 150 MB, but when I 'explode' the time series it
will be around 80x bigger. I guess the average user duration will be around
40 minutes, which at a 30 s sampling interval is about 80 records per
session, so the size grows to roughly 12 GB (150 MB x 80).

I think that is a size which my Hadoop cluster (five nodes, each 8-core with
8 GB RAM and a 2 TB disk) should be able to cope with.

I don't know about R. Are you able to handle 12 GB files well in R (of
course it depends on your computer, so assume an average business machine,
e.g. 2-core 2 GHz with 4 GB RAM)?

Cheers
-Marco


Re: creating a graph over time

Posted by Guy Bayes <fa...@gmail.com>.
If it fits in R, it's trivial: draw a density plot or a histogram, about
three lines of R code.

That's why I was wondering about the data volume.

His example is students attending classes; if that is really the data, it's
hard to believe it's super huge?

Guy


Re: creating a graph over time

Posted by Norbert Burger <no...@gmail.com>.
Perhaps another way to approach this problem is to visualize it
geometrically.  You have a long series of class session instances, where
each class session is like a 1D line segment, beginning/stopping at some
start/end time.

These segments naturally overlap, and I think the question you're asking is
equivalent to finding the number of overlaps at every subsegment.

To answer this, you want to first break every class session into a full list
of subsegments, where a subsegment is created by "breaking" each class
session/segment into multiple parts at the start/end point of any other
class session.  You can create this full set of subsegments in one pass by
comparing pairwise (CROSS) each start/end point with your original list of
class sessions.

Once you have the full list of "broken" segments, then a final GROUP
BY/COUNT(*) will give you the number of overlaps.  This approach seems like
it would be faster than the previous one if your class sessions are very
long, or there are many overlaps.
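
For intuition, here is a small in-memory sketch of the same idea in plain
Java (not the Pig pipeline itself, and the class/record names are just
illustrative): collect every start/end point, treat each pair of adjacent
points as a subsegment, and count how many sessions cover it. The three
sessions below are the "math" sessions from the sample data, with end =
start + duration.

import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class OverlapSketch {
    // A class session as a half-open interval [start, end).
    record Session(long start, long end) {}

    public static void main(String[] args) {
        List<Session> sessions = List.of(
                new Session(1319708213L, 1319708713L),
                new Session(1319708123L, 1319710123L),
                new Session(1319708233L, 1319708356L));

        // 1. Every start/end point is a potential break point.
        TreeSet<Long> breakPoints = new TreeSet<>();
        for (Session s : sessions) {
            breakPoints.add(s.start());
            breakPoints.add(s.end());
        }

        // 2. Each pair of adjacent break points is a subsegment; count the
        //    sessions that fully cover it (the CROSS + GROUP BY/COUNT step).
        List<Long> points = new ArrayList<>(breakPoints);
        for (int i = 0; i + 1 < points.size(); i++) {
            long lo = points.get(i);
            long hi = points.get(i + 1);
            long overlapping = sessions.stream()
                    .filter(s -> s.start() <= lo && s.end() >= hi)
                    .count();
            System.out.println("[" + lo + ", " + hi + ") -> " + overlapping);
        }
    }
}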

Norbert


Re: creating a graph over time

Posted by Guy Bayes <fa...@gmail.com>.
how big is your dataset?


Re: creating a graph over time

Posted by Marco Cadetg <ma...@zattoo.com>.
Thanks Bill and Norbert, that seems like what I was looking for. I'm a bit
worried about how much data/IO this could create. But I'll see ;)

Cheers
-Marco


Re: creating a graph over time

Posted by Norbert Burger <no...@gmail.com>.
In case what you're looking for is an analysis over the full learning
duration, and not just the start interval, then one further insight is
that each original record can be transformed into a sequence of
records, where the size of the sequence corresponds to the session
duration.  In other words, you can use a UDF to "explode" the original
record:

1,marco,1319708213,500,math

into:

1,marco,1319708190,500,math
1,marco,1319708220,500,math
1,marco,1319708250,500,math
1,marco,1319708280,500,math
1,marco,1319708310,500,math
1,marco,1319708340,500,math
1,marco,1319708370,500,math
1,marco,1319708400,500,math
1,marco,1319708430,500,math
1,marco,1319708460,500,math
1,marco,1319708490,500,math
1,marco,1319708520,500,math
1,marco,1319708550,500,math
1,marco,1319708580,500,math
1,marco,1319708610,500,math
1,marco,1319708640,500,math
1,marco,1319708670,500,math
1,marco,1319708700,500,math

and then use Bill's suggestion to group by course, interval.
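
A rough sketch of what such an "explode" UDF could look like as a Pig
EvalFunc returning a bag of 30-second bucket timestamps (the class name and
argument order are made up for illustration); in the script you would
FLATTEN the bag and then group on (course, bucket):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Takes (start_time, duration) and returns one tuple per 30-second bucket
// the session touches, e.g. (1319708213, 500) -> 1319708190 ... 1319708700.
public class ExplodeSession extends EvalFunc<DataBag> {
    private static final long INTERVAL = 30L;
    private static final TupleFactory tuples = TupleFactory.getInstance();
    private static final BagFactory bags = BagFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2) {
            return null;
        }
        long start = ((Number) input.get(0)).longValue();
        long duration = ((Number) input.get(1)).longValue();

        DataBag out = bags.newDefaultBag();
        long first = (start / INTERVAL) * INTERVAL;             // bucket containing the start
        long last = ((start + duration) / INTERVAL) * INTERVAL; // bucket containing the end
        for (long bucket = first; bucket <= last; bucket += INTERVAL) {
            Tuple t = tuples.newTuple(1);
            t.set(0, bucket);
            out.add(t);
        }
        return out;
    }
}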

Norbert


Re: creating a graph over time

Posted by Bill Graham <bi...@gmail.com>.
You can pass your time to a UDF that rounds it down to the nearest 30-second
interval and then group by course, interval to get counts for each course,
interval.
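
A minimal sketch of such a rounding UDF (the class name is just illustrative,
not an existing Pig function), assuming the start time is a Unix timestamp
in seconds:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Rounds a Unix timestamp (in seconds) down to the nearest 30-second boundary.
public class RoundToInterval extends EvalFunc<Long> {
    private static final long INTERVAL = 30L;

    @Override
    public Long exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        long ts = ((Number) input.get(0)).longValue();
        return (ts / INTERVAL) * INTERVAL; // integer division floors to the boundary
    }
}

The script would then GROUP the records BY (course, RoundToInterval(start_time))
and take a COUNT per group.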


Re: creating a graph over time

Posted by pablomar <pa...@gmail.com>.
You can loop from Python. I've never tried it, but there is a pretty good
explanation here (
http://ofps.oreilly.com/titles/9781449302641/embedding.html ).
Recently I had to analyze some log files and needed to loop (to calculate
some stats), and I used a UDF.

In your case, I would go with Bill's proposal.


