Posted to user@pig.apache.org by Kris Coward <kr...@melon.org> on 2010/12/17 20:31:14 UTC

Cumulative totals in an ORDERed relation.

Hello,

Is there some sort of mechanism by which I could cause a value to
accumulate within a relation? What I'd like to do is something along the
lines of having a long called accumulator, and an outer bag called
hourlyTotals with a schema of (hour:int, collected:int)

accumulator = 0L; -- I know this line doesn't work
ORDER hourlyTotals BY collected;
cumulativeTotals = FOREACH hourlyTotals {
			accumulator += collected;
			GENERATE day, accumulator AS collected;
			}

Could something like this be made to work? Is there something similar that
I can do instead? Do I just need to pipe the relation through an
external script to get what I want?

Thanks,
Kris

-- 
Kris Coward					http://unripe.melon.org/
GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3

Re: Cumulative totals in an ORDERed relation.

Posted by Kris Coward <kr...@melon.org>.
Right, that's a good point; it is a non-parallelizable process. I
probably should just dump it through a script, since even an entire
century of data would be <1M hours and wouldn't really need to take
advantage of the cluster. ISTR there's some pretty good functionality
for that, so I just need to look it up in the documentation again.

Thanks,
Kris
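[Editor's note: the functionality Kris is recalling is Pig's STREAM operator. A minimal sketch of such a script — a one-pass running total over tab-separated (hour, collected) rows arriving on stdin, already sorted; the field layout is illustrative, not from the original mail:]

```python
#!/usr/bin/env python
# running_total.py -- emit a cumulative total over tab-separated
# (hour, collected) rows. Assumes the input is already sorted.
import sys

def running_totals(lines):
    """Yield "hour\\ttotal" rows, where total accumulates across rows."""
    total = 0
    for line in lines:
        hour, collected = line.rstrip("\n").split("\t")
        total += int(collected)
        yield "%s\t%d" % (hour, total)

if __name__ == "__main__":
    for row in running_totals(sys.stdin):
        print(row)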

On Fri, Dec 17, 2010 at 03:22:53PM -0800, Dmitriy Ryaboy wrote:
> What you are suggesting seems to be a fundamentally single-threaded process
> (well, it can be parallelized, but it's not pretty and involves multiple
> passes), so it's not a good fit for the map-reduce paradigm (how would you
> do accumulative totals for 25 billion entries?).  Pig tends to avoid
> implementing methods that restrict scaling computations in this way. Your
> idea of streaming through a script would work; you could also write an
> accumulative UDF and use it on the result of doing a GROUP ALL on your
> relation.
> 
> -Dmitriy
> 
> On Fri, Dec 17, 2010 at 11:31 AM, Kris Coward <kr...@melon.org> wrote:
> 
> > Hello,
> >
> > Is there some sort of mechanism by which I could cause a value to
> > accumulate within a relation? What I'd like to do is something along the
> > lines of having a long called accumulator, and an outer bag called
> > hourlyTotals with a schema of (hour:int, collected:int)
> >
> > accumulator = 0L; -- I know this line doesn't work
> > ORDER hourlyTotals BY collected;
> > cumulativeTotals = FOREACH hourlyTotals {
> >                        accumulator += collected;
> >                        GENERATE day, accumulator AS collected;
> >                        }
> >
> > Could something like this be made to work? Is there something similar that
> > I can do instead? Do I just need to pipe the relation through an
> > external script to get what I want?
> >
> > Thanks,
> > Kris
> >
> > --
> > Kris Coward                                     http://unripe.melon.org/
> > GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3
> >

Re: Cumulative totals in an ORDERed relation.

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
My interpretation was that he wants something more like this:

in: {2, 5, 7, 1, 1, 3}
out: {2, 7, 14, 15, 16, 19}

.. which you can't get using a simple group/count.
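[Editor's note: in other words, a prefix sum over the ordered values — sketched in Python for reference:]

```python
from itertools import accumulate

# Running (prefix) sum over the ordered values from the example above.
values = [2, 5, 7, 1, 1, 3]
print(list(accumulate(values)))  # [2, 7, 14, 15, 16, 19]
```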

-D

On Fri, Dec 17, 2010 at 3:36 PM, Zach Bailey <za...@dataclip.com> wrote:

>
>  Forgive me but I got one thing slightly wrong. Since you're wanting to do
> hourly totals and not daily totals you will want to change this line:
>
> > allDataISODates = FOREACH allData GENERATE string,
> org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToDay(org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO(ts))
> as isoHour;
> >
> >
> >
> >
> to this:
>
>
> allDataISODates = FOREACH allData GENERATE string,
> org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToHour(org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO(ts))
> as isoHour;
>
>
> Of course I just illustrated how easy it is to swap in different piggybank
> functions to do different statistical roll-ups depending on what sort of
> temporal granularity you need. Huzzah!
>
> Happy pigging,
> Zach
>
>
> On Friday, December 17, 2010 at 6:32 PM, Zach Bailey wrote:
>
> >
> >  I believe what you're trying to do is this. You have some sort of data,
> and a timestamp:
> >
> >
> > What you want to figure out is how many times each possible value of
> "data" appears in a certain time period (say, hourly).
> >
> >
> > Let's say data can have three possible string values: {'a', 'b', 'c'}
> >
> >
> > Your timestamp for convenience sake is a Unix UTC timestamp or ISO
> formatted date (I would strongly recommend using one of these since there
> are already piggybank functions to slice and dice them).
> >
> >
> > To accumulate all the times that the data 'a' appeared in an hour you
> would do something like this:
> >
> >
> > --register piggybank.jar for iso date functions
> > REGISTER ./piggybank.jar
> > allData = load ... as (string:chararray, ts:long);
> > --convert ts to ISO Date, and truncate to the hour
> > allDataISODates = FOREACH allData GENERATE string,
> org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToDay(org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO(ts))
> as isoHour;
> > -- group by hour and string
> > groupedByStringAndHour = GROUP allDataISODates BY (string, isoHour);
> > -- append counts
> > stringHourCounts = FOREACH groupedByStringAndHour GENERATE group.string
> as string, group.isoHour as isoHour, COUNT(allDataISODates.string) as count;
> >
> >
> > You will now have a relation that looks like:
> > {'a', '2010-12-13T12:00:00', 2334}
> > {'b', '2010-12-13T12:00:00', 123}
> > {'c', '2010-12-13T12:00:00', 3}
> > {'a', '2010-12-13T13:00:00', 34231}
> > {'b', '2010-12-13T13:00:00', 34}
> > {'c', '2010-12-13T13:00:00', 134}
> >
> >
> > Is that the sort of thing you're looking to do?
> >
> > -Zach
> >
> >
> > On Friday, December 17, 2010 at 6:22 PM, Dmitriy Ryaboy wrote:
> >
> > > What you are suggesting seems to be a fundamentally single-threaded
> process
> > > (well, it can be parallelized, but it's not pretty and involves
> multiple
> > > passes), so it's not a good fit for the map-reduce paradigm (how would
> you
> > > do accumulative totals for 25 billion entries?). Pig tends to avoid
> > > implementing methods that restrict scaling computations in this way.
> Your
> > > idea of streaming through a script would work; you could also write an
> > > accumulative UDF and use it on the result of doing a GROUP ALL on your
> > > relation.
> > >
> > > -Dmitriy
> > >
> > > On Fri, Dec 17, 2010 at 11:31 AM, Kris Coward <kr...@melon.org> wrote:
> > >
> > >
> > > >  Hello,
> > > >
> > > >  Is there some sort of mechanism by which I could cause a value to
> > > >  accumulate within a relation? What I'd like to do is something along
> the
> > > >  lines of having a long called accumulator, and an outer bag called
> > > >  hourlyTotals with a schema of (hour:int, collected:int)
> > > >
> > > >  accumulator = 0L; -- I know this line doesn't work
> > > >  ORDER hourlyTotals BY collected;
> > > >  cumulativeTotals = FOREACH hourlyTotals {
> > > >  accumulator += collected;
> > > >  GENERATE day, accumulator AS collected;
> > > >  }
> > > >
> > > >  Could something like this be made to work? Is there something
> similar that
> > > >  I can do instead? Do I just need to pipe the relation through an
> > > >  external script to get what I want?
> > > >
> > > >  Thanks,
> > > >  Kris
> > > >
> > > >  --
> > > >  Kris Coward                                     http://unripe.melon.org/
> > > >  GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> > >
> >
> >
> >
> >
> >
> >
> >
>
>
>

Re: Cumulative totals in an ORDERed relation.

Posted by Zach Bailey <za...@dataclip.com>.
 Forgive me, but I got one thing slightly wrong. Since you want hourly totals rather than daily totals, you will want to change this line:

> allDataISODates = FOREACH allData GENERATE string, org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToDay(org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO(ts)) as isoHour;
> 
> 
> 
> 
to this:


allDataISODates = FOREACH allData GENERATE string, org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToHour(org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO(ts)) as isoHour; 


Of course I just illustrated how easy it is to swap in different piggybank functions to do different statistical roll-ups depending on what sort of temporal granularity you need. Huzzah!

Happy pigging,
Zach


On Friday, December 17, 2010 at 6:32 PM, Zach Bailey wrote:

> 
>  I believe what you're trying to do is this. You have some sort of data, and a timestamp:
> 
> 
> What you want to figure out is how many times each possible value of "data" appears in a certain time period (say, hourly).
> 
> 
> Let's say data can have three possible string values: {'a', 'b', 'c'}
> 
> 
> Your timestamp for convenience sake is a Unix UTC timestamp or ISO formatted date (I would strongly recommend using one of these since there are already piggybank functions to slice and dice them).
> 
> 
> To accumulate all the times that the data 'a' appeared in an hour you would do something like this:
> 
> 
> --register piggybank.jar for iso date functions
> REGISTER ./piggybank.jar
> allData = load ... as (string:chararray, ts:long);
> --convert ts to ISO Date, and truncate to the hour
> allDataISODates = FOREACH allData GENERATE string, org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToDay(org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO(ts)) as isoHour;
> -- group by hour and string
> groupedByStringAndHour = GROUP allDataISODates BY (string, isoHour);
> -- append counts
> stringHourCounts = FOREACH groupedByStringAndHour GENERATE group.string as string, group.isoHour as isoHour, COUNT(allDataISODates.string) as count;
> 
> 
> You will now have a relation that looks like:
> {'a', '2010-12-13T12:00:00', 2334}
> {'b', '2010-12-13T12:00:00', 123}
> {'c', '2010-12-13T12:00:00', 3}
> {'a', '2010-12-13T13:00:00', 34231}
> {'b', '2010-12-13T13:00:00', 34}
> {'c', '2010-12-13T13:00:00', 134}
> 
> 
> Is that the sort of thing you're looking to do?
> 
> -Zach
> 
> 
> On Friday, December 17, 2010 at 6:22 PM, Dmitriy Ryaboy wrote:
> 
> > What you are suggesting seems to be a fundamentally single-threaded process
> > (well, it can be parallelized, but it's not pretty and involves multiple
> > passes), so it's not a good fit for the map-reduce paradigm (how would you
> > do accumulative totals for 25 billion entries?). Pig tends to avoid
> > implementing methods that restrict scaling computations in this way. Your
> > idea of streaming through a script would work; you could also write an
> > accumulative UDF and use it on the result of doing a GROUP ALL on your
> > relation.
> > 
> > -Dmitriy
> > 
> > On Fri, Dec 17, 2010 at 11:31 AM, Kris Coward <kr...@melon.org> wrote:
> > 
> > 
> > >  Hello,
> > > 
> > >  Is there some sort of mechanism by which I could cause a value to
> > >  accumulate within a relation? What I'd like to do is something along the
> > >  lines of having a long called accumulator, and an outer bag called
> > >  hourlyTotals with a schema of (hour:int, collected:int)
> > > 
> > >  accumulator = 0L; -- I know this line doesn't work
> > >  ORDER hourlyTotals BY collected;
> > >  cumulativeTotals = FOREACH hourlyTotals {
> > >  accumulator += collected;
> > >  GENERATE day, accumulator AS collected;
> > >  }
> > > 
> > >  Could something like this be made to work? Is there something similar that
> > >  I can do instead? Do I just need to pipe the relation through an
> > >  external script to get what I want?
> > > 
> > >  Thanks,
> > >  Kris
> > > 
> > >  --
> > >  Kris Coward http://unripe.melon.org/
> > >  GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3
> > > 
> > > 
> > > 
> > 
> > 
> > 
> > 
> > 
> 
> 
> 
> 
> 
> 
> 



Re: Cumulative totals in an ORDERed relation.

Posted by Kris Coward <kr...@melon.org>.
Well, for the step you're describing (which I need to do as a preliminary
step before accumulating the hours), I just do something in the vein of

NewRel = GROUP OldRel BY timestamp/3600;
HourlyRel = FOREACH NewRel GENERATE group as hour, OldRel.something AS something,...;

(Noting that timestamp is stored as a long, so I get integer division
and the GROUP does what's wanted)
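[Editor's note: a sketch of that hour-bucketing step outside Pig — integer-dividing a Unix timestamp by 3600 collapses all timestamps within the same hour onto one group key; record layout is illustrative:]

```python
from collections import defaultdict

def bucket_by_hour(records):
    """Group (timestamp, value) pairs by hour since the epoch,
    using integer division the same way the Pig GROUP above does."""
    buckets = defaultdict(list)
    for ts, value in records:
        buckets[ts // 3600].append(value)
    return dict(buckets)

print(bucket_by_hour([(3600, 'a'), (3605, 'b'), (7200, 'c')]))
# {1: ['a', 'b'], 2: ['c']}
```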

Dmitriy was right both about what I was trying to do, and that it's an
inherently serial operation.

Thanks,
Kris

On Fri, Dec 17, 2010 at 06:32:38PM -0500, Zach Bailey wrote:
> 
>  I believe what you're trying to do is this. You have some sort of data, and a timestamp:
> 
> 
> What you want to figure out is how many times each possible value of "data" appears in a certain time period (say, hourly).
> 
> 
> Let's say data can have three possible string values: {'a', 'b', 'c'}
> 
> 
> Your timestamp for convenience sake is a Unix UTC timestamp or ISO formatted date (I would strongly recommend using one of these since there are already piggybank functions to slice and dice them).
> 
> 
> To accumulate all the times that the data 'a' appeared in an hour you would do something like this:
> 
> 
> --register piggybank.jar for iso date functions
> REGISTER ./piggybank.jar
> allData = load ... as (string:chararray, ts:long);
> --convert ts to ISO Date, and truncate to the hour
> allDataISODates = FOREACH allData GENERATE string, org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToDay(org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO(ts)) as isoHour;
> -- group by hour and string
> groupedByStringAndHour = GROUP allDataISODates BY (string, isoHour);
> -- append counts
> stringHourCounts = FOREACH groupedByStringAndHour GENERATE group.string as string, group.isoHour as isoHour, COUNT(allDataISODates.string) as count;
> 
> 
> You will now have a relation that looks like:
> {'a', '2010-12-13T12:00:00', 2334}
> {'b', '2010-12-13T12:00:00', 123}
> {'c', '2010-12-13T12:00:00', 3}
> {'a', '2010-12-13T13:00:00', 34231}
> {'b', '2010-12-13T13:00:00', 34}
> {'c', '2010-12-13T13:00:00', 134}
> 
> 
> Is that the sort of thing you're looking to do?
> 
> -Zach
> 
> 
> On Friday, December 17, 2010 at 6:22 PM, Dmitriy Ryaboy wrote:
> 
> > What you are suggesting seems to be a fundamentally single-threaded process
> > (well, it can be parallelized, but it's not pretty and involves multiple
> > passes), so it's not a good fit for the map-reduce paradigm (how would you
> > do accumulative totals for 25 billion entries?). Pig tends to avoid
> > implementing methods that restrict scaling computations in this way. Your
> > idea of streaming through a script would work; you could also write an
> > accumulative UDF and use it on the result of doing a GROUP ALL on your
> > relation.
> > 
> > -Dmitriy
> > 
> > On Fri, Dec 17, 2010 at 11:31 AM, Kris Coward <kr...@melon.org> wrote:
> > 
> > 
> > >  Hello,
> > > 
> > >  Is there some sort of mechanism by which I could cause a value to
> > >  accumulate within a relation? What I'd like to do is something along the
> > >  lines of having a long called accumulator, and an outer bag called
> > >  hourlyTotals with a schema of (hour:int, collected:int)
> > > 
> > >  accumulator = 0L; -- I know this line doesn't work
> > >  ORDER hourlyTotals BY collected;
> > >  cumulativeTotals = FOREACH hourlyTotals {
> > >  accumulator += collected;
> > >  GENERATE day, accumulator AS collected;
> > >  }
> > > 
> > >  Could something like this be made to work? Is there something similar that
> > >  I can do instead? Do I just need to pipe the relation through an
> > >  external script to get what I want?
> > > 
> > >  Thanks,
> > >  Kris
> > > 
> > >  --
> > >  Kris Coward http://unripe.melon.org/
> > >  GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3
> > > 
> > > 
> > > 
> > 
> > 
> > 
> > 
> 
> 
> 

-- 
Kris Coward					http://unripe.melon.org/
GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3

Re: Cumulative totals in an ORDERed relation.

Posted by Zach Bailey <za...@dataclip.com>.
 I believe what you're trying to do is this. You have some sort of data, and a timestamp:


What you want to figure out is how many times each possible value of "data" appears in a certain time period (say, hourly).


Let's say data can have three possible string values: {'a', 'b', 'c'}


Your timestamp, for convenience's sake, is a Unix UTC timestamp or an ISO-formatted date (I would strongly recommend using one of these, since there are already piggybank functions to slice and dice them).


To accumulate all the times that the data 'a' appeared in an hour you would do something like this:


--register piggybank.jar for iso date functions
REGISTER ./piggybank.jar
allData = load ... as (string:chararray, ts:long);
--convert ts to ISO Date, and truncate to the hour
allDataISODates = FOREACH allData GENERATE string, org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToDay(org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO(ts)) as isoHour;
-- group by hour and string
groupedByStringAndHour = GROUP allDataISODates BY (string, isoHour);
-- append counts
stringHourCounts = FOREACH groupedByStringAndHour GENERATE group.string as string, group.isoHour as isoHour, COUNT(allDataISODates.string) as count;


You will now have a relation that looks like:
{'a', '2010-12-13T12:00:00', 2334}
{'b', '2010-12-13T12:00:00', 123}
{'c', '2010-12-13T12:00:00', 3}
{'a', '2010-12-13T13:00:00', 34231}
{'b', '2010-12-13T13:00:00', 34}
{'c', '2010-12-13T13:00:00', 134}
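[Editor's note: the same group-and-count, sketched outside Pig for reference — a Counter keyed on (value, hour) pairs; names are illustrative:]

```python
from collections import Counter

def hourly_counts(rows):
    """Count occurrences of each (value, iso_hour) pair -- the Python
    analogue of GROUP BY (string, isoHour) followed by COUNT."""
    return Counter((value, hour) for value, hour in rows)

rows = [('a', '2010-12-13T12:00:00'),
        ('a', '2010-12-13T12:00:00'),
        ('b', '2010-12-13T12:00:00')]
print(hourly_counts(rows))
```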


Is that the sort of thing you're looking to do?

-Zach


On Friday, December 17, 2010 at 6:22 PM, Dmitriy Ryaboy wrote:

> What you are suggesting seems to be a fundamentally single-threaded process
> (well, it can be parallelized, but it's not pretty and involves multiple
> passes), so it's not a good fit for the map-reduce paradigm (how would you
> do accumulative totals for 25 billion entries?). Pig tends to avoid
> implementing methods that restrict scaling computations in this way. Your
> idea of streaming through a script would work; you could also write an
> accumulative UDF and use it on the result of doing a GROUP ALL on your
> relation.
> 
> -Dmitriy
> 
> On Fri, Dec 17, 2010 at 11:31 AM, Kris Coward <kr...@melon.org> wrote:
> 
> 
> >  Hello,
> > 
> >  Is there some sort of mechanism by which I could cause a value to
> >  accumulate within a relation? What I'd like to do is something along the
> >  lines of having a long called accumulator, and an outer bag called
> >  hourlyTotals with a schema of (hour:int, collected:int)
> > 
> >  accumulator = 0L; -- I know this line doesn't work
> >  ORDER hourlyTotals BY collected;
> >  cumulativeTotals = FOREACH hourlyTotals {
> >  accumulator += collected;
> >  GENERATE day, accumulator AS collected;
> >  }
> > 
> >  Could something like this be made to work? Is there something similar that
> >  I can do instead? Do I just need to pipe the relation through an
> >  external script to get what I want?
> > 
> >  Thanks,
> >  Kris
> > 
> >  --
> >  Kris Coward http://unripe.melon.org/
> >  GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3
> > 
> > 
> > 
> 
> 
> 
> 




Re: Cumulative totals in an ORDERed relation.

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
What you are suggesting seems to be a fundamentally single-threaded process
(well, it can be parallelized, but it's not pretty and involves multiple
passes), so it's not a good fit for the map-reduce paradigm (how would you
do accumulative totals for 25 billion entries?).  Pig tends to avoid
implementing methods that restrict scaling computations in this way. Your
idea of streaming through a script would work; you could also write an
accumulative UDF and use it on the result of doing a GROUP ALL on your
relation.
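[Editor's note: a sketch of the logic such an accumulative UDF would implement — after GROUP ALL, the whole relation arrives as a single bag, and the UDF walks it once emitting running totals; the (hour, collected) field layout is illustrative:]

```python
def cumulative(bag):
    """Model of an accumulative UDF: given the single bag produced by
    GROUP ALL as (hour, collected) tuples, sort by hour and emit
    (hour, running_total) tuples in one pass."""
    total = 0
    out = []
    for hour, collected in sorted(bag):
        total += collected
        out.append((hour, total))
    return out

print(cumulative([(0, 2), (1, 5), (2, 7)]))  # [(0, 2), (1, 7), (2, 14)]
```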

-Dmitriy

On Fri, Dec 17, 2010 at 11:31 AM, Kris Coward <kr...@melon.org> wrote:

> Hello,
>
> Is there some sort of mechanism by which I could cause a value to
> accumulate within a relation? What I'd like to do is something along the
> lines of having a long called accumulator, and an outer bag called
> hourlyTotals with a schema of (hour:int, collected:int)
>
> accumulator = 0L; -- I know this line doesn't work
> ORDER hourlyTotals BY collected;
> cumulativeTotals = FOREACH hourlyTotals {
>                        accumulator += collected;
>                        GENERATE day, accumulator AS collected;
>                        }
>
> Could something like this be made to work? Is there something similar that
> I can do instead? Do I just need to pipe the relation through an
> external script to get what I want?
>
> Thanks,
> Kris
>
> --
> Kris Coward                                     http://unripe.melon.org/
> GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3
>