Posted to user@pig.apache.org by pranjal rajput <fi...@gmail.com> on 2013/03/15 18:03:39 UTC

Aggregation for chronologically ordered dataset

Hi,
I am new to Pig.
I have a dataset from a time-tracker application.
It records the time that users spend on various activities.
For example:
UserId | Activity    | Tool  | BeginTime | EndTime | DurationMinute
1      | development | tool1 | 10:00     | 10:15   | 15
1      | development | tool2 | 10:15     | 10:30   | 15
1      | other       | tool3 | 10:30     | 11:00   | 30
1      | development | tool1 | 11:00     | 11:20   | 20
1      | other       | tool4 | 11:20     | 12:00   | 40
1      | development | tool1 | 12:00     | 12:15   | 15
2      | other       | tool3 | 10:00     | 11:00   | 60
2      | development | tool1 | 11:00     | 11:20   | 20
2      | development | tool2 | 11:20     | 11:30   | 10

I wish to find the uninterrupted time slots spent on
Activity=development, like this:

UserId | Activity    | SumDurationMinutes
1      | development | 30  /* note that two slots are summed */
1      | other       | 30
1      | development | 20
1      | other       | 40
1      | development | 15
2      | other       | 60
2      | development | 30  /* again a sum */

How can this be done in Pig?
I am open to writing a UDF for this, or to any other workaround.
Thanks in anticipation,

-- 
Best Regards
Pranjal Rajput

Re: Aggregation for chronologically ordered dataset

Posted by Vitalii Tymchyshyn <ti...@gmail.com>.
I'd use the rank function to join each row with its previous and next rows,
then filter out the middle rows, then join the first row of each run to the
last and calculate the time.
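
A minimal sketch of that rank-based idea, written in plain Python rather than Pig Latin so the logic can be run on its own. The names ROWS, minutes_between and collapse_by_rank are hypothetical, the rank is simply the row's position after sorting by user and begin time, and the sketch assumes the slots in a run are back to back (the time is taken from the first BeginTime to the last EndTime, so any gap inside a run would be counted too).

```python
from datetime import datetime

# Sample rows from the original post (hypothetical in-memory layout):
# (user_id, activity, tool, begin_time, end_time, duration_minutes)
ROWS = [
    (1, "development", "tool1", "10:00", "10:15", 15),
    (1, "development", "tool2", "10:15", "10:30", 15),
    (1, "other", "tool3", "10:30", "11:00", 30),
    (1, "development", "tool1", "11:00", "11:20", 20),
    (1, "other", "tool4", "11:20", "12:00", 40),
    (1, "development", "tool1", "12:00", "12:15", 15),
    (2, "other", "tool3", "10:00", "11:00", 60),
    (2, "development", "tool1", "11:00", "11:20", 20),
    (2, "development", "tool2", "11:20", "11:30", 10),
]


def minutes_between(begin, end):
    """Minutes from begin to end, both 'HH:MM' strings on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(begin, fmt)
    return int(delta.total_seconds() // 60)


def collapse_by_rank(rows):
    # Step 1: rank the rows by user and begin time.
    ranked = list(enumerate(sorted(rows, key=lambda r: (r[0], r[3]))))
    by_rank = dict(ranked)

    firsts, lasts = [], []
    for rank, row in ranked:
        key = (row[0], row[1])  # (user_id, activity)
        # Step 2: "join" each row with its neighbours by rank.
        prev_row = by_rank.get(rank - 1)
        next_row = by_rank.get(rank + 1)
        # Rows whose neighbours on both sides match are middle rows
        # and land in neither list.
        if prev_row is None or (prev_row[0], prev_row[1]) != key:
            firsts.append(row)
        if next_row is None or (next_row[0], next_row[1]) != key:
            lasts.append(row)

    # Step 3: runs do not overlap, so the i-th first row pairs with
    # the i-th last row; the time is last EndTime minus first BeginTime.
    return [(f[0], f[1], minutes_between(f[3], l[4]))
            for f, l in zip(firsts, lasts)]


for user_id, activity, minutes in collapse_by_rank(ROWS):
    print(user_id, activity, minutes)
```

On the sample data this yields the seven rows of the desired output table, with the two back-to-back development slots of each user merged into 30 minutes.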

Re: Aggregation for chronologically ordered dataset

Posted by pranjal rajput <fi...@gmail.com>.
Hello everyone,
It's like a local SUM operation.

Any pointers or hints would be much appreciated.
Let me know if any additional info is required.
Thanks,
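
To make the "local SUM" idea concrete, here is a small sketch in plain Python, not Pig Latin. The names ROWS and local_sum are hypothetical, and the rows are assumed to fit in memory as tuples. It relies on itertools.groupby, which only merges adjacent items with equal keys, so a later development run for the same user starts a new group instead of being added to the earlier one.

```python
from itertools import groupby

# Sample rows from the original post (hypothetical in-memory layout):
# (user_id, activity, tool, begin_time, end_time, duration_minutes)
ROWS = [
    (1, "development", "tool1", "10:00", "10:15", 15),
    (1, "development", "tool2", "10:15", "10:30", 15),
    (1, "other", "tool3", "10:30", "11:00", 30),
    (1, "development", "tool1", "11:00", "11:20", 20),
    (1, "other", "tool4", "11:20", "12:00", 40),
    (1, "development", "tool1", "12:00", "12:15", 15),
    (2, "other", "tool3", "10:00", "11:00", 60),
    (2, "development", "tool1", "11:00", "11:20", 20),
    (2, "development", "tool2", "11:20", "11:30", 10),
]


def local_sum(rows):
    """Sum DurationMinute over each uninterrupted (user, activity) run."""
    # Put the rows in chronological order per user first; zero-padded
    # 'HH:MM' strings sort correctly as plain text.
    ordered = sorted(rows, key=lambda r: (r[0], r[3]))
    result = []
    # groupby starts a new group whenever the (user, activity) key
    # changes from one row to the next.
    for (user_id, activity), run in groupby(ordered, key=lambda r: (r[0], r[1])):
        result.append((user_id, activity, sum(r[5] for r in run)))
    return result


for user_id, activity, total in local_sum(ROWS):
    print(user_id, activity, total)
```

Unlike a plain GROUP BY on (UserId, Activity), which would add all development rows of a user together, this keeps each run separate. The same loop could in principle be moved into a UDF that receives one user's time-ordered rows, but that wiring is an assumption and is not shown here.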


-- 
Best Regards
Pranjal Rajput
+91-81090-71747