Posted to user@pig.apache.org by Eric Yang <er...@gmail.com> on 2010/12/31 07:12:00 UTC

How to calculate delta in a column?

Hi,

What is the most efficient way to calculate the delta between consecutive rows for each column?  Consider this input:

(key1, 1, 2, 3)
(key1, 2, 4, 5)
(key2, 1, 2, 4)
(key1, 3, 6, 9)
(key2, 2, 4, 6)

The expected output of the transformation looks like this:

(key1, 1, 2, 2)
(key1, 1, 2, 4)
(key2, 1, 2, 2)

The idea is to group by f0 and, for each remaining column, compute the
current value minus the previous value, e.g. f1 (current value) - f1
(previous value).  How can I write this in Pig?

If a value underflows, it should reset to 0.  For example:

(key1, 1, 2, 3)
(key1, 0, 0, 0)
(key1, 2, 3, 4)

The output should be:

(key1, 0, 0, 0)
(key1, 2, 3, 4)

I haven't been able to find a solution on Google.  Can anyone help?

regards,
Eric

Re: How to calculate delta in a column?

Posted by Eric Yang <er...@gmail.com>.
You are right; in my example there should be a timestamp column.
Thanks, I will look into writing the UDF.

regards,
Eric

On Fri, Dec 31, 2010 at 1:16 AM, Dmitriy Ryaboy <dv...@gmail.com> wrote:
> You can't do this without a way of ordering the data for the same key.
>
> If you do have a way to do this (a timestamp or some such), you can group by
> key, order each group inside the foreach, and then run the ordered bag
> through a UDF (you can even make this UDF accumulative).
>
> -- 'ordinal' stands for whatever column orders the rows, e.g. a timestamp
> grouped = group data by key;
> deltas = foreach grouped {
>    -- inside the foreach, the bag is named after the input relation ('data')
>    ordered_tuples = order data by ordinal;
>    generate group, FLATTEN(calculateDeltas(ordered_tuples));
> }
>
>
> -D

Re: How to calculate delta in a column?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
You can't do this without a way of ordering the data for the same key.

If you do have a way to do this (a timestamp or some such), you can group by
key, order each group inside the foreach, and then run the ordered bag
through a UDF (you can even make this UDF accumulative).

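-- 'ordinal' stands for whatever column orders the rows, e.g. a timestamp
grouped = group data by key;
deltas = foreach grouped {
    -- inside the foreach, the bag is named after the input relation ('data')
    ordered_tuples = order data by ordinal;
    generate group, FLATTEN(calculateDeltas(ordered_tuples));
}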
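
The script above leaves calculateDeltas undefined. As a rough sketch of the logic such a UDF might implement, here is a plain Python function (not an actual Pig UDF). The name calculate_deltas, the input shape (the ordered rows of one key with the key and ordering column already stripped), and the reading of "reset to 0" as "clamp a negative delta to 0" are all assumptions made for illustration.

```python
def calculate_deltas(ordered_rows):
    """Return per-column deltas between consecutive rows of ONE key.

    ordered_rows: list of equal-length numeric tuples, already sorted
    by the ordering column (assumption: that column has been dropped).
    A negative delta is clamped to 0 (assumed meaning of "reset to 0").
    """
    deltas = []
    # Pair each row with its successor: (row0, row1), (row1, row2), ...
    for prev, curr in zip(ordered_rows, ordered_rows[1:]):
        deltas.append(tuple(max(c - p, 0) for p, c in zip(prev, curr)))
    return deltas


# Rows for key1 from the question, in their intended order.
print(calculate_deltas([(1, 2, 3), (2, 4, 5), (3, 6, 9)]))
# [(1, 2, 2), (1, 2, 4)]

# Underflow example from the question.
print(calculate_deltas([(1, 2, 3), (0, 0, 0), (2, 3, 4)]))
# [(0, 0, 0), (2, 3, 4)]
```

The first row of each key produces no output because it has no predecessor, which matches the expected output in the question.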


-D
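
On the "accumulative" remark: Pig's Java Accumulator interface lets a UDF receive the tuples of a group in batches (accumulate), return the result (getValue), and reset between groups (cleanup), so the whole bag need not be held in memory at once. The Python class below only mimics that shape to show the state that has to be carried across batches. The class name, the batch representation, and the clamp-to-0 rule are assumptions for illustration, and it presumes the batches arrive in order.

```python
class DeltaAccumulator:
    """Mimics the accumulate/getValue/cleanup shape of a Pig Accumulator."""

    def __init__(self):
        self.previous = None  # last row seen for the current key
        self.deltas = []      # deltas produced so far for the current key

    def accumulate(self, batch):
        # batch: a list of equal-length numeric tuples, in order.
        for row in batch:
            if self.previous is not None:
                # Clamp negative deltas to 0 (assumed "reset to 0" rule).
                self.deltas.append(
                    tuple(max(c - p, 0) for p, c in zip(self.previous, row))
                )
            self.previous = row

    def get_value(self):
        return self.deltas

    def cleanup(self):
        # Reset state before the next key is processed.
        self.previous = None
        self.deltas = []


acc = DeltaAccumulator()
acc.accumulate([(1, 2, 3), (2, 4, 5)])  # first batch for key1
acc.accumulate([(3, 6, 9)])             # second batch for key1
print(acc.get_value())                  # [(1, 2, 2), (1, 2, 4)]
acc.cleanup()
```

Only the previous row is kept between batches, so a delta that spans a batch boundary is still computed correctly.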

