You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@pig.apache.org by Eugene Morozov <em...@griddynamics.com> on 2013/03/06 16:20:41 UTC

Pivot table with zeros instead of empty

Hello!

I have a script that gives me following result:

time_grouped = GROUP joined BY (ip, hour);
counts = FOREACH time_grouped GENERATE group.ip as ip, group.hour as hour,
COUNT(joined) as count;

(128.187.97.22, 0, 180)
(128.187.97.22, 1, 84)
(128.187.97.22, 2, 25)
(128.187.97.22, 22, 31)
(128.187.97.22, 23, 2)

That is IP address, hour of day, counter.
I'd like to get following:

(128.187.97.22, (m1,m2, m3, ..., m23))
m1-m23 corresponds to the counter. And if there is nothing for particular
hour, then I'd like to have 0 instead empty value.

The trick here is that if I do not have anything for particular hour, then
I won't have count for it.
Is there a way to achieve the goal?

Thanks in advance
-- 
Evgeny Morozov
Developer Grid Dynamics
Skype: morozov.evgeny
www.griddynamics.com
emorozov@griddynamics.com

Re: Pivot table with zeros instead of empty

Posted by Eugene Morozov <em...@griddynamics.com>.

I would like to share what I've been trying to do to achieve my goal with
only PIG (no custom UDF - just to please my curiosity =)).
Please tell me if that's wrong direction.

So, I've got two files
exp-tuple1:
0 1
1 1
2 2
3 3
4 4
6 6
7 7
8 8
9 9

exp-tuple2
0 0
1 0
2 0
3 0
4 0
5 0     <--- please, note, there is not 5 in exp-tuple1
6 0
7 0
8 0
9 0

The goal is to get something like (1, 1, 2, 3, 4, 0, 6, 7, 8, 9) from
exp-tuple1, but with zero for 5 instead of empty value.

a = load 'exp-tuple1' as (hour:int, count:int);
b = load 'exp-tuple2' as (hour:int, count:int);

joined = join b by hour left, a by hour;
c = foreach joined generate b::hour, b::count + (a::count is null ? 0 :
a::count) as count;
grouped = group c all;    <-- instead of all I have real value
d = foreach grouped generate group, $1.count;

dump d; -- (all,{(1),(1),(2),(3),(4),(0),(6),(7),(8),(9)})
I don't care about all these braces, it's good to have such a result.

Instead of "all" there would real value. So, it seems it's possible with
having hack of tuple with zeros for all 24 hours.


But, the question is would it be performant enough? Or it's still better to
write my own UDF?


On Fri, Mar 8, 2013 at 12:55 PM, Eugene Morozov
<em...@griddynamics.com>wrote:

> Prashant,
>
> thank you for the advice. That's a good one!
>
> On Thu, Mar 7, 2013 at 12:04 PM, Prashant Kommireddi <pr...@gmail.com>wrote:
>
>> You can group on ip which will result in ip -> bag mappings. Let's say you
>> loaded (ip, hour, count) dataset as follows
>>
>> grunt> a = load 'data' using PigStorage(',') as (ip:chararray, hour:int,
>> count:int);
>> grunt> b = group a by
>> ip;
>> grunt> describe b;
>> b: {group: chararray,a: {(ip: chararray,hour: int,count: int)}}
>>
>> The group here is ip and the resulting bag contains (ip, hour, count)
>> tuples.
>> grunt> dump b;
>>
>> (128.187.97.22,{(128.187.97.22,0,180),(128.187.97.22,1,84),(128.187.97.22,2,25),(128.187.97.22,22,31),(128.187.97.22,23,2)})
>>
>> You can write a very simple UDF that returns a Tuple based on the contents
>> of this bag that would have what you need. The UDF would traverse through
>> the tuples in the bag and create a map of hour->count. Your result tuple
>> should be sorted by hour and only return the counts in an order
>> (m0...m23).
>>
>> -Prashant
>>
>> On Wed, Mar 6, 2013 at 7:20 AM, Eugene Morozov <emorozov@griddynamics.com
>> >wrote:
>>
>> > Hello!
>> >
>> > I have a script that gives me following result:
>> >
>> > time_grouped = GROUP joined BY (ip, hour);
>> > counts = FOREACH time_grouped GENERATE group.ip as ip, group.hour as
>> hour,
>> > COUNT(joined) as count;
>> >
>> > (128.187.97.22, 0, 180)
>> > (128.187.97.22, 1, 84)
>> > (128.187.97.22, 2, 25)
>> > (128.187.97.22, 22, 31)
>> > (128.187.97.22, 23, 2)
>> >
>> > That is IP address, hour of day, counter.
>> > I'd like to get following:
>> >
>> > (128.187.97.22, (m1,m2, m3, ..., m23))
>> > m1-m23 corresponds to the counter. And if there is nothing for
>> particular
>> > hour, then I'd like to have 0 instead empty value.
>> >
>> > The trick here is that if I do not have anything for particular hour,
>> then
>> > I won't have count for it.
>> > Is there a way to achieve the goal?
>> >
>> > Thanks in advance
>> > --
>> > Evgeny Morozov
>> > Developer Grid Dynamics
>> > Skype: morozov.evgeny
>> > www.griddynamics.com
>> > emorozov@griddynamics.com
>> >
>>
>
>
>
> --
> Evgeny Morozov
> Developer Grid Dynamics
> Skype: morozov.evgeny
> www.griddynamics.com
> emorozov@griddynamics.com
>



-- 
Evgeny Morozov
Developer Grid Dynamics
Skype: morozov.evgeny
www.griddynamics.com
emorozov@griddynamics.com

Re: Pivot table with zeros instead of empty

Posted by Eugene Morozov <em...@griddynamics.com>.

Prashant,

thank you for the advice. That's a good one!

On Thu, Mar 7, 2013 at 12:04 PM, Prashant Kommireddi <pr...@gmail.com>wrote:

> You can group on ip which will result in ip -> bag mappings. Let's say you
> loaded (ip, hour, count) dataset as follows
>
> grunt> a = load 'data' using PigStorage(',') as (ip:chararray, hour:int,
> count:int);
> grunt> b = group a by
> ip;
> grunt> describe b;
> b: {group: chararray,a: {(ip: chararray,hour: int,count: int)}}
>
> The group here is ip and the resulting bag contains (ip, hour, count)
> tuples.
> grunt> dump b;
>
> (128.187.97.22,{(128.187.97.22,0,180),(128.187.97.22,1,84),(128.187.97.22,2,25),(128.187.97.22,22,31),(128.187.97.22,23,2)})
>
> You can write a very simple UDF that returns a Tuple based on the contents
> of this bag that would have what you need. The UDF would traverse through
> the tuples in the bag and create a map of hour->count. Your result tuple
> should be sorted by hour and only return the counts in an order (m0...m23).
>
> -Prashant
>
> On Wed, Mar 6, 2013 at 7:20 AM, Eugene Morozov <emorozov@griddynamics.com
> >wrote:
>
> > Hello!
> >
> > I have a script that gives me following result:
> >
> > time_grouped = GROUP joined BY (ip, hour);
> > counts = FOREACH time_grouped GENERATE group.ip as ip, group.hour as
> hour,
> > COUNT(joined) as count;
> >
> > (128.187.97.22, 0, 180)
> > (128.187.97.22, 1, 84)
> > (128.187.97.22, 2, 25)
> > (128.187.97.22, 22, 31)
> > (128.187.97.22, 23, 2)
> >
> > That is IP address, hour of day, counter.
> > I'd like to get following:
> >
> > (128.187.97.22, (m1,m2, m3, ..., m23))
> > m1-m23 corresponds to the counter. And if there is nothing for particular
> > hour, then I'd like to have 0 instead empty value.
> >
> > The trick here is that if I do not have anything for particular hour,
> then
> > I won't have count for it.
> > Is there a way to achieve the goal?
> >
> > Thanks in advance
> > --
> > Evgeny Morozov
> > Developer Grid Dynamics
> > Skype: morozov.evgeny
> > www.griddynamics.com
> > emorozov@griddynamics.com
> >
>



-- 
Evgeny Morozov
Developer Grid Dynamics
Skype: morozov.evgeny
www.griddynamics.com
emorozov@griddynamics.com

Re: Pivot table with zeros instead of empty

Posted by Prashant Kommireddi <pr...@gmail.com>.

You can group on ip which will result in ip -> bag mappings. Let's say you
loaded (ip, hour, count) dataset as follows

grunt> a = load 'data' using PigStorage(',') as (ip:chararray, hour:int,
count:int);
grunt> b = group a by
ip;
grunt> describe b;
b: {group: chararray,a: {(ip: chararray,hour: int,count: int)}}

The group here is ip and the resulting bag contains (ip, hour, count)
tuples.
grunt> dump b;
(128.187.97.22,{(128.187.97.22,0,180),(128.187.97.22,1,84),(128.187.97.22,2,25),(128.187.97.22,22,31),(128.187.97.22,23,2)})

You can write a very simple UDF that returns a Tuple based on the contents
of this bag that would have what you need. The UDF would traverse through
the tuples in the bag and create a map of hour->count. Your result tuple
should be sorted by hour and only return the counts in an order (m0...m23).

-Prashant

On Wed, Mar 6, 2013 at 7:20 AM, Eugene Morozov <em...@griddynamics.com>wrote:

> Hello!
>
> I have a script that gives me following result:
>
> time_grouped = GROUP joined BY (ip, hour);
> counts = FOREACH time_grouped GENERATE group.ip as ip, group.hour as hour,
> COUNT(joined) as count;
>
> (128.187.97.22, 0, 180)
> (128.187.97.22, 1, 84)
> (128.187.97.22, 2, 25)
> (128.187.97.22, 22, 31)
> (128.187.97.22, 23, 2)
>
> That is IP address, hour of day, counter.
> I'd like to get following:
>
> (128.187.97.22, (m1,m2, m3, ..., m23))
> m1-m23 corresponds to the counter. And if there is nothing for particular
> hour, then I'd like to have 0 instead empty value.
>
> The trick here is that if I do not have anything for particular hour, then
> I won't have count for it.
> Is there a way to achieve the goal?
>
> Thanks in advance
> --
> Evgeny Morozov
> Developer Grid Dynamics
> Skype: morozov.evgeny
> www.griddynamics.com
> emorozov@griddynamics.com
>