Posted to user@pig.apache.org by Chengi Liu <ch...@gmail.com> on 2014/05/14 22:25:04 UTC
Frequency count in pig
Hi,
My data is in format:
user_id,movie_id,timestamp
123, abc,unix_timestamp
123, def, ...
123, abc, ...
234, sda, ...
Now I want to compute, in Pig, the number of times each movie is played (per user).
So the output I am expecting is:
123,abc,2
123,def,1
234,sda,1
and so on.
How do I do this in Pig?
Re: Frequency count in pig
Posted by Serega Sheypak <se...@gmail.com>.
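Sample pseudocode.
The idea is to group tuples by movie_id and count the size of each group's bag.
-- The input is comma-separated, so name the delimiter explicitly (the default is tab).
-- movie_id holds values like 'abc', so it is a chararray, not a long.
movieAlias = LOAD 'path/to/movie/files' USING PigStorage(',') AS (
    user_id:long, movie_id:chararray, timestamp:long);
groupedByMovie = GROUP movieAlias BY movie_id;
-- After GROUP, the bag of matching tuples is named after the grouped alias.
counted = FOREACH groupedByMovie GENERATE group AS movie_id,
    COUNT(movieAlias) AS cnt;
projected = FOREACH counted GENERATE movie_id, cnt;
STORE projected INTO 'output/path';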
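To see why counting the bag works, it can help to look at the schema that GROUP produces. The following is only a sketch: the path is a placeholder, the alias names (plays, by_movie, counts) are made up for illustration, and it assumes comma-delimited input as in the question.

```pig
-- Hypothetical path; PigStorage(',') assumes comma-delimited input as in the question.
plays = LOAD 'path/to/movie/files' USING PigStorage(',')
        AS (user_id:long, movie_id:chararray, timestamp:long);
by_movie = GROUP plays BY movie_id;
-- Prints the schema: a `group` field (the movie_id key) plus a bag named `plays`
-- that holds every input tuple sharing that key.
DESCRIBE by_movie;
-- COUNT over that bag gives the number of plays per movie.
counts = FOREACH by_movie GENERATE group AS movie_id, COUNT(plays) AS cnt;
DUMP counts;
```

DESCRIBE and DUMP are handy in the Grunt shell while experimenting; for a real job, replace DUMP with STORE.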
RE: Frequency count in pig
Posted by Steve Bernstein <St...@deem.com>.
Really easy, fundamental actually.
a = GROUP your_data BY (user_id, movie_id);
b = FOREACH a GENERATE
    FLATTEN(group) AS (user_id, movie_id),
    COUNT($1) AS cnt
;
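Spelled out end to end, grouping on both keys might look like the sketch below. The paths and alias names (raw, plays, by_user_movie, counts) are placeholders, and it assumes the input is comma-delimited with a space after each comma, as in the sample rows, which is why the fields are loaded as chararray and trimmed before grouping.

```pig
-- Hypothetical paths; PigStorage(',') assumes comma-delimited input.
raw = LOAD 'path/to/movie/files' USING PigStorage(',')
      AS (user_id:chararray, movie_id:chararray, timestamp:chararray);
-- The sample rows have a space after each comma, so trim the fields first.
plays = FOREACH raw GENERATE TRIM(user_id) AS user_id, TRIM(movie_id) AS movie_id;
by_user_movie = GROUP plays BY (user_id, movie_id);
-- `group` is a (user_id, movie_id) tuple; FLATTEN turns it back into two fields.
counts = FOREACH by_user_movie GENERATE
    FLATTEN(group) AS (user_id, movie_id),
    COUNT(plays) AS cnt;
STORE counts INTO 'output/path' USING PigStorage(',');
```

For the sample rows in the question this should give one line per (user_id, movie_id) pair with its play count, in no particular order; add an ORDER BY before STORE if the order matters.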
Re: Frequency count in pig
Posted by Darpan R <da...@gmail.com>.
Group by movie_id, then count the tuples in each bag. Simple.
Re: Frequency count in pig
Posted by Shengjun Xin <sx...@gopivotal.com>.
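Something like the following:
movie = LOAD '$input' USING PigStorage(',') AS (user_id:int, movie_id:chararray, timestamp:int);
-- Group on both keys; after grouping by user_id alone, movie_id is only available inside the bag.
movie_group = GROUP movie BY (user_id, movie_id);
movie_count = FOREACH movie_group GENERATE FLATTEN(group) AS (user_id, movie_id),
    COUNT($1) AS MovieCount;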
--
Regards
Shengjun