Posted to user@pig.apache.org by Hadoop Learner <ha...@gmail.com> on 2012/09/26 03:55:25 UTC

Distinct Count

Hello,

I need help finding a distinct count and would appreciate any
pointers.

Here's my data file:

id , dept, budget

1, Marketing, 9000
2, Marketing, 1000
3, Finance, 9000
4, Sales, 2000


I am trying to get the number of unique departments in the company,
so I expect 3, since there are 3 departments.

Here's my PIG program:


deptInfo = load 'dept.txt'  using PigStorage(',') as (id, dept, budget );

-- get a distinct count of departments

groupedByDept = group  deptInfo by dept;

uniqcnt  = foreach groupedByDept  {
           dept      = deptInfo.dept;
           uniq_dept  = distinct dept ;
           generate group, COUNT(uniq_dept);

           }

dump uniqcnt;


What this gives me is this:

( Sales,1)
( Finance,1)
( Marketing,1)


What I want is : 3.

How could I get just the raw count of departments instead of a listing
of each department?

Thanks!

Re: Distinct Count

Posted by Cheolsoo Park <ch...@cloudera.com>.
Hi,

You can do the following:

groupedByDept = group deptInfo by dept;
groupedByAll = group groupedByDept all;
uniqcnt = foreach groupedByAll generate COUNT(groupedByDept);

The "group groupedByDept all" collects every row of "groupedByDept" into a
single bag. Since "groupedByDept" has one row per department, counting the
elements in this bag returns the number of unique departments.
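
As a sketch of how this plays out on the sample data, assuming dept.txt
holds only the four data rows (no header line) and reusing the load
statement from the question, the whole script could look like this. The
rows shown in the comments are illustrative, and the order of tuples is
not guaranteed:

deptInfo = load 'dept.txt' using PigStorage(',') as (id, dept, budget);

-- one row per department, each carrying a bag of that department's records, roughly:
--   ( Marketing,{(1, Marketing, 9000),(2, Marketing, 1000)})
--   ( Finance,{(3, Finance, 9000)})
--   ( Sales,{(4, Sales, 2000)})
groupedByDept = group deptInfo by dept;

-- a single row whose bag holds the three rows above
groupedByAll = group groupedByDept all;

-- count the rows in that bag
uniqcnt = foreach groupedByAll generate COUNT(groupedByDept);

dump uniqcnt;

For this data the dump should print (3).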

Please be aware that group all may take a long time if your data is big, since
it forces every mapper to send its output to a single reducer.
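
If the single reducer becomes a bottleneck, one possible variation (a
sketch, not benchmarked here) is to project just the dept column and apply
distinct before the group all, so that only one small tuple per department
reaches the final reducer instead of each department's full bag of records.
The relation names depts, uniqDepts and groupedAll are made up for
illustration:

-- keep only the column being counted
depts = foreach deptInfo generate dept;
-- one tuple per distinct department
uniqDepts = distinct depts;
-- gather the distinct tuples into a single bag and count them
groupedAll = group uniqDepts all;
uniqcnt = foreach groupedAll generate COUNT(uniqDepts);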

Thanks,
Cheolsoo
