Posted to user@pig.apache.org by Arian Pasquali <ar...@arianpasquali.com> on 2014/07/29 11:49:47 UTC

Counting elements for each group

Hi,

I'm having trouble with a simple task that I believe someone out there must
have already solved at some point.

I'm trying to group records and count the frequency of terms within each
group in Pig Latin, but I can't figure out how to do it.

I have a collection of objects with the following schema:

{cluster_id: bytearray,terms: chararray}

And here are some samples:

(10, smerter)
(10, graviditeten)
(10, smerter)
(10, smerter)
(10, udemærket)
(20, eis feuer)
(20, herunterladen schau)
(20, download gratis)
(20, download gratis)
(30, anschauen kinofilm)
(30, kauf rechnung)
(30, kauf rechnung)
(30, versandkostenfreie lieferung)
(30, kostenlose)
(30, kostenlose)
(30, kostenlose)

The result I'm trying to get is something like this:

(10, smerter, 3)
(10, graviditeten, 1)
(10, udemærket, 1)
(20, download gratis, 2)
(20, eis feuer, 1)
(20, herunterladen schau, 1)
(30, kostenlose, 3)
(30, kauf rechnung, 2)
(30, anschauen kinofilm, 1)
(30, versandkostenfreie lieferung, 1)

What would be the best way to do that? The following code groups by
cluster_id and counts all the terms in each cluster, but I want a separate
count for each distinct term within each cluster.

by_clusters = GROUP sample_data BY cluster_id;
by_clusters_terms_count = FOREACH by_clusters GENERATE
    group AS cluster_id, COUNT($1);

When I group like this, I end up with a relation with the following
schema:

by_clusters: {group: bytearray,sample_data: {(cluster_id:
bytearray,terms: chararray)}}

Now I need to actually count the terms inside the 'sample_data' bag. I'm
thinking about a nested FOREACH, but I still haven't figured out how to
apply it in this case. The code would be something like the following:

result = FOREACH by_clusters {

--count terms here, I don't know how

-- compiler gives me an error here
c = GROUP $1 BY terms; --
d = FOREACH c GENERATE COUNT(b), group;

GENERATE cluster_id, d;
}

Error I get:

ERROR 1200: Syntax error, unexpected symbol at or near '$1

I think I'm close, but I'm unable to solve it. I don't believe I should
have to write a UDF for this.


Arian

Re: Counting elements for each group

Posted by Arian Pasquali <ar...@arianpasquali.com>.
Thanks, Gianmarco!
Here is the final version:

by_clusters = GROUP sample_data BY (cluster_id, terms);
by_clusters_terms_count = FOREACH by_clusters GENERATE
    FLATTEN(group) AS (cluster_id, terms), COUNT($1);
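
To get the rows in the order shown in the original question (by cluster,
most frequent term first), an ORDER step can follow the count. This is only
a sketch: the file name 'sample_data.tsv', the tab delimiter, and the
aliases term_count and sorted_counts are assumptions for illustration, not
part of the original script.

```pig
-- Assumption: sample_data.tsv is a hypothetical tab-separated input file
-- with one (cluster_id, terms) pair per line.
sample_data = LOAD 'sample_data.tsv' USING PigStorage('\t')
              AS (cluster_id: bytearray, terms: chararray);

by_clusters = GROUP sample_data BY (cluster_id, terms);
by_clusters_terms_count = FOREACH by_clusters GENERATE
                              FLATTEN(group) AS (cluster_id, terms),
                              COUNT(sample_data) AS term_count;

-- Sort by cluster, then by descending frequency within each cluster.
sorted_counts = ORDER by_clusters_terms_count
                BY cluster_id ASC, term_count DESC;

DUMP sorted_counts;
```

Terms with equal counts inside a cluster come out in no guaranteed order;
adding terms as a third sort key would make the output deterministic.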

cheers

Arian Pasquali
http://about.me/arianpasquali


2014-07-29 13:23 GMT+01:00 Gianmarco De Francisci Morales <gd...@apache.org>:

> Try this:
>
> by_clusters = GROUP sample_data by (cluster_id, terms);
> by_clusters_terms_count = FOREACH by_clusters GENERATE FLATTEN(group),
> COUNT(sample_data)
> as count;
>
> Cheers,
>
> --
> Gianmarco
>
>

Re: Counting elements for each group

Posted by Gianmarco De Francisci Morales <gd...@apache.org>.
Try this:

by_clusters = GROUP sample_data BY (cluster_id, terms);
by_clusters_terms_count = FOREACH by_clusters GENERATE
    FLATTEN(group), COUNT(sample_data) AS count;
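
The nested version fails because a nested FOREACH block accepts only a
small set of operators (CROSS, DISTINCT, FILTER, FOREACH, LIMIT and
ORDER BY), and GROUP is not one of them. Grouping on the composite key at
the top level avoids the problem. A minimal end-to-end sketch, assuming a
hypothetical tab-separated input file 'sample_data.tsv' and the made-up
aliases by_cluster_term, term_counts and term_count:

```pig
-- Assumption: sample_data.tsv is a hypothetical tab-separated file with
-- one (cluster_id, terms) pair per line.
sample_data = LOAD 'sample_data.tsv' USING PigStorage('\t')
              AS (cluster_id: bytearray, terms: chararray);

-- Group on the composite key so that each bag holds the rows for exactly
-- one (cluster_id, terms) pair.
by_cluster_term = GROUP sample_data BY (cluster_id, terms);

-- FLATTEN turns the group tuple back into two top-level fields;
-- COUNT returns the number of tuples in each bag.
term_counts = FOREACH by_cluster_term GENERATE
                  FLATTEN(group) AS (cluster_id, terms),
                  COUNT(sample_data) AS term_count;

DUMP term_counts;
```

Note that COUNT skips tuples whose first field is null; COUNT_STAR counts
every tuple in the bag if that matters for your data.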

Cheers,

--
Gianmarco

