Posted to dev@pig.apache.org by "Goel, Ankur" <an...@corp.aol.com> on 2009/01/08 10:32:36 UTC

Top-K for nested fields

Hi Folks,

I have a case where I need to do top-K on nested fields in my tuples.
For example, consider the following tuples (format is [url, query]):

(abc.com, A)
(abc.com, A)
(abc.com, C)
(abc.com, B)
(xyz.com, D)
(xyz.com, D)
(xyz.com, E)

I need to be able to group by URL and output the top-K queries along
with their counts for each URL. So the output for abc.com would be

abc.com A 2
abc.com B 1
abc.com C 1

In my understanding we would do something like

url = GROUP tuples BY url;
result = FOREACH url GENERATE group, top(10, tuples.query);

Is there a UDF to do this? If not, I can write one and possibly
contribute it.

Is there any other way of doing it?

Thanks
-Ankur


Re: Top-K for nested fields

Posted by Alan Gates <ga...@yahoo-inc.com>.
I would encourage you to open a JIRA. If people disagree with putting
LIMIT in the nested FOREACH, they can make their arguments against it
there. In general, our desire is to make Pig Latin fully nestable (so
any keyword could be used inside a FOREACH). Adding this feature should
be very simple, as LIMIT is easy to implement, so if someone wanted to
take this on it should not be much work. I don't have time to implement
and test it, but I'm happy to provide guidance on the necessary changes
to anyone interested.

Alan.

On Jan 12, 2009, at 11:52 PM, Goel, Ankur wrote:

> Rad,
> The Pig types branch does support LIMIT, but not for nested structures
> inside FOREACH, so as a workaround I implemented a top() UDF. But I
> think it makes sense to have LIMIT support for nested structures as
> well. We can open a JIRA for this if people agree.
>
> Thanks
> -Ankur


RE: Top-K for nested fields

Posted by "Goel, Ankur" <an...@corp.aol.com>.
Rad,
The Pig types branch does support LIMIT, but not for nested structures
inside FOREACH, so as a workaround I implemented a top() UDF. But I
think it makes sense to have LIMIT support for nested structures as
well. We can open a JIRA for this if people agree.

Thanks
-Ankur

-----Original Message-----
From: rad gara [mailto:radgara@gmail.com] 
Sent: Monday, January 12, 2009 5:55 PM
To: pig-user@hadoop.apache.org
Subject: Re: Top-K for nested fields

Ankur, concerning your code below, a TakeFirst(bag, count) UDF could be
implemented, so the desired line would be
topK = TakeFirst(ordered, 10);

But I guess the performance of a nested FOREACH statement may not be
very good when processing large bags (right?). It seems that LIMIT
support in Pig is necessary for limiting large relations.


Re: Top-K for nested fields

Posted by rad gara <ra...@gmail.com>.
Ankur, concerning your code below, a TakeFirst(bag, count) UDF could be
implemented, so the desired line would be
topK = TakeFirst(ordered, 10);

But I guess the performance of a nested FOREACH statement may not be
very good when processing large bags (right?). It seems that LIMIT
support in Pig is necessary for limiting large relations.
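
As a rough illustration of the TakeFirst(bag, count) idea, here is a
minimal plain-Java sketch. It is not a Pig EvalFunc: the class name
TakeFirstSketch and the use of a generic Iterable in place of a Pig
DataBag are assumptions made only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a TakeFirst UDF; an Iterable models the bag.
class TakeFirstSketch {
    // Returns at most `count` leading elements of `bag`, in iteration order.
    // Iteration stops as soon as enough elements have been collected.
    static <T> List<T> takeFirst(Iterable<T> bag, int count) {
        List<T> out = new ArrayList<>();
        Iterator<T> it = bag.iterator();
        while (out.size() < count && it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        // The input is assumed to be already ordered by clicks, descending.
        List<String> ordered = Arrays.asList("A", "B", "C");
        System.out.println(takeFirst(ordered, 2));
    }
}
```

The sketch only reads as many elements as it returns, but the bag it is
handed still has to be built and ordered first, which is the cost the
paragraph above is worried about.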

2009/1/12 Goel, Ankur <an...@corp.aol.com>:
> Hi Ted,
> Thanks for the response. What you suggested will still need a UDF
> (top) that will be case-specific. I was wondering whether there is a
> way to generalize it so that people can do top-K on nested results.
>
> Better yet would be if Pig itself supported it by allowing LIMIT inside
> FOREACH. To give a better idea of what I am talking about, here is a
> sample script:
>
> data = LOAD 'myfile' AS (url, query);
> grouped = GROUP data BY (url, query);
> groupCount = FOREACH grouped GENERATE FLATTEN(group), COUNT(data) AS clicks;
> grouped_by_url = GROUP groupCount BY url;
> results = FOREACH grouped_by_url {
>     ordered = ORDER groupCount BY clicks DESC;
>     topK = LIMIT ordered 10; -- This is not supported but I wish it were :-)
>     GENERATE FLATTEN(topK);
> };
> STORE results INTO 'mydir' USING PigStorage();
>
> Do you think it makes sense for Pig to support it? If not, do we
> resort to a generic top() UDF?
>
> Thanks
> -Ankur

RE: Top-K for nested fields

Posted by "Goel, Ankur" <an...@corp.aol.com>.
Hi Ted,
Thanks for the response. What you suggested will still need a UDF
(top) that will be case-specific. I was wondering whether there is a
way to generalize it so that people can do top-K on nested results.

Better yet would be if Pig itself supported it by allowing LIMIT inside
FOREACH. To give a better idea of what I am talking about, here is a
sample script:

data = LOAD 'myfile' AS (url, query);
grouped = GROUP data BY (url, query);
groupCount = FOREACH grouped GENERATE FLATTEN(group), COUNT(data) AS clicks;
grouped_by_url = GROUP groupCount BY url;
results = FOREACH grouped_by_url {
    ordered = ORDER groupCount BY clicks DESC;
    topK = LIMIT ordered 10; -- This is not supported but I wish it were :-)
    GENERATE FLATTEN(topK);
};
STORE results INTO 'mydir' USING PigStorage();

Do you think it makes sense for Pig to support it? If not, do we
resort to a generic top() UDF?

Thanks
-Ankur

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Saturday, January 10, 2009 12:12 AM
To: pig-user@hadoop.apache.org
Subject: Re: Top-K for nested fields

I think you could turn that inside out and do the counting first by
grouping on both fields and then do the top-n by grouping on field1.  I
would cautiously expect that to be a bit faster.


Re: Top-K for nested fields

Posted by Ted Dunning <te...@gmail.com>.
I think you could turn that inside out and do the counting first by grouping
on both fields and then do the top-n by grouping on field1.  I would
cautiously expect that to be a bit faster.

On Fri, Jan 9, 2009 at 4:11 AM, Goel, Ankur <an...@corp.aol.com> wrote:

> Let me try to rephrase my question.
> I have a set of tuples of the form (field1, field2). I need to group by
> 'field1' and then sub-group by 'field2' and output the top-k instances
> of field2 for each field1. What's the right way of doing that in Pig?
>
> What I did was group my tuples by 'field1' and pass the DataBag to
> my UDF, top(), which just counts the occurrences of each tuple and
> outputs the top-K.
> This worked, but it didn't look like the most efficient solution.
>
> Can anyone suggest something different?
>
> Thanks
> -Ankur


-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)
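
To make the "inside out" ordering concrete, here is a minimal plain-Java
sketch of the two stages: count on the (field1, field2) pair first, then
group the counted rows on field1 and keep the top n. The class
CountThenTopN, the Row type, and the tie-break on field2 are assumptions
made for illustration; none of this is Pig API, and it says nothing
about how Pig would actually execute the equivalent script.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of "count first, then top-n per field1".
class CountThenTopN {
    // One counted row: (field1, field2, count).
    static final class Row {
        final String field1;
        final String field2;
        final long count;

        Row(String field1, String field2, long count) {
            this.field1 = field1;
            this.field2 = field2;
            this.count = count;
        }
    }

    static Map<String, List<Row>> topN(List<String[]> records, int n) {
        // Stage 1: group on both fields and count each (field1, field2) pair.
        Map<List<String>, Long> pairCounts = new LinkedHashMap<>();
        for (String[] r : records) {
            pairCounts.merge(Arrays.asList(r[0], r[1]), 1L, Long::sum);
        }

        // Stage 2: group the counted rows on field1 alone.
        Map<String, List<Row>> byField1 = new TreeMap<>();
        for (Map.Entry<List<String>, Long> e : pairCounts.entrySet()) {
            String f1 = e.getKey().get(0);
            String f2 = e.getKey().get(1);
            byField1.computeIfAbsent(f1, key -> new ArrayList<>())
                    .add(new Row(f1, f2, e.getValue()));
        }

        // Within each group, order by count descending (ties by field2)
        // and keep at most n rows.
        for (Map.Entry<String, List<Row>> g : byField1.entrySet()) {
            List<Row> rows = g.getValue();
            rows.sort((a, b) -> {
                int byCount = Long.compare(b.count, a.count);
                return byCount != 0 ? byCount : a.field2.compareTo(b.field2);
            });
            int keep = Math.max(0, Math.min(n, rows.size()));
            g.setValue(new ArrayList<>(rows.subList(0, keep)));
        }
        return byField1;
    }

    public static void main(String[] args) {
        // Sample tuples from the original post.
        List<String[]> records = Arrays.asList(
                new String[] {"abc.com", "A"}, new String[] {"abc.com", "A"},
                new String[] {"abc.com", "C"}, new String[] {"abc.com", "B"},
                new String[] {"xyz.com", "D"}, new String[] {"xyz.com", "D"},
                new String[] {"xyz.com", "E"});
        for (List<Row> rows : topN(records, 10).values()) {
            for (Row row : rows) {
                System.out.println(row.field1 + " " + row.field2 + " " + row.count);
            }
        }
    }
}
```

In this model the second stage only ever sees one row per distinct
(field1, field2) pair rather than one row per input tuple, which is the
intuition behind expecting the reordered version to be somewhat faster.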

RE: Top-K for nested fields

Posted by "Goel, Ankur" <an...@corp.aol.com>.
Let me try to rephrase my question.
I have a set of tuples of the form (field1, field2). I need to group by
'field1' and then sub-group by 'field2' and output the top-k instances
of field2 for each field1. What's the right way of doing that in Pig?

What I did was group my tuples by 'field1' and pass the DataBag to
my UDF, top(), which just counts the occurrences of each tuple and
outputs the top-K.
This worked, but it didn't look like the most efficient solution.

Can anyone suggest something different?

Thanks
-Ankur
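
For readers who want to see the counting step spelled out, here is a
minimal plain-Java sketch of what a top() function of this kind might do
for the bag of a single group. The actual UDF is not shown in this
thread, so the class name BagTopK, the signature, and the tie-break by
value are all assumptions, and a List<String> stands in for the Pig
DataBag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a per-group top() step; not a Pig EvalFunc.
class BagTopK {
    // Counts how often each value occurs in one group's bag and returns
    // the k most frequent (value, count) entries, highest count first.
    // Ties are broken by value so the result is deterministic.
    static List<Map.Entry<String, Long>> top(int k, List<String> bag) {
        Map<String, Long> counts = new HashMap<>();
        for (String value : bag) {
            counts.merge(value, 1L, Long::sum);
        }
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        int keep = Math.max(0, Math.min(k, entries.size()));
        return new ArrayList<>(entries.subList(0, keep));
    }

    public static void main(String[] args) {
        // The queries grouped under abc.com in the original post.
        List<String> abc = Arrays.asList("A", "A", "C", "B");
        for (Map.Entry<String, Long> e : top(10, abc)) {
            System.out.println("abc.com " + e.getKey() + " " + e.getValue());
        }
    }
}
```

Note that this approach holds a count for every distinct value of the
group in memory and sorts all of them, which may be part of why it does
not feel like the most efficient solution for large bags.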
