Posted to user@pig.apache.org by Arun A K <ar...@gmail.com> on 2011/04/27 04:07:47 UTC
Can I pass an entire relation to a Pig UDF?
Hi
I have the following input relation:
Name Score
Jack 25
Jimmy 30
Sam 20
Hick 35
Tampa 22
My goal is to rank the tuples by score.
Pig script:
sample_data = LOAD 'sample.txt' USING PigStorage() AS (name:chararray, score:int);
sample_data_group = GROUP sample_data BY score;
sample_data_count = FOREACH sample_data_group GENERATE group AS score, COUNT(sample_data.name) AS countVal;
sample_data_order = ORDER sample_data_count BY score DESC;
sample_data_group_all = GROUP sample_data_order ALL;
sample_data_project = FOREACH sample_data_group_all GENERATE FLATTEN(myUDF.Rank(sample_data_order));
DUMP sample_data_project;
Can someone please point me to a UDF example that reads in a relation and
iterates over all of its tuples? I plan to iterate over the tuples and assign a
rank to each of them based on the score value.
Is there any other way to generate rank?
Thanks much.
Arun
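A minimal sketch of the iteration asked about above, written in plain Python rather than against Pig's UDF API. It assumes the grouped relation reaches the function as a single bag, modelled here as a list of (name, score) tuples. The function name rank_bag is hypothetical, and sorting inside the function is an assumption made so that the sketch does not depend on the order of tuples in the bag.

```python
def rank_bag(bag):
    """Return (name, score, rank) tuples, highest score first."""
    # Sort by the score field, descending, then number the tuples from 1.
    ordered = sorted(bag, key=lambda t: t[1], reverse=True)
    return [(name, score, rank)
            for rank, (name, score) in enumerate(ordered, start=1)]


# The sample relation from the question.
sample = [("Jack", 25), ("Jimmy", 30), ("Sam", 20), ("Hick", 35), ("Tampa", 22)]
ranked = rank_bag(sample)
for row in ranked:
    print(*row)
```

The whole bag has to be held by one call for this to work, which is why the script groups with ALL before invoking the function.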
Re: Can I pass an entire relation to a Pig UDF?
Posted by Dexin Wang <wa...@gmail.com>.
If the whole set is not that big, sorting it in the shell might be the easiest option. I've done that with result sets of millions of records.
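A sketch of that kind of post-processing step, shown in Python rather than shell. It assumes the Pig output has been copied to the local machine as tab-separated name and score fields, one record per line. The function name add_rank and the sample lines are hypothetical.

```python
def add_rank(lines):
    """Sort 'name<TAB>score' lines by score, descending, and append a rank field."""
    # Skip blank lines and split each record into its fields.
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    rows.sort(key=lambda r: int(r[1]), reverse=True)
    return ["\t".join(r + [str(i)]) for i, r in enumerate(rows, start=1)]


# Stand-in for the lines of a file written by Pig.
pig_output = ["Jack\t25\n", "Jimmy\t30\n", "Sam\t20\n", "Hick\t35\n", "Tampa\t22\n"]
ranked_lines = add_rank(pig_output)
print("\n".join(ranked_lines))
```

This keeps everything in memory on one machine, so it only suits result sets that fit there.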
On Apr 26, 2011, at 8:49 PM, Arun A K <ar...@gmail.com> wrote:
> Thanks Jacob.
>
> I wonder if it is possible to get the rank of each record or say row number
> using Pig. Or do I need to have an external driver like a shell script which
> augments the sorted output from Pig with a rank?
>
> Thanks
> Arun
Re: Can I pass an entire relation to a Pig UDF?
Posted by Arun A K <ar...@gmail.com>.
Thanks Jacob.
I wonder if it is possible to get the rank of each record or say row number
using Pig. Or do I need to have an external driver like a shell script which
augments the sorted output from Pig with a rank?
Thanks
Arun
On Tue, Apr 26, 2011 at 7:54 PM, Jacob Perkins <ja...@gmail.com>wrote:
> What you've indicated does require access to the whole relation at once
> or at least a way of incrementing a counter and assigning its value to
> each tuple. This kind of shared/synchronized state isn't possible with
> Pig at the moment as far as I know.
>
> --jacob
> @thedatachef
Re: Can I pass an entire relation to a Pig UDF?
Posted by Jacob Perkins <ja...@gmail.com>.
What you've indicated does require access to the whole relation at once
or at least a way of incrementing a counter and assigning its value to
each tuple. This kind of shared/synchronized state isn't possible with
Pig at the moment as far as I know.
--jacob
@thedatachef
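To illustrate the limitation, here is a small simulation in plain Python, not Pig code. It assumes the sorted relation is split across two independent tasks, each with a private counter. The split and the function name number_locally are made up for the example.

```python
def number_locally(partition):
    """Assign 1-based positions using a counter private to one task."""
    return [(name, score, i)
            for i, (name, score) in enumerate(partition, start=1)]


# Hypothetical split of the sorted relation across two tasks.
task_a = [("Hick", 35), ("Jimmy", 30), ("Jack", 25)]
task_b = [("Tampa", 22), ("Sam", 20)]

# Each counter restarts at 1, so the positions collide across tasks.
local = number_locally(task_a) + number_locally(task_b)
print(local)

# A global rank needs each task to know how many tuples precede its slice.
offset_b = len(task_a)
global_ranks = number_locally(task_a) + [
    (name, score, i + offset_b) for name, score, i in number_locally(task_b)
]
print(global_ranks)
```

The offset is the piece of shared information that a per-tuple function has no way to see on its own.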
On Tue, 2011-04-26 at 19:43 -0700, Arun A K wrote:
> Thanks Jacob for the response.
>
> If I run the UDF on each tuple then how can I preserve the state of the rank
> variable. I mean the UDF won't be able to save the rank value between calls,
> right? Correct me if I am wrong in interpreting that the UDF would be
> invoked for each tuple.
>
> What I am looking in my output is an additional column indicating the rank.
> Something like
>
> Hick 35 1
> Jimmy 30 2
> Jack 25 3
> Tampa 22 4
> Sam 20 5
>
> Thanks.
>
> Arun
Re: Can I pass an entire relation to a Pig UDF?
Posted by Arun A K <ar...@gmail.com>.
Thanks Jacob for the response.
If I run the UDF on each tuple, how can I preserve the state of the rank
variable? I mean the UDF won't be able to save the rank value between calls,
right? Correct me if I am wrong in interpreting that the UDF would be
invoked for each tuple.
What I am looking for in my output is an additional column indicating the rank,
something like:
Hick 35 1
Jimmy 30 2
Jack 25 3
Tampa 22 4
Sam 20 5
Thanks.
Arun
On Tue, Apr 26, 2011 at 7:18 PM, Jacob Perkins <ja...@gmail.com>wrote:
> The question is, do you need the entire relation all at once to assign a
> rank? If so then map-reduce may not be the answer. If not, why not just
> run the UDF on each tuple of the relation, one at a time, with a
> projection?
>
> If you need some global information, such as the max and min score, then
> you might look at the MAX and MIN operations. They do require a GROUP
> ALL but are algebraic so it's not actually going to bring all the data
> to one machine as it otherwise would.
>
> --jacob
> @thedatachef
Re: Can I pass an entire relation to a Pig UDF?
Posted by Jacob Perkins <ja...@gmail.com>.
The question is, do you need the entire relation all at once to assign a
rank? If so then map-reduce may not be the answer. If not, why not just
run the UDF on each tuple of the relation, one at a time, with a
projection?
If you need some global information, such as the max and min score, then
you might look at the MAX and MIN operations. They do require a GROUP
ALL but are algebraic so it's not actually going to bring all the data
to one machine as it otherwise would.
--jacob
@thedatachef
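A sketch of what "algebraic" means in this context, in plain Python rather than Pig. Each task computes a partial result over its own slice, and the partial results combine into the final answer, so only one small value per task has to travel. The partition layout below is an assumption for illustration.

```python
# Scores as seen by three hypothetical tasks.
partitions = [
    [25, 30],
    [20, 35],
    [22],
]

# Step 1: each task reduces its own slice to a single value.
partial_max = [max(p) for p in partitions]
partial_min = [min(p) for p in partitions]

# Step 2: the partial values are combined into the global result.
overall_max = max(partial_max)
overall_min = min(partial_min)
print(overall_max, overall_min)
```

Ranking does not decompose this way, because a tuple's rank depends on every other tuple rather than on a single combined value.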
On Tue, 2011-04-26 at 19:07 -0700, Arun A K wrote:
> Hi
>
> I have the following input relation:
> Name Score
> Jack 25
> Jimmy 30
> Sam 20
> Hick 35
> Tampa 22
>
> My goal is to rank the tuples by score.
>
> Pig script:
>
> sample_data = LOAD 'sample.txt' USING PigStorage() AS (name:chararray,
> score:int);
> sample_data_group = GROUP sample_data BY score;
> sample_data_count = FOREACH sample_data_group GENERATE group AS score,
> COUNT(sample_data.name) AS countVal;
> sample_data_order = ORDER sample_data_count BY score DESC;
> sample_data_group_all = GROUP sample_data_order all;
> sample_data_project = FOREACH sample_data_group_all GENERATE
> FLATTEN(myUDF.Rank(sample_data_order));
> dump sample_data_project;
>
> Can someone please point me to a UDF example where a relation is read in and
> iterated over all its tuples? I plan to iterate over the tuples and assign a
> rank to each of them based on the score value.
>
> Is there any other way to generate rank?
>
> Thanks much.
>
> Arun