Posted to user@pig.apache.org by Eli Finkelshteyn <ie...@gmail.com> on 2011/09/14 22:27:58 UTC

Pig Conditionals (Do I have to use UDFs)?

Hi,
I'd like to generate based on exclusive conditions (something like the 
CASE statement in SQL). An example:

Say I have data that looks like:

(a, 1)
(a, 2)
(b, 2)
(c, 1)
(d, 3)
(d, 4)

And I want to just convert each of the numbers to their written forms to 
get:

(a, one)
(a, two)
(b, two)
(c, one)
(d, three)
(d, four)

Would I need to write a UDF for that, or is there some simple way to do 
it using cases? I know I can nest a bunch of bincond (ternary) expressions 
inside one GENERATE to achieve this, like:

FOREACH rel GENERATE $0, (($1 == 1) ? 'one' : (($1 == 2) ? 'two' : (($1 == 3) ? 'three' : 'four')));

but that seems too messy. I'd appreciate any advice.

Thanks!
Eli

Re: Pig Conditionals (Do I have to use UDFs)?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

There's a fair bit of overhead there.

UDFs are OK and normal in Pig. Everything is done with them. Don't be afraid
of UDFs :).

There's some pain with the compile cycle (edit code in Java, test, compile,
jar, register...). That's where inline Python UDFs come in handy!
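
As a rough illustration of the inline Python route, here is a minimal sketch
of a Jython UDF for the number-to-word example from the original question.
The file name words.py, the function name number_to_word, and the 'unknown'
fallback are made up for this example. outputSchema is the decorator Pig
makes available to Jython UDFs; the no-op fallback is an assumption added
only so the file also runs under plain Python for testing.

```python
# words.py: hypothetical file name, used only in this sketch.

try:
    outputSchema  # supplied by Pig when the script is registered with Jython
except NameError:
    # Plain-Python fallback so the function can be exercised outside Pig.
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator

# Lookup table for the values that appear in the example data.
WORDS = {1: 'one', 2: 'two', 3: 'three', 4: 'four'}

@outputSchema('name:chararray')
def number_to_word(number):
    """Return the written form of number, or None for a null input."""
    if number is None:
        return None
    # 'unknown' is an arbitrary choice for values missing from the table.
    return WORDS.get(number, 'unknown')
```

In a Pig script this would be registered with something like
REGISTER 'words.py' USING jython AS words; and then called as
words.number_to_word($1) inside a FOREACH ... GENERATE. There is no
compile or jar step; editing the .py file is enough.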

D

On Wed, Sep 14, 2011 at 2:53 PM, Eli Finkelshteyn <el...@tumblr.com> wrote:

> Ah, neat! That would do the trick. Seems like a lot of extra steps, but
> I'll take it if that's how it's done in PIG. Thanks!

Re: Pig Conditionals (Do I have to use UDFs)?

Posted by Eli Finkelshteyn <el...@tumblr.com>.

Ah, neat! That would do the trick. Seems like a lot of extra steps, but 
I'll take it if that's how it's done in Pig. Thanks!

On 9/14/11 5:51 PM, Ryan Hoegg wrote:
> What about trying something with SPLIT and UNION:
>
> SPLIT EXAMPLE_SOURCE INTO GOOD IF number>5, BETTER IF (number>=2 AND
> number<=4), BEST IF (number>=5);
>
> I did a few FOREACH and a UNION, and got this:
> (a,6,best)
> (b,5,best)
> (d,8,best)
> (a,6,good)
> (d,8,good)
> (a,2,better)
> (b,2,better)
> (c,3,better)
> (d,3,better)
> (d,4,better)
>
> --
> Ryan Hoegg

Re: Pig Conditionals (Do I have to use UDFs)?

Posted by Ryan Hoegg <ry...@gmail.com>.

What about trying something with SPLIT and UNION:

SPLIT EXAMPLE_SOURCE INTO GOOD IF number>5, BETTER IF (number>=2 AND
number<=4), BEST IF (number>=5);

I added a label to each relation with a FOREACH, combined them with a UNION,
and got this:
(a,6,best)
(b,5,best)
(d,8,best)
(a,6,good)
(d,8,good)
(a,2,better)
(b,2,better)
(c,3,better)
(d,3,better)
(d,4,better)
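
The duplicated (a,6) and (d,8) rows follow from the overlapping GOOD and BEST
conditions: SPLIT places a tuple in every relation whose condition it
satisfies. The sketch below is a plain-Python model of that behaviour, not
Pig code. The input rows are inferred from the output above (the original
input was not shown), and the names rows, conditions and labelled are made
up for this example.

```python
# Input rows inferred from the output above; the real input file was not shown.
rows = [('a', 6), ('b', 5), ('d', 8), ('a', 2),
        ('b', 2), ('c', 3), ('d', 3), ('d', 4)]

# The same predicates as the SPLIT statement, paired with the label
# that the later FOREACH attaches to each relation.
conditions = [
    ('best', lambda n: n >= 5),
    ('good', lambda n: n > 5),
    ('better', lambda n: 2 <= n <= 4),
]

# Model of SPLIT + FOREACH + UNION: a row is emitted once for every
# condition it satisfies, so overlapping conditions produce duplicates.
labelled = [(item, n, label)
            for label, matches in conditions
            for item, n in rows
            if matches(n)]

for row in labelled:
    print(row)
```

Making the conditions mutually exclusive (for example number > 5 and
number == 5) would give one label per row, which is what a CASE-style
mapping needs. Note also that a row matching none of the conditions, such
as a number of 1, ends up in no relation at all.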

--
Ryan Hoegg

On Wed, Sep 14, 2011 at 4:24 PM, Eli Finkelshteyn <ie...@gmail.com>wrote:

> Sorry, bad example, I guess. I want something I can do case statements
> with. In this case I could map instead, but if I wanted to use less
> straight-forward cases (i.e. one case where number == 1, another where
> number between 2 and 4, another where number greater than 5, etc...), it
> would be much more difficult to do with mapping.
>
> Again, I know this is something I can do with udfs, but it seemed like
> something light enough to be built into PIG itself, so I was hoping there
> was a way to do it without needing to write a udf every time I have a new
> transformation to make.
>
> Eli

Re: Pig Conditionals (Do I have to use UDFs)?

Posted by Eli Finkelshteyn <ie...@gmail.com>.

Sorry, bad example, I guess. I want something I can do case statements 
with. In this case I could map instead, but if I wanted to use less 
straightforward cases (e.g. one case where number == 1, another where 
number is between 2 and 4, another where number is greater than 5, etc.), 
it would be much more difficult to do with mapping.

Again, I know this is something I can do with UDFs, but it seemed like 
something light enough to be built into Pig itself, so I was hoping 
there was a way to do it without needing to write a UDF every time I 
have a new transformation to make.
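
For a sense of scale, the range-based cases described above fit in a few
lines as a Jython UDF. This is only a sketch: the function name bucket and
the labels 'single', 'few', 'many' and 'other' are invented here, and the
no-op outputSchema fallback is an assumption added so the snippet also runs
under plain Python.

```python
try:
    outputSchema  # supplied by Pig when the script is registered with Jython
except NameError:
    # Plain-Python fallback so the function can be exercised outside Pig.
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator

@outputSchema('bucket:chararray')
def bucket(number):
    """Classify number using the three example cases from the message."""
    if number is None:
        return None
    if number == 1:
        return 'single'
    if 2 <= number <= 4:
        return 'few'
    if number > 5:
        return 'many'
    # The example cases leave gaps (5, and anything below 1), so a
    # catch-all label is needed; 'other' is an arbitrary choice.
    return 'other'
```

Each branch returns as soon as it matches, so the cases are exclusive in
the same way as a SQL CASE expression.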

Eli

On 9/14/11 5:07 PM, Ryan Hoegg wrote:
> What about putting the mappings into their own relation?  I tried this with
> 0.9.0:
>
> example.txt:
> a,1
> a,2
> b,2
> c,1
> d,3
> d,4
>
> mapping.txt:
> 1,one
> 2,two
> 3,three
> 4,four
>
> MAPPINGS = LOAD 'mapping.txt' USING PigStorage(',') AS
> (number:int,name:chararray);
> EXAMPLE_SOURCE = LOAD 'example.txt' USING PigStorage(',') AS
> (item:chararray,number:int);
> MAPPED = JOIN EXAMPLE_SOURCE BY number LEFT OUTER, MAPPINGS BY number;
> PRETTY = FOREACH MAPPED GENERATE item, name;
> DUMP PRETTY;
> (a,one)
> (c,one)
> (a,two)
> (b,two)
> (d,three)
> (d,four)
>
> --
> Ryan Hoegg

Re: Pig Conditionals (Do I have to use UDFs)?

Posted by Ryan Hoegg <ry...@gmail.com>.

What about putting the mappings into their own relation? I tried this with
Pig 0.9.0:

example.txt:
a,1
a,2
b,2
c,1
d,3
d,4

mapping.txt:
1,one
2,two
3,three
4,four

MAPPINGS = LOAD 'mapping.txt' USING PigStorage(',') AS
(number:int,name:chararray);
EXAMPLE_SOURCE = LOAD 'example.txt' USING PigStorage(',') AS
(item:chararray,number:int);
MAPPED = JOIN EXAMPLE_SOURCE BY number LEFT OUTER, MAPPINGS BY number;
PRETTY = FOREACH MAPPED GENERATE item, name;
DUMP PRETTY;
(a,one)
(c,one)
(a,two)
(b,two)
(d,three)
(d,four)

--
Ryan Hoegg

On Wed, Sep 14, 2011 at 3:27 PM, Eli Finkelshteyn <ie...@gmail.com>wrote:

> Hi,
> I'd like to generate based on exclusive conditions (something like the CASE
> statement in SQL). An example:
>
> Say I have data that looks like:
>
> (a, 1)
> (a, 2)
> (b, 2)
> (c, 1)
> (d, 3)
> (d, 4)
>
> And I want to just convert each of the numbers to their written forms to
> get:
>
> (a, one)
> (a, two)
> (b, two)
> (c, one)
> (d, three)
> (d, four)
>
> Would I need to write a udf for that, or is there some simple way to do it
> using cases? I know I can do a bunch of bidirectional generates one on top
> of the other to achieve this, like:
>
> FOREACH rel GENERATE $0, (($1==1) ? 'one' : (($1 == 2) ? 'two' : (($1 == 3)
> ? 'three' : 'four')));
>
> but that seems too messy. I'd appreciate any advice.
>
> Thanks!
> Eli

Re: Pig Conditionals (Do I have to use UDFs)?

Posted by "Clay B." <cw...@clayb.net>.

I have done mappings in the past using joins and mapping files too.

For example, generate a file of mappings, load it as a relation, then join 
against it. It's a rather heavyweight solution, though.

-Clay

On Wed, 14 Sep 2011, Eli Finkelshteyn wrote:

> Hi,
> I'd like to generate based on exclusive conditions (something like the CASE 
> statement in SQL). An example:
>
> Say I have data that looks like:
>
> (a, 1)
> (a, 2)
> (b, 2)
> (c, 1)
> (d, 3)
> (d, 4)
>
> And I want to just convert each of the numbers to their written forms to get:
>
> (a, one)
> (a, two)
> (b, two)
> (c, one)
> (d, three)
> (d, four)
>
> Would I need to write a udf for that, or is there some simple way to do it 
> using cases? I know I can do a bunch of bidirectional generates one on top of 
> the other to achieve this, like:
>
> FOREACH rel GENERATE $0, (($1==1) ? 'one' : (($1 == 2) ? 'two' : (($1 == 3) ? 
> 'three' : 'four')));
>
> but that seems too messy. I'd appreciate any advice.
>
> Thanks!
> Eli