You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@pig.apache.org by Austin Chungath <au...@gmail.com> on 2012/02/20 07:20:10 UTC

Pig script to convert Categorical variables

Consider this scenario:

I have a column named City and it takes 3 possible values: A,B,C

City
A
B
C
A
C
C

I want to convert it into

A             B            C
1              0            0
0              1            0
0              0            1
1              0            0
0              0            1
0              0            1

I am trying to write a pig script that will take two parameters, one
parameter is the data and then the column name, in this case 'City'. The
script should then identify distinct values that it will take and then
create that many columns and populate it with 1 or 0 depending on which one
is true.
Please let me know if you have got any ideas on how to approach this
problem.

Thanks,
Austin

Re: Pig script to convert Categorical variables

Posted by Austin Chungath <au...@gmail.com>.

Thanks Prashant.
I am looking into embedding Pig in Java and UDFs


On Mon, Feb 20, 2012 at 5:26 PM, Prashant Kommireddi <pr...@gmail.com>wrote:

> This should work if the values are only A,B,C.
>
> M = load 'input' as (city:chararray);
>
> N = foreach M generate city == 'A' ? 1 : 0 as A, city == 'B' ? 1 : 0 as B,
> city == 'C' ? 1 : 0 as C;
>
> However, if city values vary it might be a good option to do it by
> embedding Pig in Java.
> http://pig.apache.org/docs/r0.9.1/cont.html#embed-java
>
> Thanks,
> Prashant
>
> On Mon, Feb 20, 2012 at 3:16 AM, Austin Chungath <au...@gmail.com>
> wrote:
>
> > Consider this scenario:
> >
> > I have a column named City and it takes 3 possible values: A,B,C
> >
> > City
> > A
> > B
> > C
> > A
> > C
> > C
> >
> > I want to convert it into
> >
> > A             B            C
> > 1              0            0
> > 0              1            0
> > 0              0            1
> > 1              0            0
> > 0              0            1
> > 0              0            1
> >
> > I am trying to write a pig script that will take two parameters, one
> > parameter is the data and then the column name, in this case 'City'. The
> > script should then identify distinct values that it will take and then
> > create that many columns and populate it with 1 or 0 depending on which
> one
> > is true.
> > Please let me know if you have got any ideas on how to approach this
> > problem.
> >
> > Thanks,
> > Austin
> >
>

Re: Pig script to convert Categorical variables

Posted by Eli Finkelshteyn <ie...@gmail.com>.

No problem. Returning a variable schema sounds pretty cool and like 
something that should be doable, but I'm not really sure how to go about 
it. Maybe someone else knows?

Eli

On 2/21/12 1:27 AM, Austin Chungath wrote:
> Thanks Eli,
> That helps and it was exactly what I was doing. I wrote the UDF and it is
> working.
> I wrote a UDF that takes two parameters, first parameter was a bag of
> tuples containing distinct values (ordered ascending)  and the second
> parameter is the original data set. It is working but now I am trying
> to figure out how I can return a schema for the columns created with the
> names of the distinct values.
>
> City
> A
> B
> C
> A
> C
> C
>
> I want to convert it into
>
> A             B            C
> 1              0            0
> 0              1            0
> 0              0            1
> 1              0            0
> 0              0            1
> 0              0            1
> how can the UDF return a schema containing the names of the cities? is it
> possible?
> I should be able to generate A rather than generate $0.
> Thanks,
> Austin
>
> On Tue, Feb 21, 2012 at 10:23 AM, Eli Finkelshteyn<ie...@gmail.com>wrote:
>
>> Interesting problem. What I'm thinking is why not do two steps. First,
>> read in the data, group on the column you care about. Then generate on it
>> so you get just the distinct values for that column left. This would be
>> something like:
>>
>> CITIES_GROUPED=  GROUP  INITIALBY  city;
>> CITIES=  FOREACHCITIES_GROUPED GENERATE group AS city;
>>
>>
>> Once you have that, convert it to a tuple, and then just write a quick udf
>> that goes through the ORIGINAL data set and takes in the row value for the
>> column you care about along with the distinct values tuple you just created
>> as parameters and returns a tuple of 0s and one 1 where the one is in the
>> position in the distinct values tuple that matches the row value for that
>> row for the column you care about. You could write that udf in Java,
>> Python, or one of the other supported udf languages, depending on your
>> requirements.
>>
>> For inputting, you could do it either through a simple bash script (your
>> use case is simple enough, I think), or you could go ahead and embed the
>> PIG script in Java, Python, or one of the other languages that's supported
>> for that functionality, so it's easy to expand if you later need to. I'm
>> personally partial to Python and have had great results embedding in that.
>> Just make sure you're on Pig 9.1+.
>>
>> Hopefully that helps,
>> Eli
>>
>>
>> On 2/20/12 6:56 AM, Prashant Kommireddi wrote:
>>
>>> This should work if the values are only A,B,C.
>>>
>>> M = load 'input' as (city:chararray);
>>>
>>> N = foreach M generate city == 'A' ? 1 : 0 as A, city == 'B' ? 1 : 0 as B,
>>> city == 'C' ? 1 : 0 as C;
>>>
>>> However, if city values vary it might be a good option to do it by
>>> embedding Pig in Java.
>>> http://pig.apache.org/docs/r0.**9.1/cont.html#embed-java<http://pig.apache.org/docs/r0.9.1/cont.html#embed-java>
>>>
>>> Thanks,
>>> Prashant
>>>
>>> On Mon, Feb 20, 2012 at 3:16 AM, Austin Chungath<au...@gmail.com>
>>>   wrote:
>>>
>>> Consider this scenario:
>>>> I have a column named City and it takes 3 possible values: A,B,C
>>>>
>>>> City
>>>> A
>>>> B
>>>> C
>>>> A
>>>> C
>>>> C
>>>>
>>>> I want to convert it into
>>>>
>>>> A             B            C
>>>> 1              0            0
>>>> 0              1            0
>>>> 0              0            1
>>>> 1              0            0
>>>> 0              0            1
>>>> 0              0            1
>>>>
>>>> I am trying to write a pig script that will take two parameters, one
>>>> parameter is the data and then the column name, in this case 'City'. The
>>>> script should then identify distinct values that it will take and then
>>>> create that many columns and populate it with 1 or 0 depending on which
>>>> one
>>>> is true.
>>>> Please let me know if you have got any ideas on how to approach this
>>>> problem.
>>>>
>>>> Thanks,
>>>> Austin
>>>>
>>>>

Re: Pig script to convert Categorical variables

Posted by Austin Chungath <au...@gmail.com>.

Thanks Eli,
That helps and it was exactly what I was doing. I wrote the UDF and it is
working.
I wrote a UDF that takes two parameters, first parameter was a bag of
tuples containing distinct values (ordered ascending)  and the second
parameter is the original data set. It is working but now I am trying
to figure out how I can return a schema for the columns created with the
names of the distinct values.

City
A
B
C
A
C
C

I want to convert it into

A             B            C
1              0            0
0              1            0
0              0            1
1              0            0
0              0            1
0              0            1
how can the UDF return a schema containing the names of the cities? is it
possible?
I should be able to generate A rather than generate $0.
Thanks,
Austin

On Tue, Feb 21, 2012 at 10:23 AM, Eli Finkelshteyn <ie...@gmail.com>wrote:

> Interesting problem. What I'm thinking is why not do two steps. First,
> read in the data, group on the column you care about. Then generate on it
> so you get just the distinct values for that column left. This would be
> something like:
>
> CITIES_GROUPED=  GROUP  INITIALBY  city;
> CITIES=  FOREACHCITIES_GROUPED GENERATE group AS city;
>
>
> Once you have that, convert it to a tuple, and then just write a quick udf
> that goes through the ORIGINAL data set and takes in the row value for the
> column you care about along with the distinct values tuple you just created
> as parameters and returns a tuple of 0s and one 1 where the one is in the
> position in the distinct values tuple that matches the row value for that
> row for the column you care about. You could write that udf in Java,
> Python, or one of the other supported udf languages, depending on your
> requirements.
>
> For inputting, you could do it either through a simple bash script (your
> use case is simple enough, I think), or you could go ahead and embed the
> PIG script in Java, Python, or one of the other languages that's supported
> for that functionality, so it's easy to expand if you later need to. I'm
> personally partial to Python and have had great results embedding in that.
> Just make sure you're on Pig 9.1+.
>
> Hopefully that helps,
> Eli
>
>
> On 2/20/12 6:56 AM, Prashant Kommireddi wrote:
>
>> This should work if the values are only A,B,C.
>>
>> M = load 'input' as (city:chararray);
>>
>> N = foreach M generate city == 'A' ? 1 : 0 as A, city == 'B' ? 1 : 0 as B,
>> city == 'C' ? 1 : 0 as C;
>>
>> However, if city values vary it might be a good option to do it by
>> embedding Pig in Java.
>> http://pig.apache.org/docs/r0.**9.1/cont.html#embed-java<http://pig.apache.org/docs/r0.9.1/cont.html#embed-java>
>>
>> Thanks,
>> Prashant
>>
>> On Mon, Feb 20, 2012 at 3:16 AM, Austin Chungath<au...@gmail.com>
>>  wrote:
>>
>> Consider this scenario:
>>>
>>> I have a column named City and it takes 3 possible values: A,B,C
>>>
>>> City
>>> A
>>> B
>>> C
>>> A
>>> C
>>> C
>>>
>>> I want to convert it into
>>>
>>> A             B            C
>>> 1              0            0
>>> 0              1            0
>>> 0              0            1
>>> 1              0            0
>>> 0              0            1
>>> 0              0            1
>>>
>>> I am trying to write a pig script that will take two parameters, one
>>> parameter is the data and then the column name, in this case 'City'. The
>>> script should then identify distinct values that it will take and then
>>> create that many columns and populate it with 1 or 0 depending on which
>>> one
>>> is true.
>>> Please let me know if you have got any ideas on how to approach this
>>> problem.
>>>
>>> Thanks,
>>> Austin
>>>
>>>
>

Re: Pig script to convert Categorical variables

Posted by Eli Finkelshteyn <ie...@gmail.com>.

Interesting problem. What I'm thinking is why not do two steps. First, 
read in the data, group on the column you care about. Then generate on 
it so you get just the distinct values for that column left. This would 
be something like:

CITIES_GROUPED=  GROUP  INITIALBY  city;
CITIES=  FOREACHCITIES_GROUPED GENERATE group AS city;

Once you have that, convert it to a tuple, and then just write a quick 
udf that goes through the ORIGINAL data set and takes in the row value 
for the column you care about along with the distinct values tuple you 
just created as parameters and returns a tuple of 0s and one 1 where the 
one is in the position in the distinct values tuple that matches the row 
value for that row for the column you care about. You could write that 
udf in Java, Python, or one of the other supported udf languages, 
depending on your requirements.

For inputting, you could do it either through a simple bash script (your 
use case is simple enough, I think), or you could go ahead and embed the 
PIG script in Java, Python, or one of the other languages that's 
supported for that functionality, so it's easy to expand if you later 
need to. I'm personally partial to Python and have had great results 
embedding in that. Just make sure you're on Pig 9.1+.

Hopefully that helps,
Eli

On 2/20/12 6:56 AM, Prashant Kommireddi wrote:
> This should work if the values are only A,B,C.
>
> M = load 'input' as (city:chararray);
>
> N = foreach M generate city == 'A' ? 1 : 0 as A, city == 'B' ? 1 : 0 as B,
> city == 'C' ? 1 : 0 as C;
>
> However, if city values vary it might be a good option to do it by
> embedding Pig in Java.
> http://pig.apache.org/docs/r0.9.1/cont.html#embed-java
>
> Thanks,
> Prashant
>
> On Mon, Feb 20, 2012 at 3:16 AM, Austin Chungath<au...@gmail.com>  wrote:
>
>> Consider this scenario:
>>
>> I have a column named City and it takes 3 possible values: A,B,C
>>
>> City
>> A
>> B
>> C
>> A
>> C
>> C
>>
>> I want to convert it into
>>
>> A             B            C
>> 1              0            0
>> 0              1            0
>> 0              0            1
>> 1              0            0
>> 0              0            1
>> 0              0            1
>>
>> I am trying to write a pig script that will take two parameters, one
>> parameter is the data and then the column name, in this case 'City'. The
>> script should then identify distinct values that it will take and then
>> create that many columns and populate it with 1 or 0 depending on which one
>> is true.
>> Please let me know if you have got any ideas on how to approach this
>> problem.
>>
>> Thanks,
>> Austin
>>

Re: Pig script to convert Categorical variables

Posted by Prashant Kommireddi <pr...@gmail.com>.

This should work if the values are only A,B,C.

M = load 'input' as (city:chararray);

N = foreach M generate city == 'A' ? 1 : 0 as A, city == 'B' ? 1 : 0 as B,
city == 'C' ? 1 : 0 as C;

However, if city values vary it might be a good option to do it by
embedding Pig in Java.
http://pig.apache.org/docs/r0.9.1/cont.html#embed-java

Thanks,
Prashant

On Mon, Feb 20, 2012 at 3:16 AM, Austin Chungath <au...@gmail.com> wrote:

> Consider this scenario:
>
> I have a column named City and it takes 3 possible values: A,B,C
>
> City
> A
> B
> C
> A
> C
> C
>
> I want to convert it into
>
> A             B            C
> 1              0            0
> 0              1            0
> 0              0            1
> 1              0            0
> 0              0            1
> 0              0            1
>
> I am trying to write a pig script that will take two parameters, one
> parameter is the data and then the column name, in this case 'City'. The
> script should then identify distinct values that it will take and then
> create that many columns and populate it with 1 or 0 depending on which one
> is true.
> Please let me know if you have got any ideas on how to approach this
> problem.
>
> Thanks,
> Austin
>

Pig script to convert Categorical variables

Posted by Austin Chungath <au...@gmail.com>.

Consider this scenario:

I have a column named City and it takes 3 possible values: A,B,C

City
A
B
C
A
C
C

I want to convert it into

A             B            C
1              0            0
0              1            0
0              0            1
1              0            0
0              0            1
0              0            1

I am trying to write a pig script that will take two parameters, one
parameter is the data and then the column name, in this case 'City'. The
script should then identify distinct values that it will take and then
create that many columns and populate it with 1 or 0 depending on which one
is true.
Please let me know if you have got any ideas on how to approach this
problem.

Thanks,
Austin