Posted to user@pig.apache.org by praveenesh kumar <pr...@gmail.com> on 2012/02/10 12:22:24 UTC
Regex expression in FOREACH
Is it possible to specify a regex in a FOREACH statement to
generate only the columns whose names match the regex?
Suppose I want to generate only those columns whose names end with 'XYZ'. Is it
possible to do this in Pig using a regex?
Thanks,
Praveenesh
Re: Regex expression in FOREACH
Posted by Alan Gates <ga...@hortonworks.com>.
Pig does not support this. But it would be easy enough to write a UDF that took the entire tuple, applied the regex to the column names from the input schema and then returned only those columns.
Alan.
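To make the suggestion above concrete, here is a minimal sketch in plain Python of the selection logic such a UDF would need: match a regex against the column names, then keep only the matching positions of the tuple. It is not a Pig UDF and uses no Pig API. The helper select_columns, the sample row values, and the assumption that the column names are already available as a list are all hypothetical. The column names are the ones given as an example in this thread.

```python
import re


def select_columns(names, row, pattern):
    """Return (names, values) for the columns whose name matches `pattern`.

    Hypothetical helper: `names` stands in for the field names of the
    input schema, `row` for one tuple of values in the same order.
    """
    regex = re.compile(pattern)
    # Positions of the columns whose name matches (re.search semantics).
    keep = [i for i, name in enumerate(names) if regex.search(name)]
    return [names[i] for i in keep], tuple(row[i] for i in keep)


# Column names from the example in this thread; the values are made up.
names = ["Name", "Update1", "Update50", "Update100", "Total", "Description"]
row = ("widget", 3, 7, 11, 21, "sample row")

kept_names, kept_values = select_columns(names, row, r"^Update")
print(kept_names)   # ['Update1', 'Update50', 'Update100']
print(kept_values)  # (3, 7, 11)
```

A real UDF would also have to report the reduced schema back to Pig, which this sketch does not attempt.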
On Feb 10, 2012, at 11:30 AM, praveenesh kumar wrote:
> No, this is not what I was asking for.
> Suppose I have column names like:
>
> 1. Name
> 2. Update1
> 3. Update50
> 4. Update100
> 5. Total
> 6. Description
>
> I want to generate all the columns whose names start with Update.
>
> If I have a small number of columns, I can do this by eyeballing them. But if I
> have something like 100 columns, it's rather difficult.
> We can do this in Hive, as in SQL. I want to know whether it is also possible
> in Pig to generate columns using some kind of regex.
>
>
> Thanks,
> Praveenesh
Re: Regex expression in FOREACH
Posted by praveenesh kumar <pr...@gmail.com>.
Any info on this? It's kind of urgent.
Thanks,
Praveenesh
On Sat, Feb 11, 2012 at 1:02 AM, Grig Gheorghiu <gr...@gmail.com> wrote:
> Ah OK, I get it... but I don't know the answer. Hopefully somebody on
> the list will reply; it's an interesting problem.
Re: Regex expression in FOREACH
Posted by praveenesh kumar <pr...@gmail.com>.
No, this is not what I was asking for.
Suppose I have column names like:
1. Name
2. Update1
3. Update50
4. Update100
5. Total
6. Description
I want to generate all the columns whose names start with Update.
If I have a small number of columns, I can do this by eyeballing them. But if I
have something like 100 columns, it's rather difficult.
We can do this in Hive, as in SQL. I want to know whether it is also possible
in Pig to generate columns using some kind of regex.
Thanks,
Praveenesh
On Fri, Feb 10, 2012 at 11:38 PM, Grig Gheorghiu <gr...@gmail.com> wrote:
> You can use EXTRACT.
>
> REGISTER file:/home/hadoop/lib/pig/piggybank.jar;
> DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
>
> Assume relation A contains tuples with a field called key of the form:
>
> id=123232|val=asdsa|
>
> Then you can extract the id field like this:
>
> B = FOREACH A GENERATE
>     FLATTEN(
>         EXTRACT(key, 'id=([^\\|]+)[\\|]*')
>     )
>     AS (
>         id: chararray
>     );
>
> Note that each backslash needs to be escaped, hence the \\.
>
> HTH,
>
> Grig
Re: Regex expression in FOREACH
Posted by Grig Gheorghiu <gr...@gmail.com>.
You can use EXTRACT.
REGISTER file:/home/hadoop/lib/pig/piggybank.jar;
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
Assume relation A contains tuples with a field called key of the form:
id=123232|val=asdsa|
Then you can extract the id field like this:
B = FOREACH A GENERATE
    FLATTEN(
        EXTRACT(key, 'id=([^\\|]+)[\\|]*')
    )
    AS (
        id: chararray
    );
Note that each backslash needs to be escaped, hence the \\.
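For readers who want to check the pattern outside Pig, the sketch below applies the same regex with Python's re module to the sample value above. It assumes that Pig turns each \\ in the quoted string into a single backslash before the regex engine sees it, and that the pattern is applied as a search rather than a full match. The helper extract_id is hypothetical and not part of piggybank.

```python
import re

# The Pig literal 'id=([^\\|]+)[\\|]*' is assumed to reach the regex
# engine as id=([^\|]+)[\|]* once the doubled backslashes are unescaped.
PATTERN = re.compile(r"id=([^\|]+)[\|]*")


def extract_id(key):
    """Return the text captured after 'id=' up to the next pipe, or None."""
    match = PATTERN.search(key)
    return match.group(1) if match else None


print(extract_id("id=123232|val=asdsa|"))  # 123232
```

The capture group stops at the first pipe, so only the id value is returned and the val field is left out.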
HTH,
Grig
On Fri, Feb 10, 2012 at 3:22 AM, praveenesh kumar <pr...@gmail.com> wrote:
> Is it possible to specify a regex in a FOREACH statement to
> generate only the columns whose names match the regex?
>
> Suppose I want to generate only those columns whose names end with 'XYZ'. Is it
> possible to do this in Pig using a regex?
>
> Thanks,
> Praveenesh