Posted to user@pig.apache.org by praveenesh kumar <pr...@gmail.com> on 2012/02/10 12:22:24 UTC

Regex expression in FOREACH

Is it possible to use a regular expression in a FOREACH statement to
generate only the columns whose names match the regex?

Suppose I want to generate only those columns whose names end with 'XYZ'. Is
that possible in Pig using a regex?

Thanks,
Praveenesh

Re: Regex expression in FOREACH

Posted by Alan Gates <ga...@hortonworks.com>.

Pig does not support this.  But it would be easy enough to write a UDF that took the entire tuple, applied the regex to the column names from the input schema and then returned only those columns.

Alan.
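
A minimal sketch of the selection logic suggested in the reply above, written in plain Python rather than against Pig's UDF API. The function name select_columns and the sample values are hypothetical; a real UDF would read the column names from its input schema instead of taking them as an argument.

```python
import re

def select_columns(names, row, pattern):
    """Return (name, value) pairs for columns whose name matches the regex.

    names   -- column names in schema order (hypothetical input; a real
               Pig UDF would take these from its input schema)
    row     -- the tuple of values, same length and order as names
    pattern -- regular expression applied to each column name
    """
    if len(names) != len(row):
        raise ValueError("names and row must have the same length")
    regex = re.compile(pattern)
    # re.search matches anywhere in the name, so anchor with ^ or $ as needed.
    return [(n, v) for n, v in zip(names, row) if regex.search(n)]


# Column names taken from the thread; the values are made up for illustration.
names = ["Name", "Update1", "Update50", "Update100", "Total", "Description"]
row = ("widget", 3, 7, 11, 21, "sample row")

# Columns whose names start with "Update".
starts_with_update = select_columns(names, row, r"^Update")

# Columns whose names end with "XYZ" (hypothetical names).
ends_with_xyz = select_columns(["aXYZ", "XYZb", "cXYZ"], (1, 2, 3), r"XYZ$")
```

Anchoring the pattern (^Update, XYZ$) keeps a name such as "LastUpdate" or "XYZb" from matching by accident.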


Re: Regex expression in FOREACH

Posted by praveenesh kumar <pr...@gmail.com>.

Any info on this? It's kind of urgent.

Thanks,
Praveenesh


On Sat, Feb 11, 2012 at 1:02 AM, Grig Gheorghiu <gr...@gmail.com> wrote:

> Ah, OK, I get it... but I don't know the answer. Hopefully somebody on
> the list will reply; it's an interesting problem.

Re: Regex expression in FOREACH

Posted by praveenesh kumar <pr...@gmail.com>.

No, this is not what I was asking for.
Suppose I have column names like:

1. Name
2. Update1
3. Update50
4. Update100
5. Total
6. Description

I want to generate all the columns whose names start with "Update".

If I have a small number of columns, I can pick them out by eye. But if I
have around 100 columns, it's rather difficult.
In Hive we can do this, as we can in SQL. I want to know whether Pig can
also generate columns using some kind of regex.

Thanks,
Praveenesh


Re: Regex expression in FOREACH

Posted by Grig Gheorghiu <gr...@gmail.com>.

You can use EXTRACT.

REGISTER file:/home/hadoop/lib/pig/piggybank.jar;
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();

Assume relation A contains tuples with a field called key of the form:

id=123232|val=asdsa|

Then you can extract the id field like this:

B = FOREACH A GENERATE
        FLATTEN(
                EXTRACT(key, 'id=([^\\|]+)[\\|]*')
        )
        AS (
                id: chararray
        );

Note that each backslash needs to be escaped, hence the \\.

HTH,

Grig
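
To see what the pattern in the example above captures, here is a small Python check. It assumes the Pig string literal 'id=([^\\|]+)[\\|]*' reaches the regex engine as id=([^\|]+)[\|]* once the doubled backslashes are unescaped, and that Python's re module handles this particular pattern the same way as the Java regex engine behind Pig. The sample string is the one from the message; the variable names are only for illustration.

```python
import re

# The regex as the engine sees it after each \\ in the Pig literal
# has been reduced to a single \ (an assumption about Pig's unescaping).
pattern = re.compile(r"id=([^\|]+)[\|]*")

sample = "id=123232|val=asdsa|"

match = pattern.search(sample)
# Group 1 is everything after "id=" up to, but not including, the first pipe.
extracted_id = match.group(1) if match else None
print(extracted_id)
```

Note that this pulls a value out of a field's contents; it does not select columns by name, which is what the original question asks about.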