Posted to user@pig.apache.org by Tamir Kamara <ta...@gmail.com> on 2009/03/26 08:12:00 UTC

Regex operand - chararray only

Hi,

Following a COGROUP I would like to filter results by one of the fields but
I'm getting an error: Operand of Regex can be CharArray only. The relevant
lines in my script are:
x1 = COGROUP p3 BY domain, rdt1 BY from, f4 BY target;
x2 = FILTER x1 BY ( IsEmpty(p3) AND (IsEmpty(rdt1) OR (rdt1.to matches '.*com')) );
x3 = FOREACH x2 GENERATE flatten(f4);

Output of DESCRIBE x1:
x1: {group: chararray,p3: {domain: chararray},rdt1: {from: chararray,to: chararray},f4: {source: chararray,target: chararray}}

I'm not sure why the error occurs. Is it because rdt1 inside x1 is a bag, so
multiple rdt1 tuples can exist in the same group?

I can work around this with the following script:
x1 = COGROUP p3 BY domain, rdt1 BY from, f4 BY target parallel 32;
x2 = FOREACH x1 GENERATE flatten(f4), COUNT(p3) as p3_count, COUNT(rdt1) as rdt1_count, flatten(rdt1.to);
x3 = FILTER x2 BY ( p3_count==0 AND (rdt1_count==0 OR (to matches '.*com')) );
x4 = FOREACH x3 GENERATE source, target;

but it seems too complicated to me. Is there a way to make my first version
work?

Thanks in advance,
Tamir

RE: Regex operand - chararray only

Posted by Santhosh Srinivasan <sm...@yahoo-inc.com>.
Tamir,

x2 = FILTER x1 BY ( IsEmpty(p3) AND (IsEmpty(rdt1) OR (rdt1.to matches '.*com')) );

Here, projecting the column 'to' from the bag 'rdt1' gives you a bag of
chararrays rather than a single chararray, which is why the regex operator
rejects it.

You could write a UDF that takes this bag, iterates over its contents, and
does a regex match on each item.

Thanks,
Santhosh
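
As an alternative to a UDF, the per-item match can probably also be expressed with a nested FOREACH block. The following is only a sketch and was not posted in this thread: it assumes that "rdt1.to matches" should mean "at least one 'to' value in the bag matches", the alias rdt1_com is made up for illustration, and it has not been run against the poster's data.

```
x1 = COGROUP p3 BY domain, rdt1 BY from, f4 BY target;

-- Inside the nested block 'to' is a plain chararray field of each rdt1 tuple,
-- so the regex operand is valid. rdt1_com is a hypothetical alias.
x2 = FOREACH x1 {
    rdt1_com = FILTER rdt1 BY to matches '.*com';
    GENERATE group, p3, rdt1, rdt1_com AS rdt1_com, f4;
};

-- Keep groups with no p3 rows where rdt1 is either empty or has at least
-- one tuple whose 'to' matched the pattern.
x3 = FILTER x2 BY ( IsEmpty(p3) AND (IsEmpty(rdt1) OR (NOT IsEmpty(rdt1_com))) );
x4 = FOREACH x3 GENERATE flatten(f4);
```

If the intent is instead that every 'to' value must match, the nested FILTER could select the non-matching tuples and the outer FILTER could require that bag to be empty.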
