Posted to user@pig.apache.org by MiaoMiao <li...@gmail.com> on 2012/09/04 09:37:52 UTC

Re: Re: Weird bug of REPLACE

It's a pity the REPLACE documentation doesn't mention regex at all. Thank
you so much for your reply; knowing what's going on is such a relief.
Now I can trust myself with Pig a little more.


On Sat, 18 Aug 2012 at 00:05:29 AM, Cheolsoo Park <ch...@cloudera.com> wrote:
> Hi,

> If you look at the source code of REPLACE, what it does is basically:

> String source = "[02/Aug/2012:05:01:17";
> String target = "[";
> String replaceWith = "";
> return source.replaceAll(target, replaceWith);


> Note that Java's String.replaceAll() treats the target (the 2nd parameter
> of REPLACE) as a regular expression, and "[" is a special character in a
> regex. To match it literally, you have to escape it, so in your Pig script
> you should write:

> REPLACE(date,'\\[','')
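To illustrate the escaping point, here is a minimal Java sketch of two ways to remove a literal "[" from a string. It is not Pig's source code; the class and method names (ReplaceEscapeDemo, stripRegex, stripLiteral) are made up for illustration, and it relies only on String.replaceAll() and Pattern.quote() from the JDK.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Pig.
class ReplaceEscapeDemo {

    // Removes every match of the given regular expression from source.
    static String stripRegex(String source, String regex) {
        return source.replaceAll(regex, "");
    }

    // Removes every literal occurrence of text.
    // Pattern.quote() turns the text into a regex that matches it verbatim.
    static String stripLiteral(String source, String text) {
        return source.replaceAll(Pattern.quote(text), "");
    }

    public static void main(String[] args) {
        String date = "[02/Aug/2012:05:01:17";
        // "\\[" in Java source is the two-character regex \[ (an escaped bracket).
        System.out.println(stripRegex(date, "\\["));
        // The unescaped "[" is safe here because it is quoted first.
        System.out.println(stripLiteral(date, "["));
    }
}
```

Both calls should print the date without the leading bracket. In a Pig script only the escaped form is available through REPLACE, since the quoting step here happens in Java.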

> Now regarding the result that you're seeing, it looks like whatever
> exception is thrown inside REPLACE is swallowed instead of failing the
> job, and null is returned:

>         try {
>             ...
>         } catch (Exception e) {
>             warn("Failed to process input; error - " + e.getMessage(),
>                     PigWarning.UDF_WARNING_1);
>             return null;
>         }
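The effect of that catch block can be reproduced outside Pig with a small Java sketch. This is an assumption-laden stand-in, not the REPLACE UDF itself: SwallowDemo and replaceOrNull are hypothetical names, and the warning is written to stderr instead of going through Pig's warn() mechanism.

```java
// Hypothetical demo class that mimics the swallow-and-return-null pattern.
class SwallowDemo {

    // Calls replaceAll and returns null if the target is not a valid regex.
    static String replaceOrNull(String source, String target, String replacement) {
        try {
            return source.replaceAll(target, replacement);
        } catch (Exception e) {
            // An unescaped "[" lands here with a PatternSyntaxException.
            System.err.println("Failed to process input; error - " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        // Invalid regex: the exception is swallowed and null comes back.
        System.out.println(replaceOrNull("[02/Aug/2012:05:01:17", "[", ""));
        // A lone "]" is a valid regex, so this replacement succeeds.
        System.out.println(replaceOrNull("-0600]", "]", ""));
    }
}
```

This matches the dump in the original report: the date field came back empty while the timezone field was cleaned up as expected.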


> But I do see the following message at the end of the job status:

> 2012-08-17 16:51:25,061 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning UDF_WARNING_1 1 time(s)

> I must admit that this is not very visible though.

> Thanks,
> Cheolsoo

> On Mon, Aug 13, 2012 at 10:03 PM, MiaoMiao <li...@gmail.com> wrote:

> > I was using Pig for an ETL job and ran into a strange bug in the
> > built-in REPLACE function.
> >
> > After I replaced '[' with '' in '[02/Aug/2012:05:01:17', the whole
> > string went blank.
> >
> > Here is some info that may help with debugging.
> >
> > My pig version is: Apache Pig version 0.11.0-SNAPSHOT (r1364475)
> > compiled Jul 23 2012, 10:30:53
> >
> > The original text file:
> > ip.ip.ip.ip - - [02/Aug/2012:05:01:17 -0600] "GET
> > /player.php/sid/XNDM0Njk3MjEy/v.swf HTTP/1.1" 302 26
> >
> > The whole pig script is :
> > read = load '/home/test/apacheLog'
> > using PigStorage(' ')
> > as (
> >           ip:chararray
> >         , indentity:chararray
> >         , name:chararray
> >         , date:chararray
> >         , timezone:chararray
> >         , method:chararray
> >         , path:chararray
> >         , protocol:chararray
> >         , status:chararray
> >         , size:chararray
> > );
> > dump read;
> >
> > --(ip.ip.ip.ip,-,-,[02/Aug/2012:05:01:17,-0600],"GET,/player.php/sid/XNDM0Njk3MjEy/v.swf,HTTP/1.1",302,26)
> > data = foreach read generate
> >           ip
> >         , REPLACE(date,'[','')
> >         , REPLACE(timezone,']','')
> >         , REPLACE(method,'"','')
> >         , path
> >         , REPLACE(protocol,'"','')
> >         , status
> >         , size;
> > describe data;
> > --data: {ip: chararray,date: chararray,timezone: chararray,method: chararray,path: chararray,protocol: chararray,status: chararray,size: chararray}
> > dump data;
> >
> > --(ip.ip.ip.ip,,-0600,GET,/player.php/sid/XNDM0Njk3MjEy/v.swf,HTTP/1.1,302,26)
> >

Re: Re: Weird bug of REPLACE

Posted by Bill Graham <bi...@gmail.com>.
Opened a JIRA to better clarify the docs here:
https://issues.apache.org/jira/browse/PIG-2905




-- 
*Note that I'm no longer using my Yahoo! email address. Please email me at
billgraham@gmail.com going forward.*