Posted to user@pig.apache.org by MiaoMiao <li...@gmail.com> on 2012/08/14 07:03:49 UTC

Weird bug of REPLACE

I am using Pig for an ETL job and ran into a strange bug in the
built-in REPLACE function.

When I replace '[' with '' in '[02/Aug/2012:05:01:17', the whole
string comes back blank.

Here is some information that may help with debugging.

My Pig version is: Apache Pig version 0.11.0-SNAPSHOT (r1364475)
compiled Jul 23 2012, 10:30:53

The original text file:
ip.ip.ip.ip - - [02/Aug/2012:05:01:17 -0600] "GET
/player.php/sid/XNDM0Njk3MjEy/v.swf HTTP/1.1" 302 26

The whole Pig script is:
read = load '/home/test/apacheLog'
using PigStorage(' ')
as (
	  ip:chararray
	, indentity:chararray
	, name:chararray
	, date:chararray
	, timezone:chararray
	, method:chararray
	, path:chararray
	, protocol:chararray
	, status:chararray
	, size:chararray
);
dump read;
--(ip.ip.ip.ip,-,-,[02/Aug/2012:05:01:17,-0600],"GET,/player.php/sid/XNDM0Njk3MjEy/v.swf,HTTP/1.1",302,26)
data = foreach read generate
	  ip
	, REPLACE(date,'[','')
	, REPLACE(timezone,']','')
	, REPLACE(method,'"','')
	, path
	, REPLACE(protocol,'"','')
	, status
	, size;
describe data;
--data: {ip: chararray,date: chararray,timezone: chararray,method: chararray,path: chararray,protocol: chararray,status: chararray,size: chararray}
dump data;
--(ip.ip.ip.ip,,-0600,GET,/player.php/sid/XNDM0Njk3MjEy/v.swf,HTTP/1.1,302,26)

Re: Weird bug of REPLACE

Posted by Cheolsoo Park <ch...@cloudera.com>.
Hi,

If you look at the source code of REPLACE, what it does is basically:

String source = "[02/Aug/2012:05:01:17";
String target = "[";
String replaceWith = "";
return source.replaceAll(target, replaceWith);


Note that Java's String.replaceAll() takes a regular expression as its first
parameter (i.e. target, the 2nd argument of REPLACE), and "[" is a regex
metacharacter. To match it literally, you have to escape it, so in your Pig
script, you should do:

REPLACE(date,'\\[','')
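
To see the regex behaviour outside of Pig, here is a minimal Java sketch. The class and method names (ReplaceRegexDemo, strip, isValidRegex) are made up for illustration and are not part of Pig. It only relies on String.replaceAll() and java.util.regex.Pattern from the JDK. Note that "\\[" here is a Java string literal for the regex \[.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class ReplaceRegexDemo {
    // Hypothetical helper: remove every match of the regex from the source.
    static String strip(String source, String regex) {
        return source.replaceAll(regex, "");
    }

    // Hypothetical helper: report whether the string compiles as a regex.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String date = "[02/Aug/2012:05:01:17";

        // A lone "[" opens a character class that is never closed.
        System.out.println(isValidRegex("["));   // false
        // A lone "]" is accepted as a literal, which is why the
        // timezone column in the original script came out fine.
        System.out.println(isValidRegex("]"));   // true

        // Escaping the bracket makes it a literal match.
        System.out.println(strip(date, "\\["));  // 02/Aug/2012:05:01:17
        // Pattern.quote() escapes an arbitrary literal for you.
        System.out.println(strip(date, Pattern.quote("[")));  // 02/Aug/2012:05:01:17
        System.out.println(strip("-0600]", "]"));  // -0600
    }
}
```

Pattern.quote() is handy in plain Java when the text to remove is not known in advance. Inside a Pig script the manual escape shown above is the straightforward option.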

As for the result you're seeing, it looks like any exception thrown inside
REPLACE is swallowed instead of failing the job, and null is returned:

        try {
            ...
        } catch (Exception e) {
            warn("Failed to process input; error - " + e.getMessage(),
                 PigWarning.UDF_WARNING_1);
            return null;
        }


But I do see the following message at the end of the job status:

2012-08-17 16:51:25,061 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning UDF_WARNING_1 1 time(s)

I must admit that this is not very visible though.
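
To make the failure mode concrete, here is a rough Java sketch of the catch-and-return-null pattern. It is not Pig's actual source: the class SwallowedErrorDemo, the method replaceOrNull and the warnings list are stand-ins invented for this example, with the list playing the role of Pig's warning counter.

```java
import java.util.ArrayList;
import java.util.List;

public class SwallowedErrorDemo {
    // Hypothetical stand-in for Pig's warning aggregation; not a Pig API.
    static final List<String> warnings = new ArrayList<>();

    // Same shape as the try/catch above: any exception becomes a
    // recorded warning plus a null result instead of a failed job.
    static String replaceOrNull(String source, String regex, String replacement) {
        try {
            return source.replaceAll(regex, replacement);
        } catch (Exception e) {
            warnings.add("Failed to process input; error - " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        String date = "[02/Aug/2012:05:01:17";

        // Invalid regex: the exception is swallowed and null comes back,
        // which shows up as an empty field in the dumped tuple.
        System.out.println(replaceOrNull(date, "[", ""));    // null
        // Escaped regex: the bracket is removed as intended.
        System.out.println(replaceOrNull(date, "\\[", ""));  // 02/Aug/2012:05:01:17
        // Only the failing call left a warning behind.
        System.out.println(warnings.size());                 // 1
    }
}
```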

Thanks,
Cheolsoo
