Posted to user@pig.apache.org by Michael Harris <Mi...@Telespree.com> on 2008/04/04 18:45:25 UTC

Escape characters in Pig Queries

I guess my last message was obvious/stupid since I am not getting any
responses, but hopefully I won't be 0/2.

I love using Pig and I think it's a fantastic tool for creating complex
map-reduce programs quickly. That said, I am having 2 problems in
addition to the one below. Hopefully I am just missing something easy
and someone can shoot me a quick response.

I have written my own eval func that extracts events from our event log.
It then splits the event by some arbitrary regex and then finds the last
match from that event that does not match another regex. The queries are
as follows.

eventlog = LOAD
    '/user/hadoop/index8mbGZnotes/{1205478000254_1205857683529.gz,1205857686408_1206295646386.gz,1206295646442_1206757710701.gz,1206757712403_1207113039900.gz,1207113039930_1207205997234.gz}'
    USING PigStorage('	');
filterDate = FILTER eventlog BY $1 >= '1204358400000' AND $1 <= '1209625200000';
filterCh = FILTER filterDate BY $15 eq 'Sony' OR $15 eq 'Dell' OR $15 eq 'HP';
filter1 = FILTER filterCh BY ($5 == 11 AND $6 == 15 AND $7 == 406);
filtered = FOREACH filter1 GENERATE
    LastPageExtractor($8,'.*(ui/cancel.*)|(.*ui/error.*)','[0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{2}/[0-9]{2}/[0-9]{4}'), $15;
grouped = GROUP filtered BY ($0, $1);
resultUnordered = FOREACH grouped GENERATE FLATTEN(group), FLATTEN(COUNT(filtered)) PARALLEL 14;

The func is LastPageExtractor(inputValue, excludeRegex, splitRegex)

This all works fine, but I would like to change my split regex to
\\|+[0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{2}/[0-9]{2}/[0-9]{4}
However, when I do that I get this:

Exception in thread "Thread-6"
org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at
line 1, column 93.  Encountered: "|" (124), after : "\'\\"

Is there some special escape sequence I should know about? I searched
for "escape" in the PigLatin wiki and found nothing.

The second problem I have is I am not able to register jars/funcs
without packaging them into the pig.jar in the
org.apache.pig.impl.builtin package. I have tried everything I can think
of and everything in the documentation. When I register the jar with
PigServer.registerJar and use the fully qualified function name,
all the task trackers fail with:

java.lang.RuntimeException: could not instantiate
'telespree.analytics.pig.LastPageExtractor' with arguments '[]'

I do:

server.registerJar("c:\\telespree.jar");

and

filtered = FOREACH filter1 GENERATE
    telespree.analytics.pig.LastPageExtractor($8,'.*(ui/cancel.*)|(.*ui/error.*)','[0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{2}/[0-9]{2}/[0-9]{4}'), $15;");

I even tried to put these functions in the default package in pig.jar
since I saw in the code you do lookups with 
        packageImportList.add("");
        packageImportList.add("org.apache.pig.builtin.");
        packageImportList.add("com.yahoo.pig.yst.sds.ULT.");
        packageImportList.add("org.apache.pig.impl.builtin.");     

So I figured the "" import would find my function, but alas I
get the same error:
java.lang.RuntimeException: could not instantiate 'LastPageExtractor'
with arguments '[]'

However if I package them in org.apache.pig.impl.builtin it all works
fine.
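
The prefix lookup described above can be sketched roughly as follows. This
is not Pig's actual code. The class name PrefixLookupSketch, the resolve
method, and the java.util/java.io prefixes are assumptions chosen so the
example runs on a plain JDK. It only illustrates that a list of package
prefixes helps when the class is loadable by the class loader doing the
lookup; a prefix of "" cannot find a class that is not on the classpath.

```java
import java.util.Arrays;
import java.util.List;

public class PrefixLookupSketch {
    // Try each prefix in order and return the first class that loads.
    // Returns null when no prefix + name combination can be loaded.
    public static Class<?> resolve(String name, List<String> prefixes) {
        for (String prefix : prefixes) {
            try {
                return Class.forName(prefix + name);
            } catch (ClassNotFoundException e) {
                // Not visible under this prefix; fall through to the next one.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical prefix list; "" means "use the name exactly as given".
        List<String> prefixes = Arrays.asList("", "java.util.", "java.io.");
        System.out.println(resolve("ArrayList", prefixes));    // found via "java.util."
        System.out.println(resolve("java.io.File", prefixes)); // found via ""
        System.out.println(resolve("NoSuchFunc", prefixes));   // null: not loadable anywhere
    }
}
```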

Any help on these 3 areas would be much appreciated!

-Michael




-----Original Message-----
From: Michael Harris [mailto:MichaelH@Telespree.com] 
Sent: Wednesday, April 02, 2008 10:47 AM
To: pig-user@incubator.apache.org
Subject: MapReduceLauncher static fields

Hello,

 

I have written a pig application that does a fixed set of queries
on-demand through a web interface. I am trying to get the progress of
the queries from the PigServer, but I have noticed that the source of
the progress data is all static fields in the MapReduceLauncher. Clearly
my webapp must be able to handle multiple concurrent pig queries (and be
thread-safe) and I would like to report the progress of each individual
query (job set) to the end user.  Do these static fields mean that I
would get the progress of multiple concurrent queries initiated by
different PigServer instances, or would I get the overall progress of
the MapReduceLauncher for all queries currently being executed?
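
To make the concern concrete, here is a small Java sketch of the language
behaviour in question. It says nothing about what MapReduceLauncher really
does with its fields. SharedLauncher and PerQueryLauncher are hypothetical
classes, and the sketch assumes progress is a plain value that each query
overwrites.

```java
public class StaticProgressSketch {
    // Hypothetical launcher that keeps progress in a static field:
    // there is one value per class, shared by every instance in the JVM.
    public static class SharedLauncher {
        static double progress;
        public void report(double p) { progress = p; }
        public double read() { return progress; }
    }

    // Hypothetical alternative that keeps progress per instance.
    public static class PerQueryLauncher {
        private double progress;
        public void report(double p) { progress = p; }
        public double read() { return progress; }
    }

    public static void main(String[] args) {
        SharedLauncher a = new SharedLauncher();
        SharedLauncher b = new SharedLauncher();
        a.report(0.25);
        b.report(0.80);
        // Both instances see the last value written by either of them.
        System.out.println(a.read() + " " + b.read()); // 0.8 0.8

        PerQueryLauncher c = new PerQueryLauncher();
        PerQueryLauncher d = new PerQueryLauncher();
        c.report(0.25);
        d.report(0.80);
        // Each instance keeps its own value.
        System.out.println(c.read() + " " + d.read()); // 0.25 0.8
    }
}
```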

 

Thanks,
Michael


Re: Escape characters in Pig Queries

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Mridul Muralidharan wrote:
> 
> Hi Michael,
> 
>   Not sure about the character escaping, but I do have my UDFs in jars 
> independent of the Pig jars, and that works fine for me. You might want to 
> check for path issues?

Also check whether the UDF has an empty (no-argument) constructor, or no
explicit constructor at all. IIRC Pig uses the no-argument constructor to
create the UDF.
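
A minimal sketch of that check, assuming the UDF is created reflectively
through its no-argument constructor as recalled above. WithDefault and
WithArgsOnly are hypothetical stand-ins, not real Pig UDFs, and the helper
canInstantiateWithoutArgs is only for illustration.

```java
import java.lang.reflect.Constructor;

public class NoArgConstructorCheck {
    // Hypothetical class with no explicit constructor: the compiler
    // supplies a no-argument constructor.
    public static class WithDefault {
    }

    // Hypothetical class that only declares a constructor taking arguments,
    // so no no-argument constructor exists.
    public static class WithArgsOnly {
        public WithArgsOnly(String regex) {
        }
    }

    // Returns true if the class can be created reflectively with no arguments.
    public static boolean canInstantiateWithoutArgs(Class<?> cls) {
        try {
            Constructor<?> ctor = cls.getDeclaredConstructor();
            ctor.setAccessible(true);
            ctor.newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            // Covers a missing no-arg constructor as well as failures inside it.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canInstantiateWithoutArgs(WithDefault.class));  // true
        System.out.println(canInstantiateWithoutArgs(WithArgsOnly.class)); // false
    }
}
```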

Mridul



Re: Escape characters in Pig Queries

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Hi Michael,

   Not sure about the character escaping, but I do have my UDFs in jars 
independent of the Pig jars, and that works fine for me. You might want to 
check for path issues?
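
One way to rule out a path problem on the client side is to confirm that
the jar exists and really contains the class under the expected package
path. The sketch below uses only the JDK. JarClassCheck and its methods are
hypothetical helpers, and the default jar path and class name are the ones
quoted in this thread. It does not check whether the jar reaches the task
trackers.

```java
import java.io.File;
import java.io.IOException;
import java.util.jar.JarFile;

public class JarClassCheck {
    // Convert a fully qualified class name to its entry name inside a jar.
    public static String toEntryName(String className) {
        return className.replace('.', '/') + ".class";
    }

    // Returns true only if the jar file exists, is readable,
    // and has an entry for the given class.
    public static boolean jarContainsClass(File jar, String className) {
        if (!jar.isFile()) {
            return false;
        }
        try (JarFile jarFile = new JarFile(jar)) {
            return jarFile.getEntry(toEntryName(className)) != null;
        } catch (IOException e) {
            // Unreadable or corrupt jar: treat as "class not available".
            return false;
        }
    }

    public static void main(String[] args) {
        File jar = new File(args.length > 0 ? args[0] : "c:\\telespree.jar");
        String cls = args.length > 1 ? args[1]
                : "telespree.analytics.pig.LastPageExtractor";
        System.out.println(jar + " contains " + cls + ": "
                + jarContainsClass(jar, cls));
    }
}
```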

Regards,
Mridul



Re: Escape characters in Pig Queries

Posted by Alan Gates <ga...@yahoo-inc.com>.
The issue with not being able to escape regular expressions looks like a 
bug; you should file a JIRA so that it gets addressed.
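
Until that is addressed, one possible workaround is to avoid the backslash
altogether: in Java regular expressions a pipe inside a character class,
[|], matches a literal pipe just like \| does. The sketch below only shows
that the two patterns split a string the same way in Java. Whether the Pig
lexer accepts [|] inside a quoted string has not been tested here, although
the exclude regex in the original query already contains an unescaped pipe.
The sample event string is made up for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class PipeSplitSketch {
    // Timestamp part of the split regex, as quoted in the thread.
    static final String TIMESTAMP =
            "[0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{2}/[0-9]{2}/[0-9]{4}";

    // Split using a backslash-escaped pipe (the form that failed to lex in Pig).
    public static String[] splitEscaped(String event) {
        return Pattern.compile("\\|+" + TIMESTAMP).split(event);
    }

    // Split using a character class, which needs no backslash at all.
    public static String[] splitCharClass(String event) {
        return Pattern.compile("[|]+" + TIMESTAMP).split(event);
    }

    public static void main(String[] args) {
        // Hypothetical event text; the real log format is not shown in the thread.
        String event =
                "ui/start||10:15:00 04/04/2008ui/error|10:16:00 04/04/2008ui/done";
        System.out.println(Arrays.toString(splitEscaped(event)));
        System.out.println(Arrays.toString(splitCharClass(event)));
        // Both lines print [ui/start, ui/error, ui/done]
    }
}
```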

As for not being able to instantiate your function when it's in another 
jar, we have not seen this before, but we have not tested extensively 
on Windows either.  Could you post your jar file (or one that reproduces 
the problem with a simple function, if your function is complex)?

Alan.

