Posted to user@pig.apache.org by Dylan Sather <dy...@gmail.com> on 2013/04/16 07:03:24 UTC

Pig on EMR: how to include semicolon in regex argument of EXTRACT function

Hi y'all,

First time on this list, and hoping you might be able to help me with a
(possible) issue.

I'm working with some data in Pig that includes strings of interest,
optionally separated by semicolons and in random order, e.g.

    test=12345;foo=bar
    test=12345
    foo=bar;test=12345

The following code should extract the value for the 'test' key:

    blah =
      FOREACH
        data
      GENERATE
        FLATTEN (
          EXTRACT (
            str_of_interest,
            'test=(\\S+);?'
          )
        )
        AS (
          test: chararray
        )
      ;
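
For reference, what the pattern itself captures can be checked outside Pig. The following is a minimal sketch in Python (an assumption; the thread itself uses only Pig), with the pattern written as the regex engine sees it, i.e. with a single backslash. It also assumes search-style matching (the first match anywhere in the string); the helper name extract_test is made up for illustration.

```python
import re

# Pattern as the regex engine sees it (the Pig literal '\\S' becomes \S).
PATTERN = re.compile(r"test=(\S+);?")

samples = ["test=12345;foo=bar", "test=12345", "foo=bar;test=12345"]

def extract_test(s):
    """Return capture group 1 of the first match, or None if nothing matches."""
    m = PATTERN.search(s)
    return m.group(1) if m else None

for s in samples:
    print(s, "->", extract_test(s))
```

Note that \S+ is greedy and also matches semicolons, so for the first sample the group captures 12345;foo=bar rather than 12345. A character class that stops at the semicolon would be tighter, but it too needs a semicolon inside the quoted pattern.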

However, when running the code, I encounter the following error:

    <line 46, column 0>  mismatched character '<EOF>' expecting '''
    2013-04-16 04:46:05,245 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 46, column 0>  mismatched character '<EOF>' expecting '''

I thought I had my regex escape syntax off at first, but that doesn't
appear to be the problem. The only information I get from a Google search
is a bug report (https://issues.apache.org/jira/browse/PIG-2507) that
appears to have been recently fixed, but it's still an issue on the Amazon
EMR cluster I'm running (spun up ad hoc, just now, for this analysis).

As suggested in the bug report and elsewhere, I tried replacing the
semicolon with its Unicode escape (\u003B), but that yields the same error.

I could be crazy and this could be a syntax issue, so I'm hoping someone
might be able to point me in the right direction or confirm that this is an
existing problem. If the latter, are there any workarounds (either in Pig,
or for matching the string I want)?

Cheers.
Dylan

Re: Pig on EMR: how to include semicolon in regex argument of EXTRACT function

Posted by Dylan Sather <dy...@gmail.com>.
Thanks, Marcos! I ended up going with a UDF and it's working great.
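
The UDF itself was not posted to the list. As a rough idea of what one could look like, here is a minimal sketch assuming a Python (Jython) UDF. The function name extract_test and the tighter pattern are illustrative assumptions, not the code actually used. Because the regex lives in the UDF source, the semicolon never appears inside a quoted string in the Pig script.

```python
import re

# The regex is kept here so no semicolon appears in a Pig string literal.
# [^;\s]+ stops at the first semicolon or whitespace; this is an assumption
# about the desired value (the pattern in the original question used \S+).
_TEST_RE = re.compile(r"test=([^;\s]+)")

def extract_test(s):
    """Return the value of the 'test' key, or None when absent or s is None."""
    if s is None:
        return None
    m = _TEST_RE.search(s)
    return m.group(1) if m else None
```

Under Pig, a function like this would be registered with something along the lines of REGISTER 'udfs.py' USING jython AS udfs; (the file name udfs.py is hypothetical) and given an output schema through the @outputSchema decorator that Pig's Jython support provides.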



Re: Pig on EMR: how to include semicolon in regex argument of EXTRACT function

Posted by MARCOS MEDRADO RUBINELLI <ma...@buscapecompany.com>.
Dylan,

It seems my first message fell through a crack, so I apologize if you
receive it twice, but: yes, it is a known issue, and there isn't a stable
version with the fix yet. I see two ways to work around it:

1. write a UDF that encapsulates the regex

2. load the regex from a file

I actually tested number 2. I ran it on 0.10.0, but it should work on a 
recent version of EMR too:

$ echo "test=(\\S+);?" > testregex.txt
$ hadoop fs -put testregex.txt /tmp

B = LOAD '/tmp/testregex.txt' as (regex :chararray);

blah =
       FOREACH
         data
       GENERATE
         FLATTEN (
           REGEX_EXTRACT (
             str_of_interest, B.regex, 1
           )
         )
         AS (
           test: chararray
         )
       ;
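
One detail worth spelling out: the file has to hold the pattern with a single backslash, because it is read as data and should not go through the unescaping that a quoted string in a Pig script gets. The double quotes in the echo command above do that collapsing in the shell. A sketch that writes the same file from Python without relying on shell quoting (Python is an assumption here, and the temporary directory is arbitrary):

```python
import os
import tempfile

# Raw string: exactly the characters the regex engine should receive.
REGEX = r"test=(\S+);?"

def write_regex_file(directory):
    """Write the pattern as a single line and return the file's path."""
    path = os.path.join(directory, "testregex.txt")
    with open(path, "w") as f:
        f.write(REGEX + "\n")
    return path

path = write_regex_file(tempfile.mkdtemp())
with open(path) as f:
    print(repr(f.read()))
```

The resulting file would then be uploaded with hadoop fs -put as shown above.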

Cheers,
Marcos
