Posted to oro-user@jakarta.apache.org by "Daniel F. Savarese" <df...@savarese.org> on 2001/07/20 14:45:26 UTC

Re: Perl5 vs Awk: what's the problem in this regex?

In message <80...@smtp_gateway.bankofscotland.co.uk>,
peter_shillan@firstbs.com writes:
...
>((\w+\.\w+|\w+)+)\@((\w+\.\w+|\w+)+)
>
>I tried this with Perl5 matching on the e-mail address:
>
>whats.going.on@someplace123.co.uk
>
>and it failed. When I tried Awk matching however, it succeeds.

You can see what's going on by running it through the demo applet.  Perl
expressions match leftmost and greedily, and stop at the first match they
find rather than looking for the longest one:

Match 1: whats.going.on@someplace123.co
    Subgroups:
    1: whats.going.on
    2: g.on
    3: someplace123.co
    4: someplace123.co
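
If the demo applet isn't handy, the same result can be reproduced with
any Perl-style backtracking engine.  The sketch below uses Python's re
module as a stand-in.  That is an assumption on my part: it is not ORO's
Perl5 matcher, but it follows the same leftmost, greedy rules for this
pattern.

```python
import re

# The expression and input from the original question.
PATTERN = r'((\w+\.\w+|\w+)+)\@((\w+\.\w+|\w+)+)'
ADDRESS = 'whats.going.on@someplace123.co.uk'

# search() looks for a match anywhere in the input, like a "contains" test.
match = re.search(PATTERN, ADDRESS)

print('Match 1:', match.group(0))
print('    Subgroups:')
for index, text in enumerate(match.groups(), start=1):
    print(f'    {index}: {text}')
```

Note that the match stops at .co and leaves .uk unmatched, so a test that
requires the whole input to match will see something different from a
test that only asks whether the input contains a match.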

Because of the greedy leftmost matching, nothing forces the backtracking
necessary to match the rest of the domain part of your input.  If you add
a $ at the end of the expression and try the match, you'll find the
backtracking is forced and the expression matches.  The basic problem
with your expression (as far as Perl matching goes) is that each pass
through the group can take in at most one dot, so you're trying to match
a.b.c with a.b and b.c, which only works if the middle word can be split
between two passes.  For example, if you add the $ at the end of your
expression but then try to match whats.going.on@1.2.3 you won't get a
match, because the single-character 2 can't be split.  Notice how the
second group matches g.on: the g had to be carved off of going to make
that work.  For the subset of email addresses that you're trying to
match, a better expression would be:

\w+(\.\w+)*\@\w+(\.\w+)*
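
Here is a rough way to check both points: the original expression with a
$ appended, and the expression above.  Again this uses Python's re module
as a stand-in for a Perl5 matcher (an assumption), and matched_text is
just a helper name made up for the example.

```python
import re

# Original expression with $ appended to force backtracking to the end.
ORIGINAL_ANCHORED = r'((\w+\.\w+|\w+)+)\@((\w+\.\w+|\w+)+)$'
# The simpler expression suggested above.
SUGGESTED = r'\w+(\.\w+)*\@\w+(\.\w+)*'

INPUTS = ['whats.going.on@someplace123.co.uk', 'whats.going.on@1.2.3']


def matched_text(pattern, text):
    """Return the text matched anywhere in the input, or None (hypothetical helper)."""
    found = re.search(pattern, text)
    return found.group(0) if found else None


for text in INPUTS:
    print(text)
    print('    original + $:', matched_text(ORIGINAL_ANCHORED, text))
    print('    suggested   :', matched_text(SUGGESTED, text))
```

The anchored original matches the first input in full but finds nothing
in the second, while the suggested expression matches both in full.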

AWK doesn't have the same issues because AWK expressions match the
longest match, period.  As far as the implementation goes, the AWK
expressions are reduced to a DFA, which means no backtracking issues.
That doesn't mean you can't implement AWK expressions with an NFA.
At any rate, the rules for matching AWK expressions and Perl expressions
are different, which is why there's a discrepancy.
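
To make the difference in rules concrete, here is a small sketch that
emulates the longest-match rule by brute force on top of a backtracking
engine.  It is only an illustration under my own assumptions: it uses
Python's re module, longest_match is a made-up helper, it ignores empty
matches, and it is not how a DFA-based AWK matcher is actually built.

```python
import re

PATTERN = r'((\w+\.\w+|\w+)+)\@((\w+\.\w+|\w+)+)'
ADDRESS = 'whats.going.on@someplace123.co.uk'


def first_match(pattern, text):
    """Perl-style rule: leftmost start, first match the engine finds."""
    found = re.search(pattern, text)
    return found.group(0) if found else None


def longest_match(pattern, text):
    """AWK-style rule, emulated: leftmost start, then the longest end.

    Tries every end position from longest to shortest and asks whether
    the pattern can match exactly that slice of the input.
    """
    compiled = re.compile(pattern)
    for start in range(len(text)):
        for end in range(len(text), start, -1):
            if compiled.fullmatch(text, start, end):
                return text[start:end]
    return None


perl_style = first_match(PATTERN, ADDRESS)
awk_style = longest_match(PATTERN, ADDRESS)
print('first match  :', perl_style)
print('longest match:', awk_style)

# A smaller case showing the same discrepancy.
print(first_match(r'a|ab', 'ab'), longest_match(r'a|ab', 'ab'))
```

Under the first rule the match ends at .co, and under the longest-match
rule it covers the whole address, which is the discrepancy described
above.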

daniel