Posted to dev@pig.apache.org by Carlos Balduz Bernal <cb...@gmail.com> on 2015/04/09 11:47:17 UTC

REGEX_EXTRACT unclear documentation

Hello everyone,

I am trying to use the REGEX_EXTRACT UDF to keep only the valid emails
in an alias. According to the official documentation
(http://pig.apache.org/docs/r0.14.0/func.html#regex-extract), the third
parameter must be the position, which is a 1-based parameter.

However, passing 1 returns null even when the email matches the regex,
because the condition in the UDF is:

if ((!mUseMatches && m.find()) || (mUseMatches && m.matches()))
{
    if (m.groupCount() >= mIndex)
    {
        return m.group(mIndex);
    }
}

In this case the email matches the pattern as a whole and my regex has
no capturing groups, so Matcher's groupCount() returns 0 and the check
groupCount() >= mIndex fails for an index of 1. Group zero is not
counted:
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#groupCount%28%29

"Group zero denotes the entire pattern by convention. It is not
included in this count."
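
The behaviour can be reproduced outside Pig with plain java.util.regex.
The sketch below is only an illustration: the email pattern and the
extract() helper are hypothetical, and extract() merely mimics the find()
branch of the guard quoted above; it is not the actual UDF code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupCountDemo {
    // Hypothetical helper that mimics the guard quoted above
    // (only the find() branch, i.e. mUseMatches == false).
    static String extract(String input, String regex, int index) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.find() && m.groupCount() >= index) {
            return m.group(index);
        }
        return null;
    }

    public static void main(String[] args) {
        String email = "user@example.com";
        // Hypothetical, deliberately simplified patterns.
        String noGroups = "[\\w.]+@[\\w.]+";   // no capturing groups: groupCount() == 0
        String oneGroup = "([\\w.]+@[\\w.]+)"; // one capturing group: groupCount() == 1

        System.out.println(extract(email, noGroups, 1)); // null
        System.out.println(extract(email, noGroups, 0)); // user@example.com
        System.out.println(extract(email, oneGroup, 1)); // user@example.com
    }
}
```

Under these assumptions, wrapping the whole pattern in parentheses makes
index 1 behave the way the documentation describes.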

After passing 0 as the last parameter instead of 1, the problem was
solved. But the documentation asks for a 1-based parameter, which is
not 100% accurate, since 0 is also accepted and returns the whole
match. Or perhaps there is a much simpler way to discard elements which
do not match a regex, in which case I would appreciate it if someone
could tell me. I know I could use REGEX_EXTRACT_ALL, but I do not want
a tuple.


Thanks,

Carlos Balduz.