Posted to regexp-user@jakarta.apache.org by Daniel Serodio <da...@ibnetwork.com.br> on 2001/03/30 21:37:19 UTC
POSIX character classes
Hi! I've been using the Regexp package, and while it's great, I think its syntax
for character classes is non-standard. I've never seen the POSIX specification,
but most tools implement it differently. Here's the issue:
In POSIX, the syntax for a 'named' character class is, e.g. '[:alpha:]', but
only if it comes inside brackets (that is, inside a character class). Thus, to
match a single alphabetic char, you use '[[:alpha:]]'. To illustrate, this piece
is from GNU grep(1) man page:
"(Note that the brackets in these class
names are part of the symbolic names, and must be included
in addition to the brackets delimiting the bracket list.)"
In Regexp, the syntax '[[:alpha:]]' is not accepted:
org.apache.regexp.RESyntaxException
Syntax error: Mismatched class
I must use '[:alpha:]' instead. This way, if I want to match an alphabetic char
or a space, I must use '([:alpha:]| )' instead of the usual '[[:alpha:] ]'
(which is supposed to be faster, BTW).
Can someone explain this strange behaviour? Bug?
Thanks,
Daniel Serodio
Re: POSIX character classes
Posted by Michael McCallum <gh...@xtra.co.nz>.
On Friday 30 March 2001 19:37, Daniel Serodio wrote:
} Hi! I've been using the Regexp package, and while it's great, I think its
} syntax for character classes is non-standard. I've never seen the POSIX
} specification, but most tools implement it differently. Here's the issue:
}
} In POSIX, the syntax for a 'named' character class is, e.g. '[:alpha:]',
} but only if it comes inside brackets (that is, inside a character class).
} Thus, to match a single alphabetic char, you use '[[:alpha:]]'. To
} illustrate, this piece is from GNU grep(1) man page:
}
} "(Note that the brackets in these class
} names are part of the symbolic names, and must be included
} in addition to the brackets delimiting the bracket list.)"
}
} In Regexp, the syntax '[[:alpha:]]' is not accepted:
}
} org.apache.regexp.RESyntaxException
} Syntax error: Mismatched class
}
} I must use '[:alpha:]' instead. This way, if I want to match an alphabetic
} char or a space, I must use '([:alpha:]| )' instead of the usual
} '[[:alpha:] ]' (which is supposed to be faster, BTW).
}
} Can someone explain this strange behaviour? Bug?
}
} Thanks,
} Daniel Serodio
Unfortunately this is a feature that has not been implemented yet.
The POSIX character class parsing does not check for nested classes :(
Michael
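For readers comparing the two pattern shapes discussed in this thread, here is a minimal sketch of the bracket form versus the alternation form. It deliberately uses java.util.regex from the JDK rather than the Jakarta Regexp package, so it says nothing about how org.apache.regexp parses or performs; \p{Alpha} is java.util.regex's own spelling of the POSIX alpha class, and the class name PosixClassDemo and its helper methods are made up for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; uses java.util.regex, NOT org.apache.regexp.
final class PosixClassDemo {
    // Bracket form: a single character class holding a named class plus a
    // literal space, analogous to the POSIX '[[:alpha:] ]'.
    static final Pattern BRACKET = Pattern.compile("[\\p{Alpha} ]");

    // Alternation form, shaped like the '([:alpha:]| )' workaround.
    static final Pattern ALTERNATION = Pattern.compile("(\\p{Alpha}| )");

    // matches() requires the whole input to match, so each helper answers
    // "is this string exactly one alphabetic character or one space?"
    static boolean matchesBracket(String s) {
        return BRACKET.matcher(s).matches();
    }

    static boolean matchesAlternation(String s) {
        return ALTERNATION.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a", "Z", " ", "1", "ab"}) {
            System.out.println("'" + s + "' bracket=" + matchesBracket(s)
                    + " alternation=" + matchesAlternation(s));
        }
    }
}
```

Both patterns accept the same single-character inputs here; the bracket form simply keeps everything inside one character class instead of a group with two branches.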