Posted to regexp-user@jakarta.apache.org by Daniel Serodio <da...@ibnetwork.com.br> on 2001/03/30 21:37:19 UTC

POSIX character classes

	Hi! I've been using the Regexp package, and while it is great, I think its syntax
for character classes is non-standard. I've never seen the POSIX specification,
but most tools implement this syntax differently from Regexp. Here's the issue:

	In POSIX, the syntax for a 'named' character class is, e.g. '[:alpha:]', but
only if it comes inside brackets (that is, inside a character class). Thus, to
match a single alphabetic char, you use '[[:alpha:]]'. To illustrate, this piece
is from GNU grep(1) man page: 

"(Note that the brackets in these class
names are part of the symbolic names, and must be included
in addition to the brackets delimiting the bracket list.)"

	In Regexp, the syntax '[[:alpha:]]' is not accepted:

org.apache.regexp.RESyntaxException
Syntax error: Mismatched class

	I must use '[:alpha:]' instead. This way, if I want to match an alphabetic char
or a space, I must use '([:alpha:]| )' instead of the usual '[[:alpha:] ]'
(which is supposed to be faster, BTW).

	Can someone explain this strange behaviour? Bug?

Thanks,
Daniel Serodio

Re: POSIX character classes

Posted by Michael McCallum <gh...@xtra.co.nz>.
On Friday 30 March 2001 19:37, Daniel Serodio wrote:
} 	Hi! I've been using the Regexp package, and while it is great, I think its
} syntax for character classes is non-standard. I've never seen the POSIX
} specification, but most tools implement this syntax differently from Regexp.
} Here's the issue:
}
} 	In POSIX, the syntax for a 'named' character class is, e.g. '[:alpha:]',
} but only if it comes inside brackets (that is, inside a character class).
} Thus, to match a single alphabetic char, you use '[[:alpha:]]'. To
} illustrate, this piece is from GNU grep(1) man page:
}
} "(Note that the brackets in these class
} names are part of the symbolic names, and must be included
} in addition to the brackets delimiting the bracket list.)"
}
} 	In Regexp, the syntax '[[:alpha:]]' is not accepted:
}
} org.apache.regexp.RESyntaxException
} Syntax error: Mismatched class
}
} 	I must use '[:alpha:]' instead. This way, if I want to match an alphabetic
} char or a space, I must use '([:alpha:]| )' instead of the usual
} '[[:alpha:] ]' (which is supposed to be faster, BTW).
}
} 	Can someone explain this strange behaviour? Bug?
}
} Thanks,
} Daniel Serodio

Unfortunately, this is a feature that has not been implemented yet.
The POSIX character class parsing does not check for nested classes :(

Michael
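
Until nested classes are supported, the workaround described above can be automated for simple cases. The following is only a rough sketch: the class PosixClassWorkaround and the method toAlternation are hypothetical names, not part of Regexp. It assumes, based solely on what this thread reports, that Regexp accepts '[:alpha:]' outside brackets and alternation such as '([:alpha:]| )'. It handles only named classes plus letters, digits, spaces and underscores, and rejects everything else (negation, ranges, escapes) rather than guessing.

```java
import java.util.ArrayList;
import java.util.List;

class PosixClassWorkaround {

    // Hypothetical helper, not part of Regexp: rewrites a simple POSIX
    // bracket expression such as "[[:alpha:] ]" into the alternation
    // form "([:alpha:]| )" that this thread reports Regexp accepting.
    static String toAlternation(String bracketExpr) {
        if (bracketExpr.length() < 3
                || bracketExpr.charAt(0) != '['
                || !bracketExpr.endsWith("]")) {
            throw new IllegalArgumentException("not a bracket expression: " + bracketExpr);
        }
        // Strip the outer brackets that delimit the bracket list.
        String body = bracketExpr.substring(1, bracketExpr.length() - 1);
        List<String> alternatives = new ArrayList<>();
        int i = 0;
        while (i < body.length()) {
            if (body.startsWith("[:", i)) {
                // Named class: copy it through unchanged, brackets included.
                int end = body.indexOf(":]", i + 2);
                if (end < 0) {
                    throw new IllegalArgumentException("unterminated named class: " + bracketExpr);
                }
                alternatives.add(body.substring(i, end + 2));
                i = end + 2;
            } else {
                // Plain literal: only characters with no special meaning
                // are allowed, so no escaping rules need to be assumed.
                char c = body.charAt(i);
                if (!(Character.isLetterOrDigit(c) || c == ' ' || c == '_')) {
                    throw new IllegalArgumentException("unsupported character in this sketch: " + c);
                }
                alternatives.add(String.valueOf(c));
                i++;
            }
        }
        return "(" + String.join("|", alternatives) + ")";
    }

    public static void main(String[] args) {
        System.out.println(toAlternation("[[:alpha:] ]"));
    }
}
```

The resulting string would then be passed to Regexp in place of the original bracket expression. Note that the alternation form may well be slower than a real bracket expression, as Daniel points out, so this is a stopgap rather than a fix.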