Posted to j-users@xerces.apache.org by Jernej Tuljak <je...@gmail.com> on 2013/08/14 09:41:17 UTC

RegularExpression 'X' option oddity

Hi,

we're abusing org.apache.xerces.impl.xpath.regex.RegularExpression to
validate XSD-flavor regular expression strings and later match test
strings against them. It seemingly worked, until someone tried to use a
very specific regex.

Here's the code:

    import org.apache.xerces.impl.xpath.regex.RegularExpression;

    public class XercesRegexTest {

        public static void main(String[] args) {
            String regexString = "([a-zA-Z][^ ]*)";
            RegularExpression regex = new RegularExpression(regexString, "x");
            System.out.println(regex.toString());
        }

    }

The `x` option is supposed to make the regex engine conform to XSD regular
expressions. But if you run this code, you'll end up with

    Exception in thread "main" org.apache.xerces.impl.xpath.regex.ParseException: Unexpected end of the pattern in a character class.
        at org.apache.xerces.impl.xpath.regex.RegexParser.ex(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegexParser.parseCharacterClass(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegexParser.parseAtom(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegexParser.parseFactor(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegexParser.parseTerm(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegexParser.parseRegex(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegexParser.processParen(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegexParser.parseAtom(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegexParser.parseFactor(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegexParser.parseTerm(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegexParser.parseRegex(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegexParser.parse(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegularExpression.setPattern(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegularExpression.setPattern(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegularExpression.<init>(Unknown Source)
        at com.mgsoft.testing.regex.XercesRegexTest.main(XercesRegexTest.java:9)
    Java Result: 1

At first it looked like a bug in Xerces' regular expression parser, but after
re-reading the documentation of this class
(http://xerces.apache.org/xerces-j/apiDocs/org/apache/xerces/utils/regex/RegularExpression.html),
I found out that the `x` option should actually be `X` (upper case). Thing
is... it worked for countless other regular expressions. In fact, it is the
space in the character class that causes the problem; any other character
works fine. Removing the option and using the single-string constructor of
`RegularExpression` also works fine.

Does anyone know why this is happening? I realize that this class is
probably not intended for such usage, but since the spec we're implementing
uses XSD regular expressions, we tried to avoid reinventing the wheel
through reuse.

We are using xercesImpl.jar that is distributed with xalan-j 2.7.1.

Re: RegularExpression 'X' option oddity

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Hi Jernej,

Jernej Tuljak <je...@gmail.com> wrote on 08/16/2013 03:48:45 AM:

> We did not attempt to use a newer version of Xerces-J before, since
> we did not want to break the Xalan-J distribution we are using. Not
> sure if its license even supports doing that, and having two JARs of
> the same thing on the classpath didn't seem like a good idea.

Many users upgrade Xerces independently of Xalan. It's backwards 
compatible.

Just remove the old xercesImpl.jar and replace it with the new one. You 
don't need two jars on your classpath. Nothing in the license prevents you 
from doing that.

> Anyways, we can use RegularExpression in a way that suits our use
> case now. I only wanted to point out that it started behaving in an
> undefined way when an undefined option was used. I would have
> expected an unchecked exception to be thrown, or some other kind of
> warning, in such cases, even if this class is not supposed to be
> part of the public API.

I would never assume that for an internal class. I understand why people 
sometimes use them but they should have no expectations since they weren't 
intended for their use.

We try to warn people in the Javadoc. You'll see this on most classes in 
the 'impl' package:

"INTERNAL: Usage of this class is not supported. It may be altered or 
removed at any time."

If you must use internal classes, use them with caution. We guarantee 
nothing.
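
For readers who want to check strings against an XSD regular expression without touching the internal class, one possible route is the public JAXP validation API (javax.xml.validation), which applies the same XSD pattern rules through a schema. The sketch below is not something proposed in this thread. The class name XsdPatternCheck, its methods, and the throwaway element name `v` are all hypothetical, and it assumes the default JDK schema factory, which throws on schema and validation errors when no error handler is set.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper; none of these names come from Xerces or from this thread.
final class XsdPatternCheck {

    // Escape text so it can be embedded in an XML attribute value or element content.
    // Limitation of this sketch: tabs and line breaks are not preserved.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
                .replace("'", "&apos;").replace("\"", "&quot;");
    }

    // Build a one-element schema whose only constraint is the given pattern facet.
    // With no error handler set, newSchema() throws SAXException for an invalid schema,
    // which includes an invalid pattern value.
    static Schema compile(String xsdPattern) throws SAXException {
        String xsd =
              "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
            + "<xs:element name='v'><xs:simpleType>"
            + "<xs:restriction base='xs:string'>"
            + "<xs:pattern value='" + escape(xsdPattern) + "'/>"
            + "</xs:restriction></xs:simpleType></xs:element>"
            + "</xs:schema>";
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        return factory.newSchema(new StreamSource(new StringReader(xsd)));
    }

    // True if the string is accepted as an XSD pattern facet.
    static boolean isValidPattern(String xsdPattern) {
        try {
            compile(xsdPattern);
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    // True if the whole value matches the pattern.
    // Returns false both for a non-matching value and for an invalid pattern.
    static boolean matches(String xsdPattern, String value) {
        try {
            String doc = "<v>" + escape(value) + "</v>";
            compile(xsdPattern).newValidator()
                .validate(new StreamSource(new StringReader(doc)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String pattern = "([a-zA-Z][^ ]*)"; // the regex from the original post
        System.out.println(isValidPattern(pattern));
        System.out.println(matches(pattern, "abc"));
        System.out.println(matches(pattern, "a b"));
    }
}
```

Compiling a schema per call is slow, so real code would probably cache the Schema object for each pattern. The trade-off against RegularExpression is more ceremony in exchange for staying on a supported API.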

> Since (if I understood correctly) the example I posted works fine 
> for you, this issue might have been mended in the past, so this 
> thread became irrelevant as soon as you mentioned that.. :P

> Thanks, Jernej
> 

Michael Glavassevich
XML Technologies and WAS Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-users-help@xerces.apache.org


Re: RegularExpression 'X' option oddity

Posted by Jernej Tuljak <je...@gmail.com>.
We did not attempt to use a newer version of Xerces-J before, since we did
not want to break the Xalan-J distribution we are using. Not sure if its
license even supports doing that, and having two JARs of the same thing on
the classpath didn't seem like a good idea.

Anyways, we can use RegularExpression in a way that suits our use case now.
I only wanted to point out that it started behaving in an undefined way
when an undefined option was used. I would have expected an unchecked
exception to be thrown, or some other kind of warning, in such cases, even
if this class is not supposed to be part of the public API.

Since (if I understood correctly) the example I posted works fine for you,
this issue might have been mended in the past, so this thread became
irrelevant as soon as you mentioned that.. :P

Thanks, Jernej


Re: RegularExpression 'X' option oddity

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Hi Jernej,

Jernej Tuljak <je...@gmail.com> wrote on 08/14/2013 03:41:17 AM:

> Hi,
> 
> we're abusing org.apache.xerces.impl.xpath.regex.RegularExpression 

Yep. :-)

> to validate XSD-flavor regular expression strings and later match
> test strings against them. It seemingly worked, until someone tried
> to use a very specific regex.
> 
> Here's the code:
> 
>     import org.apache.xerces.impl.xpath.regex.RegularExpression;
> 
>     public class XercesRegexTest {
>         
>         public static void main(String[] args) {
>             String regexString = "([a-zA-Z][^ ]*)";
>             RegularExpression regex = new RegularExpression(regexString, "x");
>             System.out.println(regex.toString());
>         }
>         
>     }
> 
> The `x` option is supposed to make the regex engine conform to XSD 
> regular expressions.

Only 'X' does that. That is the only option which Xerces uses internally.

> But if you run this code, you'll end up with 
> 
>     Exception in thread "main" org.apache.xerces.impl.xpath.regex.ParseException: Unexpected end of the pattern in a character class.
>         at org.apache.xerces.impl.xpath.regex.RegexParser.ex(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegexParser.parseCharacterClass(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegexParser.parseAtom(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegexParser.parseFactor(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegexParser.parseTerm(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegexParser.parseRegex(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegexParser.processParen(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegexParser.parseAtom(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegexParser.parseFactor(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegexParser.parseTerm(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegexParser.parseRegex(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegexParser.parse(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegularExpression.setPattern(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegularExpression.setPattern(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegularExpression.<init>(Unknown Source)
>         at com.mgsoft.testing.regex.XercesRegexTest.main(XercesRegexTest.java:9)
>     Java Result: 1
> 
> It first looked like a bug in Xerces' regular expression parser, but
> after re-reading the documentation
> (http://xerces.apache.org/xerces-j/apiDocs/org/apache/xerces/utils/regex/RegularExpression.html)
> of this class, I found out that the `x` option should actually be `X`
> (upper case).

The docs for that class probably haven't changed much over the years, but
it's worth pointing out that that's the Xerces-J 1.x documentation, not
Xerces-J 2.x.

> Thing is...it worked for countless other regular 
> expressions. In fact it is that space that is causing problems, any 
> other char works fine. Also removing the option and using the single
> string constructor of `RegularExpression` works fine.

If you're not specifying 'X' then you're using a mode that isn't XSD and 
that we never use.

> Does anyone know why this is happening? I realize that this class is
> probably not intended for such usage, but since the spec we're
> implementing uses XSD regular expressions, we tried to avoid
> reinventing the wheel through reuse.

Works for me with the current code in SVN.

> We are using xercesImpl.jar that is distributed with xalan-j 2.7.1.

Whatever you got out of Xalan-J 2.7.1 would be very old now. Have you 
tried Xerces-J 2.11.0?

Thanks.

Michael Glavassevich
XML Technologies and WAS Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

