Posted to j-dev@xerces.apache.org by Aleksander Slominski <as...@cs.indiana.edu> on 2002/07/22 05:37:45 UTC
xmlpull api [Re: [Announce] The CyberNeko Tools for XNI 2002.07.17 Available]

Andy Clark wrote:

> Elliotte Rusty Harold wrote:
> >> the only thing that could possibly make XMLPULL API not
> >> 100% compatible with XML 1.0 is when PROCESS DOCDECL feature
> >> [...]
> >
> > That's a very big one. A parser should not be allowed to turn off
> > processing of the internal DTD subset at all. And to make not processing
> > it the default?! That's just wrong.
>
> Well, you gotta look at the intended purpose of these types
> of parsers. If I remember correctly, the XPP work was started
> because of SOAP which subsets XML syntax and doesn't allow a
> DOCTYPE line at all.

that means the current implementations were designed to concentrate
on size (like kXML2) or speed (like MXP1), but there can be many
other implementations ...

> > Worse yet, according to http://www.xmlpull.org/impls.shtml neither of
> > the existing implementations even allows you to set that feature to true.
>
> I think Aleksander has code to use Xerces2 as the driver
> for the push API. So, if used that way then it should be
> able to check the DTD just like Xerces. And when I finish
> my API for the CyberNeko tools, the default impl will be
> driven by Xerces so it should have no problem in that
> regard.

exactly - the API is one thing and the implementation is completely
another. as long as each implementation is correctly described, users
can make informed choices.

> > I've also heard it claimed recently that the parsers aren't doing all
> > the name character checking they're supposed to, though I haven't
>
> No wonder they're so fast. ;) This is one of the big
> checks that implementors would love to remove from their
> inner loop. Xerces, being fully compliant, can't do that
> and suffers some performance hits.

in MXP1 i use a lookup table for char values below 0x400
(LOOKUP_MAX) and if statements for the rest. i am putting
the relevant part of the code from MXP1 below and
welcome comments on it (especially if you find anything
wrong with the functions!).

> Just about any XML parser can be written to go fast if
> they don't do all of the work. For example, removing
> character checking, avoiding DTD parsing and processing,
> not implementing XML Schema, etc. But, depending on the
> situation, these are all perfectly acceptable choices.

well - i think that MXP1 does all of the XML parsing, and i am slowly
improving it to the level of a non-validating parser - the only
remaining incompatibilities i know about are DTD parsing and
XML 1.0 character set support (i am a bit hesitant about adding
it as i like XML 1.1 much more ...).

thanks,

alek

ps. here is a fragment of MXParser - please comment if you think
that i am missing something important when looking at what is
required in http://www.w3.org/TR/xml11/#sec2.3 (thanks in advance!)

    // chars below LOOKUP_MAX are classified with the lookup tables
    protected static final int LOOKUP_MAX = 0x400;
    protected static final char LOOKUP_MAX_CHAR = (char)LOOKUP_MAX;
    protected static boolean lookupNameStartChar[] = new boolean[ LOOKUP_MAX ];
    protected static boolean lookupNameChar[] = new boolean[ LOOKUP_MAX ];

    // mark ch as allowed inside a name
    private static final void setName(char ch)
    { lookupNameChar[ ch ] = true; }
    // mark ch as allowed at the start of a name (and so also inside one)
    private static final void setNameStart(char ch)
    { lookupNameStartChar[ ch ] = true; setName(ch); }

    static {
        setNameStart(':');
        for (char ch = 'A'; ch <= 'Z'; ++ch) setNameStart(ch);
        setNameStart('_');
        for (char ch = 'a'; ch <= 'z'; ++ch) setNameStart(ch);
        for (char ch = '\u00c0'; ch <= '\u02FF'; ++ch) setNameStart(ch);
        for (char ch = '\u0370'; ch <= '\u037d'; ++ch) setNameStart(ch);
        for (char ch = '\u037f'; ch < '\u0400'; ++ch) setNameStart(ch);

        setName('-');
        setName('.');
        for (char ch = '0'; ch <= '9'; ++ch) setName(ch);
        setName('\u00b7');
        for (char ch = '\u0300'; ch <= '\u036f'; ++ch) setName(ch);
    }

    // table lookup below LOOKUP_MAX, explicit range checks for the rest
    private final static boolean isNameStartChar(char ch) {
        return (ch < LOOKUP_MAX_CHAR && lookupNameStartChar[ ch ])
            || (ch >= LOOKUP_MAX_CHAR && ch <= '\u2027')
            || (ch >= '\u202A' &&  ch <= '\u218F')
            || (ch >= '\u2800' &&  ch <= '\uFFEF')
            ;
    }

    // same high ranges as isNameStartChar; only the lookup table differs
    private final static boolean isNameChar(char ch) {
        return (ch < LOOKUP_MAX_CHAR && lookupNameChar[ ch ])
            || (ch >= LOOKUP_MAX_CHAR && ch <= '\u2027')
            || (ch >= '\u202A' &&  ch <= '\u218F')
            || (ch >= '\u2800' &&  ch <= '\uFFEF')
            ;
    }

    // XML white space: space, line feed, carriage return or tab
    protected boolean isS(char ch) {
        return (ch == ' ' || ch == '\n' || ch == '\r' || ch == '\t');
    }
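
to make it easier to comment on, here is a minimal self-contained sketch
of how the two predicates above could be combined to check a whole name
(a NameStartChar followed by any number of NameChars). the class name
NameCheckSketch and the helpers markStart, markName, inHighRanges and
isName are made up for this illustration and are not part of MXParser;
the ranges are just copied from the fragment above and are not checked
independently against the spec.

```java
public class NameCheckSketch {
    // Characters below this bound are classified via lookup tables.
    private static final int LOOKUP_MAX = 0x400;
    private static final boolean[] NAME_START = new boolean[LOOKUP_MAX];
    private static final boolean[] NAME = new boolean[LOOKUP_MAX];

    // Mark an inclusive range as valid at the start of (and inside) a name.
    private static void markStart(int from, int to) {
        for (int c = from; c <= to; c++) { NAME_START[c] = true; NAME[c] = true; }
    }

    // Mark an inclusive range as valid inside a name only.
    private static void markName(int from, int to) {
        for (int c = from; c <= to; c++) NAME[c] = true;
    }

    static {
        // Same ranges as the MXParser fragment in this message.
        markStart(':', ':');
        markStart('A', 'Z');
        markStart('_', '_');
        markStart('a', 'z');
        markStart(0x00C0, 0x02FF);
        markStart(0x0370, 0x037D);
        markStart(0x037F, 0x03FF);
        markName('-', '-');
        markName('.', '.');
        markName('0', '9');
        markName(0x00B7, 0x00B7);
        markName(0x0300, 0x036F);
    }

    // Range checks for characters at or above LOOKUP_MAX.
    private static boolean inHighRanges(char ch) {
        return (ch >= LOOKUP_MAX && ch <= 0x2027)
            || (ch >= 0x202A && ch <= 0x218F)
            || (ch >= 0x2800 && ch <= 0xFFEF);
    }

    public static boolean isNameStartChar(char ch) {
        return ch < LOOKUP_MAX ? NAME_START[ch] : inHighRanges(ch);
    }

    public static boolean isNameChar(char ch) {
        return ch < LOOKUP_MAX ? NAME[ch] : inHighRanges(ch);
    }

    // Hypothetical helper: true if s is a NameStartChar followed by NameChars.
    public static boolean isName(String s) {
        if (s == null || s.isEmpty() || !isNameStartChar(s.charAt(0))) return false;
        for (int i = 1; i < s.length(); i++) {
            if (!isNameChar(s.charAt(i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isName("soap:Envelope")); // true
        System.out.println(isName("-bad"));          // false: '-' cannot start a name
        System.out.println(isName("a b"));           // false: space is not a name char
    }
}
```

the ternary makes the split explicit: one array read for chars below
0x400 and at most three range comparisons for everything else.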


