Posted to j-dev@xerces.apache.org by ne...@ca.ibm.com on 2002/01/08 21:25:35 UTC

[design]: grammar caching

Hi folks,

Since Andy quite rightly points out the need to make XNI as stable as
possible before we roll out a production version of Xerces2, I thought
it might be a good time to bring up once again an issue that took a
lot of bandwidth in the summer but was never resolved--what should
grammar caching in Xerces2 look like?

Right now, we've got the rather ironic situation where you can do a
sort of grammar caching with Xerces2, but you've got to use DOM L3's
AS interfaces to do it.  The irony stems from the fact that if you
want performance badly enough to need grammar caching, you'd probably
prefer an event-based API for parsing to the DOM...

So, here goes.  I should note that this is in large measure based on
what Sandy proposed way back when (hope you don't mind too much being
associated with this post Sandy!  :-))

I should first note that this is a pretty heavy proposal.  It's
designed with a view to supporting the needs of the most advanced
applications that require the most flexibility.  I think this makes
sense because it's going to be advanced applications that need
grammar caching to begin with, and in my view crafting a default
implementation for this proposal should be only somewhat more
difficult than developing one for a more scaled-down version.

Since grammar caches (or pools) outlive parser instances, the grammar
pool must be owned by the application.  It's for this reason that
retaining a GrammarResolver (or a GrammarBucket perhaps?) seems to
make sense--a GrammarResolver is owned by a
particular parser instance, and contains whatever subset of the
grammars in the grammar pool that the application, for whatever
reasons of its own, wishes to make available.  This kind of
architecture also allows the application to manage difficult
situations such as if the parser encounters an imported schema that
does not correspond to a schema with the same namespace that the
application already knows about.  Other sorts of use-cases might be a
webserver which deals with different kinds of XML documents, and, once
the type of a document is identified, wishes to enforce validation of
that document with only a subset of the grammars that it has in its
GrammarPool.  With the rise of ever more complex and integrated XML
and web services technologies, it seems to me that this kind of
functionality will have to be made available somewhere--and the
parser level looks to me to be the appropriate place.  After all,
isn't Xerces2 about flexibility, modularity, and giving tools beyond
what are in standard APIs to applications that need them?

Conceptually, the flow would look something like this:  An application
instantiates a certain parser Configuration, and associates its
GrammarPool implementation as a property of that Configuration.  When a
Validator object in the configuration begins to validate, it requests
that a bucket of grammars of the appropriate kind be filled by the
GrammarPool.  As it parses, the Validator takes grammars from the
bucket if it can, then gives the GrammarPool a chance to provide the
grammar if it wishes, then the XMLEntityResolver gets a chance to
resolve the request to a file of the appropriate type.  At both these
stages, as much information is provided to the GrammarPool and the
XMLEntityResolver as is likely to prove at all helpful in identifying
the appropriate resource.  At the conclusion of parsing, the validator
will make available the contents of its bucket to the GrammarPool,
which can then determine whether to incorporate the grammars or ignore
them.  I tend to be of the view that parsing should be aborted with a
fatalError if something goes wrong in the grammar retrieval
process--e.g., if the GrammarPool gives a Validator a grammar of the
wrong type--but this is certainly a thorny and multifaceted question.
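
The lookup order described above can be sketched roughly as follows.
This is only an illustration under simplifying assumptions: Grammar,
Pool, Resolver, MapPool and Validator are hypothetical stand-ins (they
are not the XNI interfaces proposed in this message), grammars are
identified by a plain string key, and an IllegalStateException stands
in for aborting the parse with a fatalError.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: every type here is a hypothetical stand-in,
// not one of the XNI interfaces proposed in this message.
class GrammarFlowSketch {

    // A grammar reduced to its type ("schema", "dtd") and a lookup key.
    static final class Grammar {
        final String type;
        final String key;
        Grammar(String type, String key) { this.type = type; this.key = key; }
    }

    // Application-owned cache; outlives any one parser instance.
    interface Pool {
        List<Grammar> initialGrammars(String type);
        Grammar retrieve(String type, String key);      // null if not cached
        void cache(String type, List<Grammar> grammars);
    }

    // Last resort: locate and load the grammar from an external resource.
    interface Resolver {
        Grammar resolve(String type, String key);       // null if not found
    }

    // A lazy pool: hands out nothing up front, answers requests on demand.
    static final class MapPool implements Pool {
        private final Map<String, Grammar> cached = new HashMap<>();
        public List<Grammar> initialGrammars(String type) {
            return Collections.emptyList();
        }
        public Grammar retrieve(String type, String key) {
            return cached.get(type + "|" + key);
        }
        public void cache(String type, List<Grammar> grammars) {
            for (Grammar g : grammars) cached.put(type + "|" + g.key, g);
        }
    }

    // Per-parse validator with its own bucket of grammars.
    static final class Validator {
        private final String type;
        private final Pool pool;
        private final Resolver resolver;
        private final Map<String, Grammar> bucket = new HashMap<>();
        final List<String> trace = new ArrayList<>();   // records who answered

        Validator(String type, Pool pool, Resolver resolver) {
            this.type = type; this.pool = pool; this.resolver = resolver;
            // Before validation starts, the pool fills the bucket.
            for (Grammar g : pool.initialGrammars(type)) bucket.put(g.key, g);
        }

        // Bucket first, then the pool, then the resolver.
        Grammar lookup(String key) {
            Grammar g = bucket.get(key);
            if (g != null) { trace.add("bucket:" + key); return g; }
            String source = "pool";
            g = pool.retrieve(type, key);
            if (g == null) { source = "resolver"; g = resolver.resolve(type, key); }
            if (g == null) return null;
            if (!type.equals(g.type)) {
                // Stands in for aborting the parse with a fatalError.
                throw new IllegalStateException("wrong grammar type: " + g.type);
            }
            bucket.put(key, g);
            trace.add(source + ":" + key);
            return g;
        }

        // At the end of parsing, offer the bucket back to the pool.
        void finish() { pool.cache(type, new ArrayList<>(bucket.values())); }
    }
}
```

In this sketch the pool decides in cache() what to keep; MapPool keeps
everything, but a stricter application could just as well ignore the
offer.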

So, in addition to the interfaces given below, I'd propose to rename
XSGrammarResolver to XSGrammarBucket (sorry about the cheesy name...
:-)).  The same fate should probably befall the GrammarPool class that
XMLDTDValidator makes reference to.

public interface XMLGrammarPool {

    // we are trying to make this XMLGrammarPool work for all kinds of
    // grammars, so we have a parameter "grammarType" for each of the
    // methods.  It could be "schema", "dtd", etc., or it could be recast
    // into an integer.

    // retrieve the initial known set of grammars. this method is
    // called by a validator before the validation starts. the application
    // can provide an initial set of grammars available to the current
    // validation attempt.
    public Grammar[] retrieveInitialGrammarSet(String grammarType);

    // return the final set of grammars that the validator ended up
    // with.
    // This method is called after the
    // validation finishes. The application may then choose to cache some
    // of the returned grammars.
    public void cacheGrammars(String grammarType, Grammar[] grammars);

    // This method requests that the application retrieve a grammar
    // corresponding to the given GrammarDescription from its cache.
    // If it cannot do so it must return null; the parser will then
    // call the EntityResolver.  An application must not call its
    // EntityResolver itself from this method.
    public Grammar retrieveGrammar(String grammarType,
                                   XMLGrammarDescription desc);

} // XMLGrammarPool


public interface XMLGrammarDescription {
    public String getPublicID();
    public String getSystemID();
    public String getBaseURI();
} // XMLGrammarDescription

public interface DTDDescription extends XMLGrammarDescription {
    // used to indicate whether it's an internal or external DTD
    public final static int INTERNAL_DTD = 0;
    public final static int EXTERNAL_DTD = 1;

    public int getDTDType();

    // this returns the name of the root element if this is a DOCTYPE
    // entity, or the name of the entity if it's a standard entity
    // declaration.
    public String getEntityName();
}

public interface XSDDescription extends XMLGrammarDescription {
    // used to indicate what triggered the call
    // we don't include xsi:schemaLocation/noNamespaceSchemaLocation
    // because we'll defer the loading of schema documents until
    // a component from that namespace is referenced from the instance
    public final static int CONTEXT_INCLUDE   = 0;
    public final static int CONTEXT_REDEFINE  = 1;
    public final static int CONTEXT_IMPORT    = 2;
    public final static int CONTEXT_ELEMENT   = 3;
    public final static int CONTEXT_ATTRIBUTE = 4;
    public final static int CONTEXT_XSITYPE   = 5;

    public int getContextType();

    // for include and redefine, the namespace will be the target
    // namespace of the enclosing document. (or empty string?)
    public String getTargetNamespace();

    // for import and xsi:location attributes, it's possible to have
    // multiple hints for one namespace. so it's an array whose first
    // element will derive from the noNamespaceSchemaLocation or
    // schemaLocation property as the case of the targetNamespace may
    // be:
    public String[] getLocationHints();

    // If it's triggered by the document, the name of the
    // triggering component: element, attribute or xsi:type
    public QName getTriggeringComponent();

    // More information about "other location hint":
    // everything about the enclosing element
    public QName getEnclosingElementName();
    public XMLAttributes getAttributes();
}

/**
 * This interface is used to resolve external parsed entities. The
 * application can register an object that implements this interface
 * with the parser configuration in order to intercept entities and
 * resolve them explicitly. If the registered entity resolver cannot
 * resolve the entity, it should return <code>null</code> so that the
 * parser will try to resolve the entity using a default mechanism.
 *
 * @see XMLParserConfiguration
 *
 * @author Andy Clark, IBM
 *
 * @version $Id: XMLEntityResolver.java,v 1.2 2001/08/23 00:35:37 lehors Exp $
 */
public interface XMLEntityResolver {

    //
    // XMLEntityResolver methods
    //

    /**
     * Resolves an external parsed entity. If the entity cannot be
     * resolved, this method should return null.
     *
     * @param desc contains a description of the type of entity
     *      (grammar, abstract schema) being sought.
     * @throws XNIException Thrown on general error.
     * @throws IOException  Thrown if resolved entity stream cannot be
     *                      opened or some other i/o error occurs.
     */
    public XMLInputSource resolveEntity(XMLGrammarDescription desc)
        throws XNIException, IOException;

} // interface XMLEntityResolver

Very much looking forward to some spirited, focused open-source
discussion!

Cheers,
Neil
Neil Graham
XML Parser Development
IBM Toronto Lab
Phone:  905-413-3519, T/L 969-3519
E-mail:  neilg@ca.ibm.com


---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-dev-help@xml.apache.org

Re: [design]: grammar caching

Posted by Andy Clark <an...@apache.org>.

"Theodore W. Leung" wrote:
> I assume that there is a default GrammarPool implementation per parser
> configuration.   How does an application store a grammar pool so that it
> can avoid re-creating one?  How can an application create a grammar pool
> that is locked to a particular set of grammars?  Is a grammar pool
> sharable among multiple parser instances?

There are all kinds of settings that we could have on the
grammar pool. And of course I would want to be able to share
the same grammar pool among multiple parser instances. What
these options are, exactly, would need to be worked out.
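
Purely as an illustration of the kind of settings under discussion,
here is one possible shape for a pool that can be shared between parser
instances and locked to a fixed set of grammars. It is a sketch under
stated assumptions, not a proposed implementation: the three interfaces
are re-declared from the proposal earlier in this thread so the example
compiles on its own, Grammar is a placeholder, and keying grammars by
Grammar.getSystemID(), the lock() method and the describe() helper are
all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class LockablePoolSketch {

    // Placeholder for the parser's grammar object. Keying by system ID
    // is an assumption made for this sketch.
    interface Grammar { String getSystemID(); }

    // Re-declared from the proposal earlier in this thread.
    interface XMLGrammarDescription {
        String getPublicID();
        String getSystemID();
        String getBaseURI();
    }

    interface XMLGrammarPool {
        Grammar[] retrieveInitialGrammarSet(String grammarType);
        void cacheGrammars(String grammarType, Grammar[] grammars);
        Grammar retrieveGrammar(String grammarType, XMLGrammarDescription desc);
    }

    // Concurrent maps let several parser instances share one pool;
    // once locked, grammars offered back by validators are ignored.
    static final class LockablePool implements XMLGrammarPool {
        private final Map<String, Map<String, Grammar>> byType =
            new ConcurrentHashMap<>();
        private volatile boolean locked;

        void lock() { locked = true; }   // hypothetical setting

        public Grammar[] retrieveInitialGrammarSet(String grammarType) {
            Map<String, Grammar> m = byType.get(grammarType);
            return m == null ? new Grammar[0] : m.values().toArray(new Grammar[0]);
        }

        public void cacheGrammars(String grammarType, Grammar[] grammars) {
            if (locked) return;
            Map<String, Grammar> m =
                byType.computeIfAbsent(grammarType, t -> new ConcurrentHashMap<>());
            for (Grammar g : grammars) {
                // Grammars without a system ID cannot be keyed; skip them.
                if (g.getSystemID() != null) m.putIfAbsent(g.getSystemID(), g);
            }
        }

        public Grammar retrieveGrammar(String grammarType,
                                       XMLGrammarDescription desc) {
            Map<String, Grammar> m = byType.get(grammarType);
            if (m == null || desc.getSystemID() == null) return null;
            return m.get(desc.getSystemID());   // null sends the parser on
        }
    }

    // Hypothetical helper: a description that carries only a system ID.
    static XMLGrammarDescription describe(String systemID) {
        return new XMLGrammarDescription() {
            public String getPublicID() { return null; }
            public String getSystemID() { return systemID; }
            public String getBaseURI() { return null; }
        };
    }
}
```

An application would populate such a pool once, call lock(), and then
hand the same instance to every parser configuration it creates.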

-- 
Andy Clark * andyc@apache.org

Re: [design]: grammar caching

Posted by "Theodore W. Leung" <tw...@sauria.com>.

On Tue, 2002-01-08 at 12:25, neilg@ca.ibm.com wrote:
> 
 
> Conceptually, the flow would look something like this:  An application
> instantiates a certain parser Configuration, and associates its
> GrammarPool implementation as a property of that Configuration.  When a
> Validator object in the configuration begins to validate, it requests
> that a bucket of grammars of the appropriate kind be filled by the
> GrammarPool.  As it parses, the Validator takes grammars from the
> bucket if it can, then gives the GrammarPool a chance to provide the
> grammar if it wishes, then the XMLEntityResolver gets a chance to
> resolve the request to a file of the appropriate type.  At both these
> stages, as much information is provided to the GrammarPool and the
> XMLEntityResolver as is likely to prove at all helpful in identifying
> the appropriate resource.  At the conclusion of parsing, the validator
> will make available the contents of its bucket to the GrammarPool,
> which can then determine whether to incorporate the grammars or ignore
> them.  I tend to be of the view that parsing should be aborted with a
> fatalError if something goes wrong in the grammar retrieval
> process--e.g., if the GrammarPool gives a Validator a grammar of the
> wrong type--but this is certainly a thorny and multifaceted question.

I assume that there is a default GrammarPool implementation per parser
configuration.   How does an application store a grammar pool so that it
can avoid re-creating one?  How can an application create a grammar pool
that is locked to a particular set of grammars?  Is a grammar pool
sharable among multiple parser instances?

Ted


Re: [design]: grammar caching

Posted by Andy Clark <an...@apache.org>.

neilg@ca.ibm.com wrote:
> it might be a good time to bring up once again an issue that took a
> lot of bandwidth in the summer but was never resolved--what should
> grammar caching in Xerces2 look like?
>
> [SNIP]

Neil, thanks for such a detailed posting. I think you've captured
the problems with a decent solution, as well. Of course, as always, 
I have some tweaks to suggest. :)

You introduce an interface called XMLGrammarDescription which is
actually more like an XMLLocator (sans the row and column info).
And then you proceed to use it in a modified XMLEntityResolver
interface. If it's going to be a generic locator, of sorts, then
I think we should definitely change the name to indicate its
more general purpose. (As a side note, we should try to keep 
the method names in line with our existing naming scheme.)

Since we need a resolver for both entities and for locating
grammars, perhaps we should define a more generic location
interface in the core XNI that would then be used by the entity
resolver interface. It could then be extended for use by the
grammar resolution mechanism. But I need to ask: do you have a
solution for the "grammar resolver needs access to the entity
resolver" problem?
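
A rough sketch of the layering suggested here, assuming a generic
location interface that the entity resolver consumes and that grammar
lookup extends. All names (ResourceLocation, GrammarLocation,
EntityResolver, DEFAULT, grammar) are hypothetical and chosen only for
this example; they are not existing XNI types, and the resolver returns
a URI string rather than an input source to keep things short.

```java
import java.net.URI;

// Illustration only: all names here are hypothetical, not existing XNI types.
class LocatorSketch {

    // Generic description of "where a resource is", usable by any resolver.
    interface ResourceLocation {
        String getPublicId();
        String getSystemId();
        String getBaseSystemId();
    }

    // Grammar lookup extends the generic location with grammar-specific detail.
    interface GrammarLocation extends ResourceLocation {
        String getGrammarType();   // e.g. "schema" or "dtd"
    }

    // The entity resolver only ever sees the generic interface.
    interface EntityResolver {
        String resolve(ResourceLocation location);   // URI string, or null
    }

    // A default resolver that resolves the system ID against the base URI.
    static final EntityResolver DEFAULT = location -> {
        String systemId = location.getSystemId();
        if (systemId == null) return null;
        String base = location.getBaseSystemId();
        return base == null
            ? systemId
            : URI.create(base).resolve(systemId).toString();
    };

    // Convenience factory for a grammar location with no public ID.
    static GrammarLocation grammar(String type, String systemId, String base) {
        return new GrammarLocation() {
            public String getPublicId() { return null; }
            public String getSystemId() { return systemId; }
            public String getBaseSystemId() { return base; }
            public String getGrammarType() { return type; }
        };
    }
}
```

Because GrammarLocation is a ResourceLocation, a grammar request can be
passed to the same resolver that handles ordinary entities, while a
grammar-aware component can still read the extra detail.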

The DTDDescription interface has a method called "getEntityName".
The documentation states that this method can return the root
element name *or* the name of an entity declaration. If the
method serves this dual purpose then the name is wrong. I don't
have an alternate suggestion at the moment, but we should
change it.

The rest seems fine; I just have a few minor naming nits.

-- 
Andy Clark * andyc@apache.org
