Posted to j-users@xerces.apache.org by SB <st...@cyberspace.org> on 2002/06/13 12:55:54 UTC

Re: Writing back to the InputSource

The SAXWriter of 
http://www.dom4j.org/apidocs/org/dom4j/io/SAXWriter.html
has a few 'write' methods which can trigger SAX events.
While this can also be used for the purpose of adding
elements to the input, I would like to know how this
method compares with what you have written below.

thanks,
--st.

Quoting Andy Clark (andyc@apache.org):
> Scott Stirling wrote:
> > A basic feature of SAX is that it's for parsing and reading XML, not
> > modifying it.  You need DOM or JDOM for that.
> 
> This is not entirely true.
> 
> There is nothing preventing the application from chaining a
> number of SAX filters together where one (or more) of the
> filters generates new events. A simple example would be to
> just uppercase all of the element names. This is a straight
> augmentation of existing document events but filters can
> also be written to remove and insert elements and content as 
> they wish.
> 
> However, the specific problem that they need to solve is
> the following: 1) intercept some document events; 2) run
> some kind of process on designated content; 3) generate a
> new stream of XML characters; 4) have the parser scan those
> characters AS IF they appeared directly in the document.
> One application of this would be to replace encrypted content 
> with its decrypted equivalent. Other examples include an
> embedded script (e.g. JavaScript) and XInclude.
> 
> The NekoHTML parser has the ability to do this and so does
> the Xerces-J parser. However, it's a little more complicated
> in the XML parser because there is more to consider. For
> example: do you want this processing to occur before or
> after validation? Plus, it's important that it's done IN
> the document stream so that the newly scanned content gets
> the document's namespace binding information and entity
> declarations.
> 
> Here's (roughly) how to implement this with Xerces2:
> 
> 1) Write a custom XMLDocumentFilter class
> 
>    This class is responsible for detecting the document
>    events of importance, buffering the content, and then
>    processing that content to generate a stream of XML
>    characters.
> 
>    Once the stream of characters is generated, this needs
>    to be pushed onto the entity manager's stack. Once this
>    is done, then the filter can return and the parser will
>    scan the pushed content before continuing where it left
>    off. This is completely transparent to the parser
>    components and the application.
> 
> 2) Write a custom XMLParserConfiguration class
> 
>    Once you have your filter, you need to have a parser
>    configuration that includes that filter in the pipeline.
>    The location is up to the developer, of course.
> 
>    The easiest thing to do is extend the
>    StandardParserConfiguration class and override the
>    "configurePipeline" method to assemble the new pipeline
>    as needed. You'll find that this method is inherited
>    from the DTD config class.
> 
> 3) Instantiate a parser class using the custom configuration
> 
>    All of the "API generator" parser classes (e.g. DOM, SAX,
>    etc.) have constructors that take a parser configuration
>    as a parameter. So you can instantiate a SAXParser, for
>    example, that uses your custom configuration.
> 
>    This can also be done without code but it would affect
>    *all* of the Xerces2 parser instances within the same
>    JVM. There is a FAQ item in the docs that explains this[1].
> 
> Sounds easy, right? It is, in principle. However, you do
> need to know a few things about how the standard parser
> components work. You'll find a lot of useful information in
> the "XNI Manual" part of the documentation. And you can
> examine the code as well.
> 
> Basically, the parser configuration's pipeline is made of 
> a collection of standard components. For example, the document 
> scanner, DTD scanner, DTD validator, namespace binder, Schema 
> validator, etc. There are other components in the system as 
> well and the one of interest is the entity manager. This 
> component manages the stack of entities being scanned by the 
> document and DTD scanner components.
> 
> The configuration holds all of these components and is
> responsible for initializing their state and connecting
> the pipeline together before parsing. By implementing
> the XMLComponent interface and adding it to the list of
> configurable components by calling "addComponent" within
> the standard parser configuration, your custom component
> has access to all of the configuration's state -- this
> includes the configuration's components.
> 
> To access the entity manager from the component manager
> during the call to "reset(XMLComponentManager)", do the
> following:
> 
>   String ENTITY_MANAGER = "http://apache.org/xml/properties/internal/entity-manager";
>   // getProperty returns Object, so a cast is needed
>   XMLEntityManager entityManager = (XMLEntityManager) componentManager.getProperty(ENTITY_MANAGER);
> 
> So we store this away at initialization time -- we will
> then use it when we need to push a new entity onto the stack. 
> The "startEntity(String,XMLInputSource,boolean,boolean)" 
> method is the one we're interested in. The following code
> shows how to push some new XML content onto the stack to
> be scanned by the parser:
> 
>   String XML = "<foo> <bar/> </foo>";
>   Reader reader = new StringReader(XML);
>   XMLInputSource source = new XMLInputSource(null, null, null, reader, "UTF-8");
>   entityManager.startEntity("$fake$", source, false, false);
> 
> Some points to take note of:
> 
> 1) Be careful when you push the new entity 
> 
>    The various components within the pipeline contain
>    state and they may get confused if the entity is pushed
>    at the wrong time. An easy way to prevent this is to
>    simply push the new entity during an END ELEMENT event.
>    The typical algorithm would be like this:
> 
>    between selected startElement and endElement
>      buffer character content
>    on selected endElement
>      process buffer
>      push new entity
> 
> 2) Swallow the original document events
> 
>    For most applications of this kind, the new custom
>    filter should "swallow" the document events that
>    contain the content to be processed, because the
>    idea is that the processed result will *replace*
>    the original.
> 
> 3) The new entity name is propagated
> 
>    All entities that are started by the entity manager
>    have a name. In my sample code above, I invented an
>    illegal name called "$fake$" when I started the
>    entity. The start/end entity boundaries of this fake
>    entity should also be swallowed by your filter.
> 
>    Why does every entity have to have a name? It's just
>    the way that the standard components are implemented.
>    Deal with it. :)
> 
> Anyway, if you need to do this kind of thing, then it
> IS possible with a little bit of code. All of this should
> be covered in the "XNI Manual" which contains sample programs
> that show how to implement the different parts. So be sure
> to read that. You can also find similar information in 
> chapter 6 of the new Addison Wesley book called _XML and 
> Java: Developing Web Applications (2nd edition)_[2].
> 
> [1] http://xml.apache.org/xerces2-j/faq-xni.html#faq-2
> [2] http://www.aw.com/catalog/academic/product/1,4096,0201770040,00.html
> 
> -- 
> Andy Clark * andyc@apache.org
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
> For additional commands, e-mail: xerces-j-user-help@xml.apache.org
> 



Re: Writing back to the InputSource

Posted by Andy Clark <an...@apache.org>.
SB wrote:
> The SAXWriter of 
> http://www.dom4j.org/apidocs/org/dom4j/io/SAXWriter.html
> has a few 'write' methods which can trigger SAX events.
> While this can also be used for the purpose of adding
> elements to the input, I would like to know how this
> method compares with what you have written below.

There's a big difference. With the DOM4J SAXWriter, you
need to already have a document that's been parsed into
a tree.

If the user needs to only intercept a SAX event coming
from the parser and then *generate* new SAX events at
that point, then the user can implement this any way
that makes sense. This would include building a DOM4J
tree manually and then using the SAXWriter that comes
with DOM4J to generate the new SAX events to be sent
downstream.
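
As a rough illustration of that first case, here is a minimal sketch
that uses only the standard SAX classes shipped with the JDK (no DOM4J
and no Xerces internals). The element names "secret" and "plain" and
the string reversal standing in for the real processing are
hypothetical, and the sketch assumes "secret" elements are not nested.
The filter swallows the original events, buffers the text, and then
generates replacement events for the downstream handler:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class ReplaceFilterDemo {

    /** Swallows each <secret> element and emits <plain> with processed text. */
    static class ReplaceFilter extends XMLFilterImpl {
        private StringBuilder buffer; // non-null only while inside <secret>

        ReplaceFilter(XMLReader parent) { super(parent); }

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) throws SAXException {
            if (buffer != null) return;            // inside <secret>: swallow
            if ("secret".equals(qName)) {
                buffer = new StringBuilder();      // start buffering, swallow
            } else {
                super.startElement(uri, local, qName, atts);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            if (buffer != null) buffer.append(ch, start, length);
            else super.characters(ch, start, length);
        }

        @Override
        public void endElement(String uri, String local, String qName)
                throws SAXException {
            if (buffer == null) { super.endElement(uri, local, qName); return; }
            if (!"secret".equals(qName)) return;   // nested content: swallow
            // Hypothetical processing step: reverse the buffered text.
            char[] out = new StringBuilder(buffer).reverse().toString().toCharArray();
            buffer = null;
            // Generate the replacement events for the downstream handler.
            super.startElement("", "plain", "plain", new AttributesImpl());
            super.characters(out, 0, out.length);
            super.endElement("", "plain", "plain");
        }
    }

    /** Parses xml through the filter and records the events seen downstream. */
    static String run(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        ReplaceFilter filter = new ReplaceFilter(parser);
        StringBuilder seen = new StringBuilder();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                seen.append('<').append(q).append('>');
            }
            @Override
            public void characters(char[] ch, int s, int n) { seen.append(ch, s, n); }
            @Override
            public void endElement(String u, String l, String q) {
                seen.append("</").append(q).append('>');
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return seen.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run("<doc><secret>cba</secret><x>y</x></doc>"));
    }
}
```

Note that the replacement here is built as events, not as characters
for the parser to scan, so any markup in the processed text would
arrive downstream as plain character data.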

However, I was talking about the case where a new
*stream* of characters needs to be inserted into the
original stream to be parsed. In this case, when the
new stream is parsed, it needs to pick up the namespace
binding and entity declarations that were defined by
the source document.
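
To see why that context matters, here is a small sketch, again using
only the JDK's standard SAX API and not the Xerces entity manager. The
prefix "p" and the namespace URI "urn:example" are made up for the
illustration. A fragment that is well-formed inside the document,
where its prefix is bound, is rejected by a namespace-aware parser
when it is parsed on its own:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class NamespaceContextDemo {

    /** Returns true if a namespace-aware parser accepts the given XML text. */
    static boolean parses(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        try {
            // DefaultHandler rethrows fatal errors, so a failure surfaces here.
            factory.newSAXParser().parse(
                    new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // The prefix "p" is bound by the enclosing document element.
        System.out.println(parses("<doc xmlns:p='urn:example'><p:bar/></doc>"));
        // The same fragment alone has no binding for "p".
        System.out.println(parses("<p:bar/>"));
    }
}
```

A separately parsed fragment has no way to see the bindings of the
enclosing document, which is the gap that pushing the new characters
onto the parser's own entity stack is meant to close.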

-- 
Andy Clark * andyc@apache.org
