Posted to j-dev@xerces.apache.org by Andy Clark <an...@apache.org> on 2001/09/20 12:18:16 UTC

[Xerces2] InfoSet Augmentations in XNI

The time has come to think about how we are going to support
InfoSet augmentations in XNI. This issue was raised recently
because, as the new XML Schema validation code is written,
we are looking at the features that we are going to have to
support -- mainly, the post-Schema validation infoset (PSVI).

While I'm generally opposed to adding XML Schema specific
APIs to the XNI framework, this is not the only case where
additional information will need to be communicated through
the document pipeline. So I would like to design a generic
solution on top of which an implementation dependent way to
communicate PSVI items can be written.

My initial suggestion (in case you missed my earlier post
since it was a response to Lisa posting the current Schema
status), is to add an InfoSet parameter to each and every
callback in the XNI handlers. For example:

  void characters(XMLString text, InfoSet infoset);

This has several pros and cons associated with it, of which
I'll list a few in a second. But first, let me just start
off by saying that this is ONLY AN IDEA that I am throwing
out there in order to start some design discussion on this
topic. I'm NOT saying that it has to be done this way, 
merely that it could be done this way. Having said that,
let me enumerate a few pros and cons:

PRO: The associated infoset augmentations are carried
     along through the pipeline with the event.

This is an important point because if we were to develop
a mechanism that was "out-of-band" of the document flow,
then it's very easy for these parallel pipelines to get
out of sync. Since an XNI-compliant parser is modular
and the individual pieces in the pipeline can be arranged
and re-arranged, any specific stage can't know what will
happen once information leaves that stage. What if a
stage further down the pipeline *adds* or *removes*
XNI events (e.g. namespace binding adds
start/endPrefixMapping events)? So the easiest way to keep
the data and the augmentations in sync is to pass them together.
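
To make the synchronization point concrete, here is a minimal
sketch in Java. It assumes a made-up one-method handler
(EventHandler), a stage that injects an extra event the way
namespace binding does (InsertingStage), and a final stage that
records what it receives (Recorder). None of these names are part
of XNI; they only illustrate the idea.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical one-method handler; real XNI handlers have many callbacks.
interface EventHandler {
    void event(String name, String augmentation);
}

// A stage that injects an extra event before each incoming event, much as
// namespace binding adds prefix-mapping events. The original event is
// forwarded together with its augmentation.
class InsertingStage implements EventHandler {
    private final EventHandler next;

    InsertingStage(EventHandler next) {
        this.next = next;
    }

    public void event(String name, String augmentation) {
        next.event("prefixMapping", null); // injected event, no augmentation
        next.event(name, augmentation);    // original event, augmentation intact
    }
}

// Final stage: records each event with the augmentation it arrived with.
class Recorder implements EventHandler {
    final List<String> seen = new ArrayList<>();

    public void event(String name, String augmentation) {
        seen.add(name + "=" + augmentation);
    }
}

class InBandDemo {
    // Sends two augmented events through the inserting stage.
    static List<String> run() {
        Recorder recorder = new Recorder();
        EventHandler pipeline = new InsertingStage(recorder);
        pipeline.event("price", "type:decimal");
        pipeline.event("name", "type:string");
        return recorder.seen;
    }
}
```

Each augmentation arrives attached to its own event even though
the stage doubled the number of events. A side list indexed by
event position would instead have paired "type:decimal" with the
first event received, which is the injected prefixMapping event.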

PRO: The info in the InfoSet parameter is generic.

By making it generic, we *allow* for things like PSVI,
etc. to be implemented but *don't force* all XNI parsers
and parser configurations to handle any specific type of
augmentation. In addition, if the InfoSet parameter 
defined a kind of map where the augmentations were 
assigned keys, then new kinds of augmentations can be
used together without the need to change the interface 
or requiring a specific kind of configuration.
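
One way such a keyed map could look is sketched below. The InfoSet
interface, the BasicInfoSet class, and both key strings are
assumptions made for illustration, not an agreed design, and the
code uses modern Java generics for brevity.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface: a keyed bag of augmentations for one event.
interface InfoSet {
    Object put(String key, Object item);
    Object get(String key);
    Object remove(String key);
}

// Simple default implementation backed by a HashMap.
class BasicInfoSet implements InfoSet {
    private final Map<String, Object> items = new HashMap<>();

    public Object put(String key, Object item) {
        return items.put(key, item);
    }

    public Object get(String key) {
        return items.get(key);
    }

    public Object remove(String key) {
        return items.remove(key);
    }
}

class InfoSetDemo {
    // Made-up keys; two unrelated components each use their own.
    static final String TYPE_KEY = "example.psvi.type-name";
    static final String SOURCE_KEY = "example.xinclude.source";

    // Builds an infoset carrying two independent augmentations.
    static InfoSet sample() {
        InfoSet infoset = new BasicInfoSet();
        infoset.put(TYPE_KEY, "xsd:decimal");
        infoset.put(SOURCE_KEY, "chapter1.xml");
        return infoset;
    }
}
```

A component that only understands one key reads that entry and
ignores the rest, so adding a new kind of augmentation means
choosing a new key rather than changing the interface.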

CON: The associated infoset augmentations are carried
     along through the pipeline with the event.

By passing an additional parameter in every callback,
we will incur the overhead of an extra push and
pop of this parameter on the call stack at each stage
in the pipeline. Regardless of this penalty, though,
my initial opinion is that we have to take this hit
in order to make infoset augmentations work within 
the XNI framework.

Thoughts?

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-dev-help@xml.apache.org

Re: [Xerces2] InfoSet Augmentations in XNI

Posted by Andy Clark <an...@apache.org>.

Andy Clark wrote:
> Thoughts?

Yes, I have one but I must attribute it to Glenn because he
made the comment to me last week or so. It's taken me this
long to remember it long enough to post it to the mailing
list! Oh well.

To recap: my proposal was to add an infoset parameter to
the most useful methods in the document handler. These
being: startElement, characters, ignorableWhitespace, and
endElement.

Glenn asked: what about infoset additions associated with
attributes? Very good question.

At first I was thinking that they could just be part of
the keyed information in the infoset passed in the 
startElement method. But then Glenn countered by saying
that it would be difficult to keep these in sync as 
subsequent stages in the pipeline alter the document's
infoset and infoset augmentations.

So I would suggest adding an infoset to each attribute
in the XMLAttributes interface. Then the infoset passed
as a parameter to the startElement would pertain to the
element itself and the infoset pertaining to an
attribute would be directly associated to that attribute
in the XMLAttributes interface. Make sense? 
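
As a rough sketch of that suggestion, the class below keeps one
augmentation map per attribute, stored alongside the attribute's
name and value. AugmentedAttributes is a hypothetical stand-in
for XMLAttributes, and the "example.type" key is made up; the
real interface has a different and much larger set of methods.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for XMLAttributes: each attribute owns its infoset.
class AugmentedAttributes {
    private final List<String> names = new ArrayList<>();
    private final List<String> values = new ArrayList<>();
    private final List<Map<String, Object>> infosets = new ArrayList<>();

    // Adds an attribute with an empty infoset and returns its index.
    int add(String name, String value) {
        names.add(name);
        values.add(value);
        infosets.add(new HashMap<>());
        return names.size() - 1;
    }

    // Removes an attribute; its infoset is removed with it.
    void removeAt(int index) {
        names.remove(index);
        values.remove(index);
        infosets.remove(index);
    }

    int getLength() {
        return names.size();
    }

    String getName(int index) {
        return names.get(index);
    }

    String getValue(int index) {
        return values.get(index);
    }

    Map<String, Object> getInfoSet(int index) {
        return infosets.get(index);
    }
}

class AttributeInfoSetDemo {
    // Builds two attributes and augments only the second one.
    static AugmentedAttributes build() {
        AugmentedAttributes attrs = new AugmentedAttributes();
        attrs.add("xmlns:a", "http://example.org/a");
        int price = attrs.add("price", "9.99");
        attrs.getInfoSet(price).put("example.type", "xsd:decimal");
        return attrs;
    }
}
```

If a later stage calls removeAt(0) to drop the namespace
declaration, "price" moves to index 0 and its infoset moves with
it, so nothing has to be re-synchronized by hand.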

Does this sound like a reasonable solution to this
problem? Does anyone have any more thoughts regarding
the addition of infoset augmentations to the document
handler interface?

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org

Re: [Xerces2] InfoSet Augmentations in XNI

Posted by Andy Clark <an...@apache.org>.

I was a little worried about the performance impact of adding
an infoset parameter to XNI callbacks so I did a little test
to see how much of an impact the added parameter would have.

The results are at the bottom.

What I Did

First, I created an XMLInfoSet interface (which is basically
just a map) and wrote a simple default implementation. Then I
added an "infoset" parameter to the following methods in the
XMLDocumentHandler interface (for completeness, they would
also need to be added to the XMLDocumentFragmentHandler
interface):

  startElement(QName,XMLAttributes,XMLInfoSet)
  emptyElement(QName,XMLAttributes,XMLInfoSet)
  characters(XMLString,XMLInfoSet)
  ignorableWhitespace(XMLString,XMLInfoSet)
  endElement(QName,XMLInfoSet)

I didn't add the parameters to any of the other methods because
it didn't seem necessary. [Does anyone see off-hand any other
method that would need this parameter?]

Then I updated the Xerces2 implementation to pass this infoset
through the pipeline. This took a little bit of time but was
pretty straightforward.

How I Tested

The sax.Counter example in the Xerces2 package was used to get
the "performance" measurements in the next section. The jars
for Xerces 1.x, Xerces2, and Xerces2 (w/ infoset) were compared
for running time against three files of varying sizes and
content. The following table shows the basic nature of these
files:

  File           Elems  Attrs  Spaces    Chars  Tagginess
  personal.xml      20     11      89       76        44%
  simpsons.pgml   1117   4597    8481        0        54%
  ot.xml         71461      0   25170  3236745         7%

The "tagginess" of the file gives you an indication of the
ratio of textual content (characters, ignorable whitespace,
attribute values, etc.) vs. the characters required for the
markup (angle-brackets, element and attribute names, etc.).
It's a pretty useful number, which can be generated by the
sax.Counter program using the "-t" option.

The Results

The following table shows the time required for Xerces 1.x,
Xerces2, and Xerces2 (w/ infoset) to parse the sample files.
Each parser was "warmed up" by parsing the sample document
once and then the number shown is the average of the next
10 parses performed in a loop by using the "-x" option of
the sax.Counter example.

                 Time (ms)
  File           Xerces 1.x   Xerces2  Xerces2+
  personal.xml           39        29        29
  simpsons.pgml         396       243       240
  ot.xml               2179      2001      2071

"Xerces2+" is Xerces2 with infoset additions. I expected
the parsing time to increase from Xerces2 to Xerces2+
but "simpsons.pgml" shows a slight decrease. However,
I attribute this to the combination of the small number
of parses and normal variation in system performance.
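
For reference, the warm-up-then-average procedure described above
can be sketched as follows. This is not the sax.Counter code;
ParseTimer and its methods are hypothetical, and the task passed
in stands for one parse of a sample file. The percentChange
helper just expresses the differences in the table as
percentages: ot.xml goes from 2001 ms to 2071 ms, roughly a 3.5%
increase, while simpsons.pgml goes from 243 ms to 240 ms, roughly
a 1.2% decrease.

```java
// Hypothetical timing helper; not part of Xerces or its samples.
class ParseTimer {
    // Runs the task once untimed as a warm-up, then returns the mean
    // wall-clock time in milliseconds over the next `runs` executions.
    static double averageMillis(Runnable task, int runs) {
        task.run(); // warm-up pass, not measured
        long totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1_000_000.0 / runs;
    }

    // Relative change from `before` to `after`, in percent.
    static double percentChange(double before, double after) {
        return (after - before) / before * 100.0;
    }
}
```

With only ten timed runs per file, differences of a few percent
in either direction are hard to tell apart from noise.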

Conclusion

The addition of a single parameter to the most called
methods of the XNI document handler interface doesn't seem
to have adverse effects on performance. Obviously this is
dependent on the nature of the parsed documents and length
of the parsing pipeline. However, I think that this 
approach is reasonable.

What do other people think? Should we add the information?
And if I add the information, should it be on selective
handler methods or every method?

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org

Re: [Xerces2] InfoSet Augmentations in XNI

Posted by Andy Clark <an...@apache.org>.

I'm inlining a response from Elena that was off of the mailing
list. Since it pertains directly to this conversation, I hope
she doesn't mind me including it here. :)

Elena Litani wrote:
> I don't really like your idea. First of all, we can only gather PSVI
> information during validation process meaning that this information
> does not have to be carried over through the parser pipeline (Scanner
> does not have to implement this interface).

You're taking a PSVI-only standpoint. My goal is to provide 
a generic facility to propagate all kinds of infoset 
augmentations within XNI. Such a facility would let us 
implement PSVI and be able to accommodate any other infoset 
augmentations in the future without having to invent new 
interfaces or change the XNI framework. It's for this very 
reason that SAX2 included the generic feature and property 
mechanism.

But let's take PSVI as a use-case to see what is needed 
by this particular instance of an infoset augmentation so
that we can learn more about what a generic infoset
mechanism would need. In this scenario, the XML Schema
validator produces "extra" document information called the
Post-Schema Validation Infoset (PSVI) which contains
additional information beyond just the document's structure
and textual content (e.g. an attribute or element's data
type and value). So far so good. 

This PSVI information will be exposed to the application
in a variety of ways, the most obvious of which is the
upcoming DOM Level 3 Content Model API. But I'm sure that
there will be others.

Okay, so far we have the XML Schema validator producing
these information items and the DOM parser consuming
them to augment the DOM tree with the content model and 
datatype information. (Notice that I'm not making any 
effort here to define what this information is.) So this 
information must be communicated in some fashion -- the 
open question is how.

> Thus, I believe that for PSVI we need to introduce a new XNI interface
> that is in a way similar to the XMLDTDHandler interface: something like
> PSVIHandler.

I've stated before (and I'll state again ;) that I don't
think XNI should include PSVI interfaces because it's only 
specific to XML Schema. We went to a lot of effort to come 
up with a set of interfaces that are independent of specific
APIs and implementations. Adding PSVI specific interfaces 
would be taking a step backward, in my opinion.

Anytime that I think about what should be in XNI, I ask
myself the question: "does *everyone* need this?" or "is
this a fundamental part of XML?". Clearly, not everyone
needs XML Schema and all of its infoset augmentations. In 
addition, XML Schema sits on top of XML and therefore PSVI 
should sit on top of XNI and not be a direct part of it. 
I'm being firm on this because I feel that it's important.

But let me play devil's advocate: say we introduced an
infoset augmentation interface. To keep it generic and not
tied to XML Schema, let's call it "XMLInfoSetHandler". Now
this handler has some methods to allow individual stages
in the document pipeline to add or remove infoset items,
thus augmenting the information set of the document. How
does the application associate these information items
to the data that is actually going through the pipeline?

This is the crux of the problem; we need some way to
associate the infoset augmentations to the actual XML
data flowing through the pipeline. You can never assume
that there is a one-to-one correspondence between the
number of events emitted from a stage (e.g. from the XML
Schema validator) to the number of events received at
the end of the pipeline. Why? Because subsequent stages
may add or remove events in the process. (e.g. the
namespace binder or an XInclude processor.)

Continuing my advocacy of the devil, how would we solve
the problem where the actual data flowing through the
pipeline becomes out of sync with the associated infoset
items passed via the XMLInfoSetHandler interface? A
unique identifier would work. So let's follow that train
of thought as it slowly derails... ;) 

If we add a key to the infoset items passed to the infoset 
handler, then there must be a key passed along with the 
actual data. This implies that we would need to add a 
parameter to each and every callback in the existing XNI 
handlers so that components "upstream" from the infoset 
augmentation can correctly associate the information. 

However, if you are willing to add a key parameter, then 
there is no need for the out-of-band infoset handler 
interface. Instead of passing the key, we can just pass 
the infoset itself. And now we arrive back to what I had 
proposed in the first place.
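
The argument can be sketched in code. Everything below is
hypothetical: KeyVersusInfoSet, its method names, the integer key,
and the "example.type" entry are invented to compare the two
shapes, and neither is an existing XNI signature.

```java
import java.util.HashMap;
import java.util.Map;

class KeyVersusInfoSet {
    // Out-of-band variant: a hypothetical XMLInfoSetHandler fills this
    // side table, keyed by an identifier chosen by the producing stage.
    private final Map<Integer, Map<String, Object>> sideTable = new HashMap<>();

    // Stands in for the out-of-band handler callback.
    void infoSetItem(int key, String name, Object value) {
        sideTable.computeIfAbsent(key, k -> new HashMap<>()).put(name, value);
    }

    // The document callback still needs an extra parameter: the key.
    // Returns the type item found for this event, or null if none.
    Object charactersWithKey(String text, int key) {
        Map<String, Object> items = sideTable.get(key);
        return items == null ? null : items.get("example.type");
    }

    // In-band variant: the same extra parameter slot carries the infoset
    // itself, so no side table or lookup is needed.
    Object charactersWithInfoSet(String text, Map<String, Object> infoset) {
        return infoset == null ? null : infoset.get("example.type");
    }
}
```

Both callbacks take exactly one extra argument. The keyed version
also needs the side table, a second interface to fill it, and some
policy for when entries can be discarded.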

> DOM Parser as well as SAX parser will implement this
> interface and the information will be passed from XML Validator to the
> APIs (DOM .. XNI..). This handler might be eventually contributed to the
> SAX API package (as an extension).

I can see how the DOM would use this information to
build Level 3 Content Model nodes but I don't see the
use in the current version of SAX. The SAX interfaces
are already set and it's very difficult to solve the
problem of associating the document information with
the infoset augmentations. Do you have an idea how
you would solve this problem?

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org
