Posted to general@xml.apache.org by Michal Mosiewicz <mi...@interdata.com.pl> on 2000/05/11 00:17:20 UTC

SAX

I've already sent this to sax@megginson.com, but I thought it might be
worth mentioning here. I think SAX could be much more robust if it were
extended a little bit, and this list could be a good place to discuss
it.

First, what bothers me about SAX is that it lacks bidirectional
communication. That is, there is a data producer that sends the whole
document as events, and there is a consumer that is only expected to
listen for those events and follow them.

IMHO, SAX event handlers should also return some informative codes,
similarly to how it is implemented in Sun's taglibs, where tag handlers
may return things like SKIP_BODY or SKIP_PAGE.
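
A minimal sketch of what such return codes could look like, written in
Python for brevity. Everything here is hypothetical: the Action codes,
the handler methods and the produce() function are illustrative names
borrowed from the taglib analogy, not part of SAX. The producer walks a
small in-memory tree and skips an element's content when the handler
asks it to, while still sending the matching end event.

```python
# Hypothetical sketch: SKIP_BODY / EVAL_BODY and this handler protocol are
# NOT part of SAX; the names are borrowed from the taglib analogy.
from enum import Enum


class Action(Enum):
    EVAL_BODY = 1   # producer should go on and emit the element's content
    SKIP_BODY = 2   # producer may skip everything up to the end tag


class TitlesOnly:
    """Consumer that records events and tells the producer to skip bodies."""

    def __init__(self):
        self.events = []

    def start_element(self, name):
        self.events.append(("start", name))
        return Action.SKIP_BODY if name == "body" else Action.EVAL_BODY

    def characters(self, text):
        self.events.append(("chars", text))

    def end_element(self, name):
        self.events.append(("end", name))


def produce(node, handler):
    """Push a (name, children) tree to the handler, honoring its return code.

    Children are either nested (name, children) tuples or plain strings.
    """
    name, children = node
    if handler.start_element(name) is Action.EVAL_BODY:
        for child in children:
            if isinstance(child, str):
                handler.characters(child)
            else:
                produce(child, handler)
    handler.end_element(name)


article = ("article", [("title", ["SAX"]), ("body", [("p", ["long text"])])])
handler = TitlesOnly()
produce(article, handler)
print(handler.events)
```

In this sketch the handler never sees the "p" element or its text,
because the producer never generates them.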

All the parts of the producer-transformer-serializer path could possibly
benefit from this:

1. Some content producers may send more data than is needed to build
the final document, so the transformations applied may shorten the
source content, or even use only small parts of it. While this may not
matter much for simple sequentially read files, it may improve
performance if the producer can omit some parts, for example if you
wrap database data in your XML or, more generally, if you can access
the source document randomly.

2. We can optimize transformers. Especially if you pipeline several
transformations, you can trace back the events that don't produce any
result and optimize the whole transformation pipeline.

3. There is also a potentially much larger gain in the serializer,
because this could allow structure-level caching of the result, i.e.
you could decide to cache some fragments of the output, not necessarily
whole documents. Currently this is theoretically possible, but the
document producer cannot know that it is not required to provide some
data because the cached output is still valid.
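
Point 3 can be sketched roughly as follows, again in Python. This is
only an illustration under stated assumptions: the SKIP_BODY/EVAL_BODY
codes, the cache_key argument and the CachingSerializer class are made
up for the example and exist in no SAX release. Cache invalidation and
text escaping are left out. The serializer remembers the serialized
form of a keyed element and, on the next run, tells the producer that
the content is not needed.

```python
# Hypothetical protocol: the SKIP_BODY/EVAL_BODY return codes and the
# cache_key argument are assumptions for illustration, not part of SAX.
SKIP_BODY, EVAL_BODY = "skip", "eval"


class CachingSerializer:
    """Serializes events to text and caches keyed elements between runs."""

    def __init__(self):
        self.cache = {}   # cache_key -> serialized element
        self.out = []     # pieces of the document being written
        self.open = []    # one (state, cache_key, start index) per open element

    def start_document(self):
        self.out, self.open = [], []

    def start_element(self, name, cache_key=None):
        if cache_key is not None and cache_key in self.cache:
            self.out.append(self.cache[cache_key])
            self.open.append(("cached", None, 0))
            return SKIP_BODY          # producer need not generate the content
        self.open.append(("live", cache_key, len(self.out)))
        self.out.append(f"<{name}>")
        return EVAL_BODY

    def characters(self, text):
        self.out.append(text)

    def end_element(self, name):
        state, key, start = self.open.pop()
        if state == "cached":
            return                    # the cached text already has the end tag
        self.out.append(f"</{name}>")
        if key is not None:
            self.cache[key] = "".join(self.out[start:])

    def end_document(self):
        return "".join(self.out)


def render_page(serializer, load_news):
    """Producer: load_news() stands in for an expensive database query."""
    serializer.start_document()
    serializer.start_element("page")
    if serializer.start_element("news", cache_key="news") == EVAL_BODY:
        serializer.characters(load_news())
    serializer.end_element("news")
    serializer.end_element("page")
    return serializer.end_document()


calls = []


def load_news():
    calls.append("query")
    return "headline"


serializer = CachingSerializer()
first = render_page(serializer, load_news)
second = render_page(serializer, load_news)
print(first == second, len(calls))
```

The second run produces the same page, but the producer never calls
load_news(), which is the saving that a one-way event stream cannot
express.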

Does anybody know if there is any mailing list related strictly to SAX? 

-- Mike

Re: SAX

Posted by Mike Pogue <mp...@apache.org>.

The xml-dev mailing list (now hosted at OASIS, I think) is the
place where SAX is discussed.  Note that it is a very high volume
mailing list!

Mike


Re: SAX

Posted by Michal Mosiewicz <mi...@interdata.com.pl>.

Stefano Mazzocchi wrote:
> [...]
> When you want to store XML data and do a bunch of querying on top of it,
> you need a real DBMS which is able to index its XML content in such a
> way that XPath or XQL queries are executed _without_ the overhead of
> parsing the entire XML structure.

Just one thing I forgot to comment on here. You are absolutely right
that the content producer is not necessarily a parser. In fact, there
might be different content producers, possibly ones that make use of
indexed storage. In some cases it would be wise to implement a
database-storable DOM structure and make use of a passive API. But I
emphasise that this method is bound to the passive API. We've got two
different methodologies: one that pulls data from some content storage,
and one that pushes data to content consumers.

I'm just pointing out that the second method could possibly be optimised
to make better use of the active API, i.e. of the method that pushes
content. But you seem to argue that this is not necessary, because if we
can't do something better using the active API, we can just fall back to
the passive one.

But here you omit the fact that sometimes an active API may be suited
just as well as a passive one for executing only those fragments of the
producing code that are necessary. And, importantly, sometimes it can do
this better than a passive API because data production and consumption
can be better parallelized.

Note that an XML document has some interesting properties. One of them
is that it is an ordered tree of elements. There is a whole class of
document transformations that is bound to the natural order of the
source data (like the presentation layer), and such transformations can
be better implemented with an active API, because the document producer
can be more effective in providing data in natural order: with an active
API, moving to the next element is as easy as proceeding to the next
line of code.

Anyway, the bottom line is that I'm talking about improvements to the
SAX API that extend the number of cases in which it can be better than a
pull API.

-- Mike

Re: SAX

Posted by Stefano Mazzocchi <st...@apache.org>.

Michal Mosiewicz wrote:
> 
> Stefano Mazzocchi wrote:
> > [...]
> > You are clearly identifying a SAX producer as a parser or an XML adapter.
> >
> > If you think of a SAX producer as an XPointer implementation then you
> > ask for
> >
> >  file.xml#xpointer(/news/articles[@author='foo'])
> >
> > or even more powerful
> >
> >  file.xml#xql(whatever-XQL-will-look-like)
> >
> > and what it produces is exactly what you need for XML random access.
> 
> OK, say there are these article documents. Each of them has an
> /article/title, /article/img, /article/intro, and /article/body. Then
> you want to make an index page, so the most obvious solution would be to
> pass them through some transformation that selects the non-body content.
> Note that this invisible body content may be the larger part of the
> document. The transformation doesn't generate any event for the
> /article/body element or its subelements.
> 
> How could using XPointer or XQL help here? How would you prevent the
> parser from doing the useless job of generating unnecessary events?

That's the key point! You assume that whatever generates this data is a
parser. I never said so.

When you want to store XML data and do a bunch of querying on top of it,
you need a real DBMS which is able to index its XML content in such a
way that XPath or XQL queries are executed _without_ the overhead of
parsing the entire XML structure.

It's exactly the same with RDBMS and primary key indexes.
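
The index analogy can be illustrated with a toy example in Python. It is
a sketch only: an in-memory dictionary stands in for the index of an XML
DBMS, and the NEWS document and build_author_index() are made-up names.
The document is parsed once when the index is built, and a query like
the /news/articles[@author='foo'] pointer then becomes a lookup instead
of a pass over the whole structure.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up sample data shaped like the /news/articles[@author='foo'] query.
NEWS = """<news>
  <articles author="foo"><title>First</title></articles>
  <articles author="bar"><title>Second</title></articles>
  <articles author="foo"><title>Third</title></articles>
</news>"""


def build_author_index(xml_text):
    """Parse once and map each author to that author's <articles> elements."""
    index = defaultdict(list)
    for element in ET.fromstring(xml_text).findall("articles"):
        index[element.get("author")].append(element)
    return index


index = build_author_index(NEWS)      # cost paid once, e.g. at store time
# Answering the query is now a dictionary lookup, with no re-parse.
titles = [article.findtext("title") for article in index["foo"]]
print(titles)
```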

> What is the scenario here for reducing the computational cost of a
> transformation that is known to omit a large portion of the source
> document?

The above or, in case you have a large number of documents and no XML
DB, pregeneration.
 
> Also, this argument is schizophrenic to me. First we all agree that an
> active API like SAX could do better for us, but then we use a passive
> API to prove that improvements to the active API are not necessary,
> because the above XPointer syntax is nothing but a return to the old
> passive API.

I don't understand this rant.
 
> > In the Cocoon project we did a careful estimation of the requirements
> > for fragment caching and agreed that it's much better to improve
> > XInclude functionality and to cache entire documents, rather than
> > having document fragment caching.
> 
> I prefer a correct estimation to a careful one. I cannot understand how
> XInclude would be a much better improvement. I'm talking about an
> improvement that is possible along the whole processing path: from
> content generators, which may sometimes not be required to generate full
> content, to translators, and finally to the serializer, which is able to
> get the information about cacheable parts of the document and remember
> them in serialized form. XInclude can only be used to improve one side of
> the transformation (i.e. the producing side), and it requires that you
> operate on separate documents that may be included or not; you cannot
> mark some content as cacheable while passing it through some
> transformation.

I'm sorry but I don't understand the usefulness of this. Can you provide
an example?

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------
 Missed us in Orlando? Make it up with ApacheCON Europe in London!
------------------------- http://ApacheCon.Com ---------------------

Re: SAX

Posted by Michal Mosiewicz <mi...@interdata.com.pl>.

Stefano Mazzocchi wrote:
> [...]
> You are clearly identifying a SAX producer as a parser or an XML adapter.
> 
> If you think of a SAX producer as an XPointer implementation then you
> ask for
> 
>  file.xml#xpointer(/news/articles[@author='foo'])
> 
> or even more powerful
> 
>  file.xml#xql(whatever-XQL-will-look-like)
> 
> and what it produces is exactly what you need for XML random access.

OK, say there are these article documents. Each of them has an
/article/title, /article/img, /article/intro, and /article/body. Then
you want to make an index page, so the most obvious solution would be to
pass them through some transformation that selects the non-body content.
Note that this invisible body content may be the larger part of the
document. The transformation doesn't generate any event for the
/article/body element or its subelements.

How could using XPointer or XQL help here? How would you prevent the
parser from doing the useless job of generating unnecessary events?
What is the scenario here for reducing the computational cost of a
transformation that is known to omit a large portion of the source
document?
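
To make the article scenario concrete, here is a small sketch using the
SAX binding in Python's standard library (xml.sax). The sample document
and the IndexHandler class are made up for illustration. The handler
only wants the title, yet it still receives every event from inside
/article/body and can do nothing but count and discard them.

```python
import xml.sax


class IndexHandler(xml.sax.ContentHandler):
    """Collects title text and counts the events it has to throw away."""

    def __init__(self):
        super().__init__()
        self.path = []      # names of the currently open elements
        self.titles = []    # text chunks seen inside /article/title
        self.ignored = 0    # events delivered from inside /article/body

    def _in_body(self):
        return self.path[:2] == ["article", "body"]

    def startElement(self, name, attrs):
        self.path.append(name)
        if self._in_body():
            self.ignored += 1

    def characters(self, content):
        if self._in_body():
            self.ignored += 1
        elif self.path == ["article", "title"]:
            self.titles.append(content)

    def endElement(self, name):
        if self._in_body():
            self.ignored += 1
        self.path.pop()


doc = (b"<article><title>SAX</title><intro>Short</intro>"
       b"<body><p>one</p><p>two</p></body></article>")
handler = IndexHandler()
xml.sax.parseString(doc, handler)
print("".join(handler.titles), handler.ignored)
```

The handler has no way to tell the parser that the body is of no
interest, so the cost of producing those events is paid regardless.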

Also, this argument is schizophrenic to me. First we all agree that an
active API like SAX could do better for us, but then we use a passive
API to prove that improvements to the active API are not necessary,
because the above XPointer syntax is nothing but a return to the old
passive API.

> In the Cocoon project we did a careful estimation of the requirements
> for fragment caching and agreed that it's much better to improve
> XInclude functionality and to cache entire documents, rather than
> having document fragment caching.

I prefer a correct estimation to a careful one. I cannot understand how
XInclude would be a much better improvement. I'm talking about an
improvement that is possible along the whole processing path: from
content generators, which may sometimes not be required to generate full
content, to translators, and finally to the serializer, which is able to
get the information about cacheable parts of the document and remember
them in serialized form. XInclude can only be used to improve one side of
the transformation (i.e. the producing side), and it requires that you
operate on separate documents that may be included or not; you cannot
mark some content as cacheable while passing it through some
transformation.

-- Mike

Re: SAX

Posted by Edwin Goei <Ed...@eng.sun.com>.

> it should be xml-dev but I don't remember where it is hosted now.
> (anyone? Norman?)

See http://xml.org/ for info on subscribing to xml-dev.

Re: SAX

Posted by Stefano Mazzocchi <st...@apache.org>.

Michal Mosiewicz wrote:

Hi Michal, nice to see you around here :)
 
> I've already sent this to sax@megginson.com, but I thought it might be
> worth mentioning here. I think SAX could be much more robust if it were
> extended a little bit, and this list could be a good place to discuss
> it.
> 
> First, what bothers me about SAX is that it lacks bidirectional
> communication. That is, there is a data producer that sends the whole
> document as events, and there is a consumer that is only expected to
> listen for those events and follow them.

yes, that's the main design decision behind SAX and I must say I find it
perfect for its job.
 
> IMHO, SAX event handlers should also return some informative codes,
> similarly to how it is implemented in Sun's taglibs, where tag handlers
> may return things like SKIP_BODY or SKIP_PAGE.

I don't think there is any need for this (see below).
 
> All the parts of the producer-transformer-serializer path could possibly
> benefit from this:
> 
> 1. Some content producers may send more data than is needed to build
> the final document, so the transformations applied may shorten the
> source content, or even use only small parts of it. While this may not
> matter much for simple sequentially read files, it may improve
> performance if the producer can omit some parts, for example if you
> wrap database data in your XML or, more generally, if you can access
> the source document randomly.

You are clearly identifying a SAX producer as a parser or an XML adapter.

If you think of a SAX producer as an XPointer implementation then you
ask for

 file.xml#xpointer(/news/articles[@author='foo'])

or even more powerful

 file.xml#xql(whatever-XQL-will-look-like)

and what it produces is exactly what you need for XML random access.
 
> 2. We can optimize transformers. Especially if you pipeline several
> transformations, you can trace back the events that don't produce any
> result and optimize the whole transformation pipeline.

??? I don't get it. You spend time processing the event stream to mark
each event that doesn't trigger any event generation, and then you want
to reuse this information for other calls? This is useless, since the
producer may generate other events and you have to do the same all over
again.

> 3. There is also a potentially much larger gain in the serializer,
> because this could allow structure-level caching of the result, i.e.
> you could decide to cache some fragments of the output, not necessarily
> whole documents. Currently this is theoretically possible, but the
> document producer cannot know that it is not required to provide some
> data because the cached output is still valid.

In the Cocoon project we did a careful estimation of the requirements
for fragment caching and agreed that it's much better to improve
XInclude functionality and to cache entire documents, rather than
having document fragment caching.
 
> Does anybody know if there is any mailing list related strictly to SAX?

it should be xml-dev but I don't remember where it is hosted now.
(anyone? Norman?)

Anyway, I don't see the need for what you ask.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche