Posted to java-dev@axis.apache.org by Ted Leung <tw...@sauria.com> on 2001/08/14 08:34:40 UTC

Re: [Xerces2] Pull Parsing

I'm copying axis-dev in case they still care.
----- Original Message -----
From: "Andy Clark" <an...@apache.org>
To: <xe...@xml.apache.org>
Sent: Sunday, August 12, 2001 10:35 PM
Subject: [Xerces2] Pull Parsing


> [I'm forwarding this message from Ted to the mailing list.]
>
> Ted Leung wrote:
> > I sat down to work on a pull parser atop X2 today, and realized that
> > parseSome, etc are no longer exposed.  As far as I can tell, they got
> > pushed down into an argument on XMLDocumentScanner.scanDocument.  It
> > seems to me that the only way to write a pull parser is to create a new
> > parser configuration.  Am I missing something?  If not, I'll start a
> > thread on this in xerces-j-dev.
>
> Ted, you're not missing anything. You've realized a deficiency
> in XNI. While it *is* possible to write a pull parser, you are
> right that you would have to write a new parser configuration.

Ok, it's nice to know I'm not getting too old...

> While we now have the ability to do pull-parse scanning through
> the new document and DTD scanners, this functionality is hidden
> behind the single parse(XMLInputSource) method in the parser
> configuration. Therefore, I think we need to make a minor change
> to the XMLParserConfiguration interface. So the one method
>
>   parse(XMLInputSource):void
>
> should become the two
>
>   setInputSource(XMLInputSource)
>   parseDocument(boolean):boolean
>
> Which would then cascade to the base parser implementations. And
> the current DOM and SAX parsers which have the parse(InputSource)
> method would first call setInputSource and then call parseDocument
> with a true value to tell the configuration to parse completely.
>
> Whatcha think?

This would be fine by me, because it would solve my problem.  But here's
my concern.  Does it make sense to surface all of these kinds of details up
through XNI?  Or does it make sense to solve this some other way, like via
an object returned as a property?

> Ted, have you put some thought into what kind of API that should
> be on a Xerces2 based pull-parser? I would like to see an API
> that is simple enough to use for pull parsing but can communicate
> all of the information that XNI provides through its handler
> interfaces.

I've looked at XPP and KXML as alternative pull parsers.

XPP returns an int-based typecode and then you get to call one of
a number of methods and supply a data structure to be filled in.  It's very
C-like, and because of that, it's likely to be efficient in nice ways.

KXML returns objects for start and end tags, and it fills in the parent
child relationships for those objects as it goes.

I like the XPP approach in a lot of ways, because it's likely to be
efficient.  But I think the usage is kind of ugly.  I completely agree that
it should be based on XNI.  I'm still debating over callbacks vs objects.
I'm used to callbacks, but a number of people that I've polled don't want to
deal with callbacks; they want to deal with objects.  But I'm not sure that
makes sense.  The nice thing about a pull parser is that you can pass the
parser around to parts of your program, and the parts that know what they
need from it can ask the parser to get it for them.  That kind of gets you
out of encoding huge state machines into SAX handlers.
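
As a rough illustration of passing the parser around, here is a minimal Java
sketch. Everything in it is hypothetical: ToyEvent, TypecodePullParser and
OrderReader are stand-ins invented for this example, the int typecodes only
imitate the style described for XPP, and the parser replays a fixed event
list rather than scanning real XML. It is not the XPP, KXML, or Xerces2 API.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical event record: an int typecode plus a tag name or text value.
final class ToyEvent {
    static final int START_TAG = 1, TEXT = 2, END_TAG = 3, END_DOCUMENT = 4;
    final int type;
    final String value;
    ToyEvent(int type, String value) { this.type = type; this.value = value; }
}

// Hypothetical pull parser that replays a fixed event list instead of scanning XML.
final class TypecodePullParser {
    private final Iterator<ToyEvent> events;
    private String value;

    TypecodePullParser(List<ToyEvent> events) { this.events = events.iterator(); }

    // Advances to the next event and returns its typecode.
    int next() {
        if (!events.hasNext()) { value = null; return ToyEvent.END_DOCUMENT; }
        ToyEvent e = events.next();
        value = e.value;
        return e.type;
    }

    // Tag name or character data of the current event.
    String value() { return value; }
}

final class OrderReader {
    // Knows only about <customer>: consumes events up to the next end tag.
    static String readCustomer(TypecodePullParser p) {
        StringBuilder text = new StringBuilder();
        int type;
        while ((type = p.next()) != ToyEvent.END_TAG && type != ToyEvent.END_DOCUMENT) {
            if (type == ToyEvent.TEXT) text.append(p.value());
        }
        return text.toString();
    }

    // Walks the document and hands the parser to the code that knows each element.
    static String readOrder(TypecodePullParser p) {
        String customer = null;
        int type;
        while ((type = p.next()) != ToyEvent.END_DOCUMENT) {
            if (type == ToyEvent.START_TAG && "customer".equals(p.value())) {
                customer = readCustomer(p);
            }
        }
        return customer;
    }

    // Events for <order><customer>Ted</customer></order>.
    static List<ToyEvent> sampleEvents() {
        return Arrays.asList(
            new ToyEvent(ToyEvent.START_TAG, "order"),
            new ToyEvent(ToyEvent.START_TAG, "customer"),
            new ToyEvent(ToyEvent.TEXT, "Ted"),
            new ToyEvent(ToyEvent.END_TAG, "customer"),
            new ToyEvent(ToyEvent.END_TAG, "order"));
    }

    public static void main(String[] args) {
        System.out.println(readOrder(new TypecodePullParser(sampleEvents())));
    }
}
```

The position in the document lives in the call stack (readOrder calling
readCustomer) rather than in state fields of a handler, which is the
state-machine saving mentioned above.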

If anybody over in Axis still remembers or cares, I'd love to have some
input on what kind of API is desirable.

> --
> Andy Clark * IBM, TRL - Japan * andyc@apache.org
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
> For additional commands, e-mail: xerces-j-dev-help@xml.apache.org
>


Re: help on implementing pull in X2 [was Re: [Xerces2] Pull Parsing]

Posted by Ted Leung <tw...@sauria.com>.
----- Original Message -----
From: "Andy Clark" <an...@apache.org>
To: <xe...@xml.apache.org>
Sent: Friday, August 17, 2001 3:12 AM
Subject: Re: help on implementing pull in X2 [was Re: [Xerces2] Pull
Parsing]


> After looking at the pull-parsing capability in the document
> scanner some more, I realized that it didn't quite work right.
> So I fixed some things and also separated out the scanning of
> the DTD so that when we resolve the issue I raised in a
> recent post, we will be able to support pull-parsing even
> out through the DTD scanner. Neat! :)
>
> But there's still at least one more bug that I still see in
> the pull-parsing document scanner code so I'll try to fix
> that soon. I'll post a response when I've squashed it. But
> the code in CVS now shouldn't cause any regressions as long
> as you use it in the default push mode.
>
> [Q] Are we all in agreement about changing the parse method
>     in the XMLParserConfiguration? If so, I'll make that
>     change over the weekend as well.
+1
> --
> Andy Clark * IBM, TRL - Japan * andyc@apache.org


Re: help on implementing pull in X2 [was Re: [Xerces2] Pull Parsing]

Posted by Andy Clark <an...@apache.org>.
After looking at the pull-parsing capability in the document
scanner some more, I realized that it didn't quite work right.
So I fixed some things and also separated out the scanning of
the DTD so that when we resolve the issue I raised in a
recent post, we will be able to support pull-parsing even
out through the DTD scanner. Neat! :)

But there's still at least one more bug that I still see in
the pull-parsing document scanner code so I'll try to fix 
that soon. I'll post a response when I've squashed it. But
the code in CVS now shouldn't cause any regressions as long
as you use it in the default push mode.

[Q] Are we all in agreement about changing the parse method
    in the XMLParserConfiguration? If so, I'll make that
    change over the weekend as well.

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org



Re: help on implementing pull in X2 [was Re: [Xerces2] Pull Parsing]

Posted by Andy Clark <an...@apache.org>.
Andy Clark wrote:
> document scanner passes the "complete" parameter to the DTD
> scanner. However, this is a problem because the code will not
> correctly handle pull parsing the DTD (internal or external
> subset) in the case where complete is false. So that code
> will have to be fixed. 

Currently, the scanner can be used as a pull parser but the
DTD's internal and external subsets will be read completely.
I'm working on a fix to allow the DTD declarations to
also be pull parsed from the document scanner.

However, it occurs to me that you may want to have separate 
"complete" values for document vs. DTD scanning when you are 
scanning a document. In other words, a pull parser may want 
to stop after each document event *but* read the DTD entirely
without stopping. OR... a pull parser may want to stop after 
each document *and* DTD event.

This all leads me to think that we might need a change to
the XMLDocumentScanner interface. For example:

  - scanDocument(boolean complete):boolean
  + scanDocument(boolean completeDoc, boolean completeDTD):boolean

But, again, this would then cascade up through the config
and the parser instances...

The other option is that the scanning of the DTD's internal
and external subset takes the same "complete" value as the
scanning of the document. And a feature could be used to set 
the "completeDTD" value.
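
To make the two-flag variant concrete, here is a small Java sketch. It is an
illustration under assumptions, not the XMLDocumentScanner interface:
ToyTwoFlagScanner and ScanFlagsDemo are made-up names, events are plain
strings, and the boolean result is assumed to mean "more input remains".

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical scanner holding pre-baked DTD events and document events.
class ToyTwoFlagScanner {
    private final Deque<String> dtdEvents;
    private final Deque<String> docEvents;
    final List<String> delivered = new ArrayList<>();

    ToyTwoFlagScanner(List<String> dtd, List<String> doc) {
        dtdEvents = new ArrayDeque<>(dtd);
        docEvents = new ArrayDeque<>(doc);
    }

    // Each flag decides whether scanning pauses after one event of that kind.
    // Returns true while more input remains (assumed meaning of the result).
    boolean scanDocument(boolean completeDoc, boolean completeDTD) {
        while (!dtdEvents.isEmpty()) {
            delivered.add(dtdEvents.poll());
            if (!completeDTD) return hasMore();
        }
        while (!docEvents.isEmpty()) {
            delivered.add(docEvents.poll());
            if (!completeDoc) return hasMore();
        }
        return false;
    }

    private boolean hasMore() {
        return !dtdEvents.isEmpty() || !docEvents.isEmpty();
    }
}

class ScanFlagsDemo {
    // Records how many events each scanDocument call delivered.
    static List<Integer> stepSizes(boolean completeDoc, boolean completeDTD) {
        ToyTwoFlagScanner scanner = new ToyTwoFlagScanner(
            Arrays.asList("elementDecl", "attlistDecl"),
            Arrays.asList("startElement", "characters", "endElement"));
        List<Integer> sizes = new ArrayList<>();
        boolean more = true;
        while (more) {
            int before = scanner.delivered.size();
            more = scanner.scanDocument(completeDoc, completeDTD);
            sizes.add(scanner.delivered.size() - before);
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(stepSizes(false, true));   // DTD in one go, document stepwise
        System.out.println(stepSizes(false, false));  // stop after every event
    }
}
```

With completeDTD set and completeDoc clear, the first call runs through the
whole DTD and stops after the first document event; with both clear, every
call delivers exactly one event.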

Whatcha think?

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org



Re: help on implementing pull in X2 [was Re: [Xerces2] Pull Parsing]

Posted by Andy Clark <an...@apache.org>.
Aleksander Slominski wrote:
> could somebody point me to where in the documentation or source code to look
> for how to implement a pull parser configuration? i would like to play with it
> to see how incremental parsing happens (and how it fits into the pipeline) -
> for example, how is the next event obtained? or what needs to be written?

There's no documentation. However, if you look at the scanner
interfaces that are part of the xni.parser package, then you'll
notice that they each have setInputSource and scanXXX methods
that allow you to parse the document a piece at a time. The
granularity of the information is based on the methods in the
XNI handler interfaces.

It would be the responsibility of the pull parser implementation 
to receive this information and communicate it in whatever 
fashion it wants. But the breakdown of the data is ultimately 
mandated by what information is available in XNI.

The Xerces2 reference implementation has document and DTD
scanners that are capable of doing pull parsing. However, there
is currently no direct API to call this through the parsers or
parser configurations. (That's the discussion we're having
right now with Ted.) So for now you'd have to write a custom
configuration (you can copy the StandardParserConfiguration)
and add methods to allow pull-parsing.
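
A minimal Java sketch of how a pull front end might obtain the "next event"
from a scanner of this shape. All names here (ToyHandler, ToyDocumentScanner,
ToyPullFrontEnd) are hypothetical stand-ins, the input is split on whitespace
rather than parsed as XML, and the boolean result of scanDocument is assumed
to mean "more input remains". It is not the Xerces2 scanner or configuration
code.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Stand-in for an XNI handler interface (hypothetical, one callback only).
interface ToyHandler {
    void token(String text);
}

// Stand-in for a scanner with setInputSource and scanXXX methods.
class ToyDocumentScanner {
    private String[] tokens = new String[0];
    private int pos;
    private ToyHandler handler;

    void setHandler(ToyHandler h) { handler = h; }

    void setInputSource(String input) {
        String trimmed = input.trim();
        tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        pos = 0;
    }

    // complete == true scans everything; false scans one piece and returns.
    // Returns true while more input remains (assumed meaning of the result).
    boolean scanDocument(boolean complete) {
        while (pos < tokens.length) {
            handler.token(tokens[pos++]);
            if (!complete) break;
        }
        return pos < tokens.length;
    }
}

// Pull front end: buffers handler callbacks and hands them out on request.
class ToyPullFrontEnd implements ToyHandler {
    private final ToyDocumentScanner scanner = new ToyDocumentScanner();
    private final Queue<String> pending = new ArrayDeque<>();
    private boolean more = true;

    ToyPullFrontEnd(String input) {
        scanner.setHandler(this);
        scanner.setInputSource(input);
    }

    @Override
    public void token(String text) { pending.add(text); }

    // Returns the next event, or null at end of input.
    String next() {
        while (pending.isEmpty() && more) {
            more = scanner.scanDocument(false);
        }
        return pending.poll();
    }

    public static void main(String[] args) {
        ToyPullFrontEnd parser = new ToyPullFrontEnd("a b c");
        for (String t = parser.next(); t != null; t = parser.next()) {
            System.out.println(t);
        }
    }
}
```

The front end is still a handler underneath; the only pull-specific part is
that it asks the scanner for one more piece whenever its buffer is empty.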

But I just thought of something... Let me check... Okay, the
document scanner passes the "complete" parameter to the DTD
scanner. However, this is a problem because the code will not
correctly handle pull parsing the DTD (internal or external
subset) in the case where complete is false. So that code 
will have to be fixed. Looking for something to do? ;)

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org



help on implementing pull in X2 [was Re: [Xerces2] Pull Parsing]

Posted by Aleksander Slominski <as...@cs.indiana.edu>.
Ted Leung wrote:

> Well, let's talk about how this would work -- you'd have
> to return an object that had, say:
>
> public interface PullParserAPI {
>     public void setInputSource(InputSource source);
>     public boolean parseDocument(boolean complete);
> }
>
> That way you'd define parser configuration, which would have no
> parser API methods.  The usage model then would be to create
> a parser convenience class that implemented PullParserAPI, which
> then used the pull configuration to get the object from the property and
> hook it up as the implementation of the parser convenience class.

hi,

could somebody point me to where in the documentation or source code to look
for how to implement a pull parser configuration? i would like to play with it
to see how incremental parsing happens (and how it fits into the pipeline) -
for example, how is the next event obtained? or what needs to be written?

thanks,

alek





Re: [Xerces2] Pull Parsing

Posted by Andy Clark <an...@apache.org>.
Ted Leung wrote:
> I like #2.  Pull is big enough to justify it.

Okay. Since I don't see any violent disagreement and I'm leaning
that way anyway, I'll add that to the XNI parser package
and implement it in Xerces2. We can always change it later before
the first Xerces2 production release.

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org



Re: [Xerces2] Pull Parsing

Posted by Ted Leung <tw...@sauria.com>.
I like #2.  Pull is big enough to justify it.

Ted
----- Original Message ----- 
From: "Andy Clark" <an...@apache.org>
To: <xe...@xml.apache.org>
Sent: Sunday, August 19, 2001 9:03 PM
Subject: Re: [Xerces2] Pull Parsing


> Ted Leung wrote:
> > That way everybody can see which control capability is supported.
> > Of course, we now have more interfaces than there are atoms in the
> > universe, but....
> 
> That's what concerns me. XNI is blowing up out of control.
> So do you think we need more discussion before I implement
> something? Which way are you currently leaning? 
> 
>   (1) Replace parse method with setInputSource/parseDocument
>   (2) Extend parser configuration interface to create new
>       "pull" parser configuration
> 
> -- 
> Andy Clark * IBM, TRL - Japan * andyc@apache.org


Re: [Xerces2] Pull Parsing

Posted by Andy Clark <an...@apache.org>.
Ted Leung wrote:
> That way everybody can see which control capability is supported.
> Of course, we now have more interfaces than there are atoms in the
> universe, but....

That's what concerns me. XNI is blowing up out of control.
So do you think we need more discussion before I implement
something? Which way are you currently leaning? 

  (1) Replace parse method with setInputSource/parseDocument
  (2) Extend parser configuration interface to create new
      "pull" parser configuration

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org



Re: [Xerces2] Pull Parsing

Posted by Ted Leung <tw...@sauria.com>.
----- Original Message -----
From: "Andy Clark" <an...@apache.org>
To: <xe...@xml.apache.org>
Sent: Friday, August 17, 2001 12:30 AM
Subject: Re: [Xerces2] Pull Parsing


> Ted Leung wrote:
> > Okay, this is convincing and I don't see another way around this.
> > So I guess we have to add the methods to the default configuration.
>
> So which of the following two choices seems better? The first
> choice is cleaner because in the second option you still have
> that parse(XMLInputSource) method from the base interface. I
> think that it could lead to some confusion about which to use
> and how they interact. However, the second option is nice
> because you know right away if your configuration supports
> pull parsing. In the first choice, you can't really know if
> the configuration is able to perform pull parsing.

I don't think 1 is cleaner because it forces everybody to be a
pull parser.  If what we want is to be able to detect pull-parsing
capability we should do 2.   In the case where an implementation
supports pull-parsing, such as the current SAX and DOM parsers,
we should alter their configurations to implement XMLPullParserConfiguration
not XMLParserConfiguration.

I suppose that if the parser control methods (by this I mean all variants of
parse) are an issue, then we could separate those methods into a separate
interface, and then require actual configurations to implement them, so

interface XMLPushParserControl {
    void parse(XMLInputSource source);
}

and

interface XMLPullParserControl {
    void setInputSource(XMLInputSource source);
    boolean parseDocument(boolean complete);
}

That way everybody can see which control capability is supported.  Of
course, we now have more interfaces than there are atoms in the universe,
but....
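
A self-contained Java sketch of that split, for illustration only. The
interface and class names (ToyPushControl, ToyPullControl, PushOnlyConfig,
PushPullConfig, ControlCheck) are hypothetical, String stands in for
XMLInputSource, each character counts as one "event", and the boolean result
of parseDocument is assumed to mean "more input remains".

```java
// Stand-ins for the two control interfaces; String replaces XMLInputSource.
interface ToyPushControl {
    void parse(String source);
}

interface ToyPullControl {
    void setInputSource(String source);
    boolean parseDocument(boolean complete);  // assumed: true while input remains
}

// A configuration that can only parse in one go.
class PushOnlyConfig implements ToyPushControl {
    int consumed;

    @Override
    public void parse(String source) { consumed = source.length(); }
}

// A configuration that supports both styles; each character is one "event".
class PushPullConfig implements ToyPushControl, ToyPullControl {
    private String input = "";
    private int pos;

    @Override
    public void setInputSource(String source) { input = source; pos = 0; }

    @Override
    public boolean parseDocument(boolean complete) {
        while (pos < input.length()) {
            pos++;
            if (!complete) break;
        }
        return pos < input.length();
    }

    // Push parsing expressed as pull parsing with complete == true.
    @Override
    public void parse(String source) {
        setInputSource(source);
        parseDocument(true);
    }

    int position() { return pos; }
}

class ControlCheck {
    // Reports which control capability a configuration object offers.
    static String describe(Object config) {
        boolean push = config instanceof ToyPushControl;
        boolean pull = config instanceof ToyPullControl;
        if (push && pull) return "push+pull";
        if (pull) return "pull";
        return push ? "push" : "none";
    }

    public static void main(String[] args) {
        System.out.println(describe(new PushOnlyConfig()));
        System.out.println(describe(new PushPullConfig()));
    }
}
```

A caller that needs incremental parsing can test for the pull interface up
front instead of finding out later that parseDocument(false) did nothing.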

Ted

> CHOICE 1: Change XMLParserConfiguration Interface
>
>   - parse(XMLInputSource)
>   + setInputSource(XMLInputSource)
>   + parseDocument(boolean):boolean
>
> CHOICE 2: Extend XMLParserConfiguration Interface
>
>   + interface XMLPullParserConfiguration : XMLParserConfiguration
>     + setInputSource(XMLInputSource)
>     + parseDocument(boolean):boolean
>
> > I think that we should have one of these as well for the API that Alek
> > and I are discussing.
>
> If we have pull-parsing capability built into the parser
> configuration interface *and* expose it in the parsers,
> then we don't need this property. However, if it's only
> exposed in the configuration and *not* in the parser
> instances, then we need something like this. Which one
> are you thinking is better?
>
> --
> Andy Clark * IBM, TRL - Japan * andyc@apache.org


Re: [Xerces2] Pull Parsing

Posted by Andy Clark <an...@apache.org>.
Ted Leung wrote:
> Okay, this is convincing and I don't see another way around this.
> So I guess we have to add the methods to the default configuration.

So which of the following two choices seems better? The first
choice is cleaner because in the second option you still have
that parse(XMLInputSource) method from the base interface. I
think that it could lead to some confusion about which to use
and how they interact. However, the second option is nice
because you know right away if your configuration supports
pull parsing. In the first choice, you can't really know if
the configuration is able to perform pull parsing.

CHOICE 1: Change XMLParserConfiguration Interface

  - parse(XMLInputSource)
  + setInputSource(XMLInputSource)
  + parseDocument(boolean):boolean

CHOICE 2: Extend XMLParserConfiguration Interface

  + interface XMLPullParserConfiguration : XMLParserConfiguration
    + setInputSource(XMLInputSource)
    + parseDocument(boolean):boolean

> I think that we should have one of these as well for the API that Alek and
> I are discussing.

If we have pull-parsing capability built into the parser
configuration interface *and* expose it in the parsers,
then we don't need this property. However, if it's only
exposed in the configuration and *not* in the parser
instances, then we need something like this. Which one
are you thinking is better?

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org



Re: [Xerces2] Pull Parsing

Posted by Ted Leung <tw...@sauria.com>.
----- Original Message ----- 
From: "Andy Clark" <an...@apache.org>
To: <xe...@xml.apache.org>
Sent: Wednesday, August 15, 2001 8:42 PM
Subject: Re: [Xerces2] Pull Parsing


> Ted Leung wrote:
> > As I said, I can live with this, but it does seem unclean, 
> > because the pull parser API shouldn't be exposed in the 
> > Configuration.
> 
> I'm still thinking that the parser configuration should 
> have the methods (just like the document and DTD scanner 
> interfaces have this ability). However, all configurations 
> are not required to be able to do pull-parsing. If the 
> configuration doesn't support it, then calling parse with 
> a value of false (in order to parse only the next bit of 
> the document) would do nothing.
> 
> If we end up not exposing the pull parsing through the
> configuration interface (and possibly through the parser 
> implementations), then people can't take a generic 
> configuration and perform pull parsing for the existing 
> DOM and SAX parsers. Then we're back in the situation we 
> had before when we didn't have scanner interfaces: people 
> could build configurations but never *start* parsing them 
> in a generic way.

Okay, this is convincing and I don't see another way around this.
So I guess we have to add the methods to the default configuration.

> > Well, let's talk about how this would work -- you'd have
> > to return an object that had, say:
> >
> > public interface PullParserAPI {
> >     public void setInputSource(InputSource source);
> >     public boolean parseDocument(boolean complete);
> > }
> 
> Interesting. I still don't like the property object
> because it can only be checked at run-time when you
> want to know at compile-time that you have this
> capability. Another option would be to extend the
> XMLParserConfiguration interface. For example:
> 
> public interface XMLPullParserConfiguration
>     extends XMLParserConfiguration {
>     public void setInputSource(InputSource) throws ...;
>     public boolean parseDocument(boolean) throws ...;
> }

I think that we should have one of these as well for the API that Alek and
I are discussing.
 
> Then if your configuration that you're using is an
> instanceof XMLPullParserConfiguration, you know that
> the configuration supports that capability.
> 
> -- 
> Andy Clark * IBM, TRL - Japan * andyc@apache.org


Re: [Xerces2] Pull Parsing

Posted by Andy Clark <an...@apache.org>.
Ted Leung wrote:
> As I said, I can live with this, but it does seem unclean, 
> because the pull parser API shouldn't be exposed in the 
> Configuration.

I'm still thinking that the parser configuration should 
have the methods (just like the document and DTD scanner 
interfaces have this ability). However, all configurations 
are not required to be able to do pull-parsing. If the 
configuration doesn't support it, then calling parse with 
a value of false (in order to parse only the next bit of 
the document) would do nothing.

If we end up not exposing the pull parsing through the
configuration interface (and possibly through the parser 
implementations), then people can't take a generic 
configuration and perform pull parsing for the existing 
DOM and SAX parsers. Then we're back in the situation we 
had before when we didn't have scanner interfaces: people 
could build configurations but never *start* parsing them 
in a generic way.

> Well, let's talk about how this would work -- you'd have
> to return an object that had, say:
>
> public interface PullParserAPI {
>     public void setInputSource(InputSource source);
>     public boolean parseDocument(boolean complete);
> }

Interesting. I still don't like the property object
because it can only be checked at run-time when you
want to know at compile-time that you have this
capability. Another option would be to extend the
XMLParserConfiguration interface. For example:

public interface XMLPullParserConfiguration
    extends XMLParserConfiguration {
    public void setInputSource(InputSource) throws ...;
    public boolean parseDocument(boolean) throws ...;
}

Then if your configuration that you're using is an
instanceof XMLPullParserConfiguration, you know that
the configuration supports that capability.

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org



Re: [Xerces2] Pull Parsing

Posted by Ted Leung <tw...@sauria.com>.
----- Original Message -----
From: "Andy Clark" <an...@apache.org>
To: <xe...@xml.apache.org>
Sent: Wednesday, August 15, 2001 1:19 AM
Subject: Re: [Xerces2] Pull Parsing


> Ted Leung wrote:
> > This would be fine by me, because it would solve my problem.  But here's
> > my concern.  Does it make sense to surface all of these kinds of details
> > up through XNI?  Or does it make sense to solve this some other way, like
> > via an object returned as a property?
>
> You may be right in the sense that as long as the ability to do
> it is on the XMLParserConfiguration, then it doesn't need to
> cascade all the way out to the parser. Only the pull parser
> implementation would need to know how to do this.
>

As I said, I can live with this, but it does seem unclean, because the pull
parser API shouldn't be exposed in the Configuration.

>
> However... there are people that will want to do pull parsing
> of DOM and SAX so maybe it's better to make it completely
> visible even though most people will just use parse(String) or
> parse(InputSource) like they're used to doing.
>
> I don't know if returning a special property is the right way
> to solve this problem.

Well, let's talk about how this would work -- you'd have
to return an object that had, say:

public interface PullParserAPI {
    public void setInputSource(InputSource source);
    public boolean parseDocument(boolean complete);
}

That way you'd define parser configuration, which would have no
parser API methods.  The usage model then would be to create
a parser convenience class that implemented PullParserAPI, which
then used the pull configuration to get the object from the property and
hook it up as the implementation of the parser convenience class.
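
A rough Java sketch of that usage model, under stated assumptions: the
property id "toy:pull-parser-api" is invented, ToyPullApi only mirrors the
shape of the interface above with String in place of InputSource, and the
engine treats each character as one "event". None of these names exist in
XNI.

```java
import java.util.HashMap;
import java.util.Map;

// Mirrors the shape of the interface above, with String in place of InputSource.
interface ToyPullApi {
    void setInputSource(String source);
    boolean parseDocument(boolean complete);  // assumed: true while input remains
}

// The engine the configuration keeps internally; one character is one "event".
class ToyPullEngine implements ToyPullApi {
    private String input = "";
    private int pos;

    @Override
    public void setInputSource(String source) { input = source; pos = 0; }

    @Override
    public boolean parseDocument(boolean complete) {
        while (pos < input.length()) {
            pos++;
            if (!complete) break;
        }
        return pos < input.length();
    }
}

// A configuration with no parse methods of its own, only properties.
class ToyPropertyConfiguration {
    static final String PULL_API = "toy:pull-parser-api";  // hypothetical property id
    private final Map<String, Object> properties = new HashMap<>();

    ToyPropertyConfiguration() { properties.put(PULL_API, new ToyPullEngine()); }

    Object getProperty(String id) { return properties.get(id); }
}

// Convenience class: looks the object up once and delegates to it.
class ToyConveniencePullParser implements ToyPullApi {
    private final ToyPullApi impl;

    ToyConveniencePullParser(ToyPropertyConfiguration config) {
        Object p = config.getProperty(ToyPropertyConfiguration.PULL_API);
        if (!(p instanceof ToyPullApi)) {
            // Support can only be discovered here, at run time.
            throw new IllegalStateException("configuration does not support pull parsing");
        }
        impl = (ToyPullApi) p;
    }

    @Override
    public void setInputSource(String source) { impl.setInputSource(source); }

    @Override
    public boolean parseDocument(boolean complete) { return impl.parseDocument(complete); }

    // Counts how many incremental calls it takes to consume the input.
    static int countSteps(String source) {
        ToyConveniencePullParser parser =
            new ToyConveniencePullParser(new ToyPropertyConfiguration());
        parser.setInputSource(source);
        int steps = 1;
        while (parser.parseDocument(false)) steps++;
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(countSteps("abc"));
    }
}
```

The configuration itself stays free of parse methods; the cost is the cast
and the run-time check in the constructor.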

> --
> Andy Clark * IBM, TRL - Japan * andyc@apache.org


Re: [Xerces2] Pull Parsing

Posted by Andy Clark <an...@apache.org>.
Ted Leung wrote:
> This would be fine by me, because it would solve my problem.  But here's
> my concern.  Does it make sense to surface all of these kinds of details up
> through XNI?  Or does it make sense to solve this some other way, like via
> an object returned as a property?

You may be right in the sense that as long as the ability to do
it is on the XMLParserConfiguration, then it doesn't need to
cascade all the way out to the parser. Only the pull parser
implementation would need to know how to do this. 

However... there are people that will want to do pull parsing 
of DOM and SAX so maybe it's better to make it completely 
visible even though most people will just use parse(String) or 
parse(InputSource) like they're used to doing.

I don't know if returning a special property is the right way
to solve this problem.

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org



RE: [Xerces2] Pull Parsing

Posted by Jeremy Carroll <jj...@hplb.hpl.hp.com>.
I thought I might draw attention to my use of Xerces1 under this topic.

I wrote an RDF parser with a coroutine architecture.
The principal goals are the ability to change the RDF grammar easily and
the ability to understand the RDF grammar. Efficiency is a non-objective.

Basic design

XML-doc
   ==> Xerces
   ==> SAX
   ==> SAX events
   ==> keyword recogniser
   ==> "infoset events"
   ==> JavaCC RDF parser

Up to the SAX events this is normal.
The SAX events are mapped in the following fashion to a finer-grained event
set:

  endElement ==> E_END
  characters ==> CD_STRING
  startElement ==> some token for the startTag,
   pairs of tokens for each attribute-value pair.
  The startTag token and the attribute tokens are both subject to "keyword
recognition", i.e. if the tag or attribute is particularly significant in the
RDF grammar (e.g. rdf:RDF) then a special token is created; otherwise
a generic E_OTHER is output.


The dataflow is implemented using Java threads, which is hugely
inefficient but achieves my design goals. The tokens produced by the
keyword recogniser are stuffed into a pipe (like Doug Lea's BoundedBuffer).

The JavaCC RDF parser then pulls tokens out of the pipe.
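
The shape of that pipeline can be sketched in a few lines of Java. This is
only an illustration, not the ARP code: ToyTokenPipe is a made-up class, the
SAX events are pre-baked string pairs rather than real callbacks, attribute
tokens are left out, E_RDF and the EOF marker are invented names, and
java.util.concurrent.ArrayBlockingQueue stands in for the bounded buffer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class ToyTokenPipe {
    private static final String EOF = "EOF";  // hypothetical end-of-stream marker

    // Keyword recognition: significant names get a special token, others E_OTHER.
    static String recognise(String name) {
        return "rdf:RDF".equals(name) ? "E_RDF" : "E_OTHER";  // E_RDF is a made-up token name
    }

    // Maps one SAX-like event, given as {kind, name-or-text}, to a token.
    static String toToken(String[] event) {
        switch (event[0]) {
            case "startElement": return recognise(event[1]);
            case "characters":   return "CD_STRING";
            case "endElement":   return "E_END";
            default: throw new IllegalArgumentException(event[0]);
        }
    }

    // A producer thread pushes tokens into a bounded pipe; the caller pulls them out.
    static List<String> run(List<String[]> saxEvents) {
        BlockingQueue<String> pipe = new ArrayBlockingQueue<>(2);
        Thread producer = new Thread(() -> {
            try {
                for (String[] event : saxEvents) pipe.put(toToken(event));
                pipe.put(EOF);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        List<String> pulled = new ArrayList<>();
        try {
            for (String t = pipe.take(); !EOF.equals(t); t = pipe.take()) {
                pulled.add(t);
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return pulled;
    }

    // Events for <rdf:RDF><foo>hello</foo></rdf:RDF>.
    static List<String[]> sampleEvents() {
        return Arrays.asList(
            new String[] {"startElement", "rdf:RDF"},
            new String[] {"startElement", "foo"},
            new String[] {"characters", "hello"},
            new String[] {"endElement", "foo"},
            new String[] {"endElement", "rdf:RDF"});
    }

    public static void main(String[] args) {
        System.out.println(run(sampleEvents()));
    }
}
```

The push side (the producer thread) and the pull side (the loop calling
take) each keep their own control flow, and the thread hand-off through the
small queue is where the overhead comes from.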

The JavaCC definition file:

http://www-uk.hpl.hp.com/people/jjc/arp/arp-1_0_3/src/com/hp/hpl/jena/rdf/arp/rdf.jj

Features/bugs of this design:

+ The tokens are the smallest items in the infoset.
+ This generates attribute ordering issues (my grammar and keyword
recogniser have an agreed, application-defined order in which to handle
attributes).
+ Extending to the full infoset would mean that the application should identify
those parts of the infoset which should appear in the token stream, and those
parts that aren't interesting; e.g. in my case comments are discarded.

I have done some experiments with an LALR(1) parser rather than the JavaCC
LL(1) parser. Inverting that produces a very significant speedup (the
thread overhead is huge). I would expect that inverting the XML parsing
(i.e. pull parsing) would also produce such a speedup.

My parser page is:

http://www-uk.hpl.hp.com/people/jjc/arp

Jeremy Carroll
HP Labs
