Posted to jaxme-dev@ws.apache.org by Daniel Barclay <da...@fgm.com> on 2005/04/25 16:59:34 UTC

recursive includes hang JaxMeXS

It seems JaxMeXS can't handle schema files that recursively include
each other (e.g., A.xsd includes B.xsd, B.xsd includes A.xsd).
JaxMeXS hangs in an endless recursion, reading A, then reading B,
then reading another copy of A, then reading another copy of B, etc.

I've seen this with JaxMe 0.3.1.  Does JaxMe 0.4beta still have the
same problem?  The attached files constitute a small test case.

(Whether or not recursive includes are legal in XML Schema, JaxMeXS should
not get stuck in an endless loop.  If they are legal, obviously JaxMeXS
should handle them.  If they are not legal, JaxMeXS still should not
hang and, ideally (if JaxMeXS were meant to be a schema checker), should
report the error.

It seems that JaxMeXS should keep track of which schema documents it has
already read in, and, when processing xsd:include constructs, should check
whether the document to be included has already been read in.)
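
A minimal Java sketch of that bookkeeping, not taken from JaxMeXS: the
class name IncludeTracker and the map-based model of schema documents
are assumptions made for illustration (requires Java 9 or later for
List.of and Map.of).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not JaxMeXS code. Schema documents are modelled as a
// map from a document name to the names it includes.
class IncludeTracker {
    // Returns the documents in the order they are read, each exactly once.
    static List<String> readAll(String start, Map<String, List<String>> includes) {
        List<String> order = new ArrayList<>();
        read(start, includes, new HashSet<>(), order);
        return order;
    }

    private static void read(String doc, Map<String, List<String>> includes,
                             Set<String> alreadyRead, List<String> order) {
        // Set.add returns false if the document was already recorded,
        // which is what breaks the A -> B -> A cycle.
        if (!alreadyRead.add(doc)) {
            return;
        }
        order.add(doc);
        for (String included : includes.getOrDefault(doc, List.of())) {
            read(included, includes, alreadyRead, order);
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> includes = Map.of(
            "A.xsd", List.of("B.xsd"),
            "B.xsd", List.of("A.xsd"));
        System.out.println(readAll("A.xsd", includes)); // prints [A.xsd, B.xsd]
    }
}
```

Because the starting document is recorded in the same set as the
included ones, the second attempt to read A.xsd is skipped.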


Daniel






Re: recursive includes hang JaxMeXS

Posted by Jochen Wiedmann <jo...@gmail.com>.
Hi, Daniel,

I have committed another patch to 0.4 and HEAD, which should fix the
problem below. The reason was that the outer schema's system ID wasn't
remembered.

A unit test is available, see the method
  ParserTest.testRecursiveXsInclude()


Jochen


Daniel Barclay wrote:
> Jochen Wiedmann wrote:
> 
>>> It seems JaxMeXS can't handle schema files that recursively include
>>> each other (e.g., A.xsd includes B.xsd, B.xsd includes A.xsd).
>>
>>
>>
>> You are right. The check was implemented for xs:import, but not for
>> xs:include. Fixed in the 0.4 branch and in HEAD.
> 
> 
> Does the fix just prevent the endless recursion, or does it also handle
> inclusion correctly?
> 
> I tried manually back-porting the changes to 0.3.1.  When I ran my
> original test case, it did prevent endless recursion.
> 
> However, when I add a declaration to _one_ of the schema files, I get
> an error about duplicate definitions.
> 
> Specifically, when I add a declaration to the file at which parsing
> starts, I get the error.  When I add the declaration to a file
> that is reached only by inclusion, I don't get the error.
> 
> - A.xsd includes B.xsd
> - B.xsd includes A.xsd
> - A.xsd defines an element E
> - starting parsing at A.xsd yields an error about E being a duplicate
> - starting parsing at B.xsd does not yield such an error
> 
> Assuming I didn't back-port the change wrong, it seems that the
> parser forgets to include the starting file in the set of files
> for which it should skip includes (because they have already been
> read).
> 
> 
> 
> Here is a pair of test files:
> 
> -----------------------------------------------------------------------
> RecursiveIndirectly1.xsd:
> ----------
> <xsd:schema
>   xmlns:xsd="http://www.w3.org/2001/XMLSchema"
>   >
>   <xsd:include schemaLocation="RecursiveIndirectly2.xsd" />
>     
>   <xsd:element name="In1"/>
> 
> </xsd:schema>
> -----------------------------------------------------------------------
> RecursiveIndirectly2.xsd:
> ----------
> <xsd:schema
>   xmlns:xsd="http://www.w3.org/2001/XMLSchema"
>   >
>   <xsd:include schemaLocation="RecursiveIndirectly1.xsd" />
> 
>   <!--xsd:element name="In2"/-->
> 
> </xsd:schema>
> --------------------------------------------------------------------
> 
> 
> Daniel
> 
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: jaxme-dev-unsubscribe@ws.apache.org
For additional commands, e-mail: jaxme-dev-help@ws.apache.org


Re: recursive includes hang JaxMeXS

Posted by Daniel Barclay <da...@fgm.com>.
(Agh!   Mozilla crashed taking my almost-completed reply with it.
Here's another try.)


Jochen,

Jochen Wiedmann wrote:
> Daniel Barclay wrote:
...
>>XML Schema specification says (part 1, section 4.2.1):
>>
>>    applications are allowed,
>>    indeed encouraged, to avoid <include>ing the same schema document more
>>    than once to forestall the necessity of establishing identity component
>>    by component.
...

> JaxMeXS is unable to "establish identity component by component". You
> might consider that as a lack of an important feature, but that's how it
> is. If you volunteer for adding that feature, you are welcome.

No, I don't think JaxMeXS needs to establish identity component by
component, since the schema specification gives applications the option
to skip already-included documents.  However, I do think it needs to
check for same-document inclusion just a bit differently than it
currently does.

In any case, to be correct, JaxMeXS needs to do _something_ to avoid
reporting errors that are not actually errors.

(Of course, whether they actually are errors depends on exactly what
the schema specification really means, which isn't clear yet. I
submitted a question to www-xml-schema-comments@w3.org.)



> As long as that is missing, you've got to make sure, that the parser
> knows, that it is including the "same schema document". Currently, this
> is done by ensuring that the system ID is *the same*. Not the same as in
> "the same file in the filesystem" or as in "referring to the same URL",
> but lexically the same. 

Why do you think the parser should compare the _unresolved_ URI
reference from an include directive, which might be relative, instead
of comparing the _resolved_, non-relative URI that it is about to use
as a system ID to retrieve a document?


The overall parser (JaxMeXS and/or the underlying, lower-level parser)
already has to keep track of the base URI of the current schema
document being parsed, xml:base attributes, and the location of URI
references relative to xml:base attributes, and also has to combine
that information to resolve any relative references into non-relative
URIs in order to read included or imported schema documents.

Since the parser already has to resolve relative references into
non-relative URIs, shouldn't the parser be comparing resolved,
non-relative URIs instead of unresolved URI references that might still
be relative references?
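
A small Java sketch of what such resolution looks like with the
standard java.net.URI class. The class name ResolveSchemaLocation and
the example.org URIs are made up for illustration; the thread does not
give the real paths.

```java
import java.net.URI;

class ResolveSchemaLocation {
    // Resolves a schemaLocation value against the URI of the including document.
    // An absolute reference is returned unchanged; a relative one is completed.
    static URI resolve(URI baseOfIncludingDocument, String schemaLocation) {
        return baseOfIncludingDocument.resolve(schemaLocation);
    }

    public static void main(String[] args) {
        // Example URIs only; the thread elides the real paths.
        URI a = URI.create("http://example.org/schemas/A.xsd");
        URI b = resolve(a, "B.xsd");          // include found in A.xsd
        URI aAgain = resolve(b, "A.xsd");     // include found in B.xsd
        System.out.println(b);                // prints http://example.org/schemas/B.xsd
        System.out.println(aAgain.equals(a)); // prints true
    }
}
```

After resolution, the relative reference "A.xsd" found in B.xsd and the
URI used to open A.xsd in the first place compare as equal.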


> Again, if you don't like this as it is, you are
> welcome to volunteer for a better solution. 

As I wrote, the solution is simply to use the resolved non-relative
URI (from the resolution you _already_ perform) instead of using the
unresolved URI reference directly from the include directive.
Regarding version 0.3.1, I wrote:

   In fact, I used your URI resolution in getInputSource() to get
   resolved, non-relative URIs and used those resolved URIs (and not
   the original URI reference given in the include directive) with the
   includedSchemas Set.  That seems to work correctly.

I moved the call to getInputSource(...) up to before the code that
checked and then set includedSchemas and used the resolved URI in
the InputSource from getInputSource() (instead of the unresolved
URI reference) when checking and setting includedSchemas.


The current CVS version has been rearranged a bit since then, so it's
harder to tell exactly where the changes would go.

You seem to have moved the check for whether a document has already
been parsed to _after_ you have done lower-level parsing.  Although
that doesn't necessarily hurt anything, reading the document (even
just partially) _before_ checking whether it has already been included
seems strange.  Why was that done?

Given that change, it's hard to be specific about the best place for
a fix for the problem at hand.

However, the first thing is that method getInputSource(...) should
probably be split so that your code to resolve a given URI reference
against the base URI is separate from creating an InputSource for a
given resolved URI.
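
One way such a split could look, as a sketch only: the class name
SchemaLocations and the method names resolveReference and
createInputSource are hypothetical and are not the actual
XSLogicalParser API. Only java.net.URI and org.xml.sax.InputSource are
real library classes.

```java
import java.net.URI;
import org.xml.sax.InputSource;

// Hypothetical helper; the names are illustrative and are not the
// actual XSLogicalParser API.
class SchemaLocations {
    // Step 1: turn a possibly relative reference into a non-relative URI.
    static URI resolveReference(URI baseUri, String reference) {
        return baseUri.resolve(reference);
    }

    // Step 2: create an InputSource for an already resolved URI.
    // InputSource(String) only stores the system ID; nothing is opened here.
    static InputSource createInputSource(URI resolved) {
        return new InputSource(resolved.toString());
    }

    public static void main(String[] args) {
        // Example URI only; not a path from the thread.
        URI base = URI.create("http://example.org/schemas/B.xsd");
        URI resolved = resolveReference(base, "A.xsd");
        System.out.println(createInputSource(resolved).getSystemId());
    }
}
```

With the two steps separated, the resolved URI is available for the
already-parsed check before any InputSource is created or read.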

Operations in method parse(XsESchema,String) on field parsedSchemas
should use resolved URIs and not unresolved URI references.  That
probably means that either:
- that method needs a baseURI parameter so that it can resolve any
   relative reference in the pSchemaLocation parameter into non-relative
   URI, or
- callers need to do the resolution before calling that method
- (or the recent code rearrangement needs to be partly undone).


I also think the check for whether to skip a document should occur
_before_ you perform lower-level parsing of the document.

It doesn't make sense to resolve the URI (with getInputSource(...)),
perform low-level parsing, and only then check the URI and ignore what
was parsed.  Shouldn't JaxMeXS resolve the URI and then check the URI
and then skip all parsing?
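
The resolve, check, then parse ordering can be sketched as follows.
This is an illustration under stated assumptions, not the JaxMeXS code:
the class CheckBeforeParse is hypothetical, "lowLevelParse" stands in
for whatever reads a document and returns the schemaLocation values of
its xsd:include elements, and the example.org URIs are made up.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch; not the JaxMeXS implementation.
class CheckBeforeParse {
    private final Set<URI> parsedSchemas = new HashSet<>();
    private final Function<URI, List<String>> lowLevelParse;
    int parseCount = 0; // number of documents actually read

    CheckBeforeParse(Function<URI, List<String>> lowLevelParse) {
        this.lowLevelParse = lowLevelParse;
    }

    void parse(URI baseUri, String schemaLocation) {
        URI resolved = baseUri.resolve(schemaLocation);        // 1. resolve
        if (!parsedSchemas.add(resolved)) {                    // 2. check and record
            return;                                            //    skip: nothing is read
        }
        parseCount++;
        for (String include : lowLevelParse.apply(resolved)) { // 3. parse
            parse(resolved, include);
        }
    }

    public static void main(String[] args) {
        // Two fake documents that include each other.
        Function<URI, List<String>> fake = uri ->
            uri.getPath().endsWith("/A.xsd") ? List.of("B.xsd") : List.of("A.xsd");
        CheckBeforeParse p = new CheckBeforeParse(fake);
        URI start = URI.create("http://example.org/schemas/A.xsd");
        p.parse(start, start.toString());
        System.out.println(p.parseCount); // prints 2
    }
}
```

The starting document goes through the same resolve-and-record step as
every included document, so the include of A.xsd from B.xsd is skipped
without any low-level parsing.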



> Believe it or not,  but what is currently in the code is sufficient to
> achieve what you want, if you are ready to add an EntityResolver and
> make sure, that *the lexically same* system ID's are given to the parser.

EntityResolver seems like the wrong tool for the job.

The purpose of an entity resolver is usually to map a requested URI
reference to the _content_ of a document, usually by getting (via an
InputStream or Reader), or at least pointing to (via a URI string),
a cached copy from somewhere other than the specified location.

However, all that is needed here is to map the requested URI reference
to a non-relative URI.


Related to all this, where does XSLogicalParser handle xml:base
attributes?  (Or are they handled in a different class?)


Daniel











Re: recursive includes hang JaxMeXS

Posted by Jochen Wiedmann <jo...@gmail.com>.
Daniel Barclay wrote:

> XML Schema specification says (part 1, section 4.2.1):
> 
>     Note: The above is carefully worded so that multiple <include>ing
>     of the same schema document will not constitute a violation of clause
>     2 of Schema Properties Correct (§3.15.6), but applications are allowed,
>     indeed encouraged, to avoid <include>ing the same schema document more
>     than once to forestall the necessity of establishing identity component
>     by component.
> 
> Although that is only a note, it clearly refers to the concept of the
> "same schema document."

JaxMeXS is unable to "establish identity component by component". You
might consider that as a lack of an important feature, but that's how it
is. If you volunteer for adding that feature, you are welcome.

As long as that is missing, you've got to make sure, that the parser
knows, that it is including the "same schema document". Currently, this
is done by ensuring that the system ID is *the same*. Not the same as in
"the same file in the filesystem" or as in "referring to the same URL",
but lexically the same. Again, if you don't like this as it is, you are
welcome to volunteer for a better solution. However, unlike as in the
case of the Javadocs, I clearly reserve the right to carefully review
and possibly refuse such changes. You might be able to convince another
developer, of course, or even a majority.

Believe it or not, but what is currently in the code is sufficient to
achieve what you want, if you are ready to add an EntityResolver and
make sure, that *the lexically same* system ID's are given to the parser.
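
A rough sketch of an EntityResolver along those lines. It assumes that
the schema parser passes each schemaLocation value to the resolver as
the system ID and then compares the system ID of the returned
InputSource; that is an assumption about JaxMeXS, not something the
thread confirms. The class name CanonicalSystemIdResolver and the
example.org base URI are made up; EntityResolver and InputSource are
the standard org.xml.sax types.

```java
import java.net.URI;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver: maps every reference to one canonical system ID string.
class CanonicalSystemIdResolver implements EntityResolver {
    private final URI baseUri;

    CanonicalSystemIdResolver(URI baseUri) {
        this.baseUri = baseUri;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        // Resolve against a fixed base so that every reference to the same
        // document yields the lexically same system ID.
        String canonical = baseUri.resolve(systemId).normalize().toString();
        return new InputSource(canonical);
    }

    public static void main(String[] args) {
        CanonicalSystemIdResolver r =
            new CanonicalSystemIdResolver(URI.create("http://example.org/schemas/"));
        System.out.println(r.resolveEntity(null, "A.xsd").getSystemId());
    }
}
```

A single fixed base only works when all schema documents are referenced
relative to the same directory; references relative to differing
document locations need the URI of the including document instead.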


Jochen



Re: recursive includes hang JaxMeXS

Posted by Daniel Barclay <da...@fgm.com>.
Oops.  I again forgot to add the JaxMe address when I sent this message.

--------------
Jochen Wiedmann wrote:

> I am sorry, but the problem below is clearly beyond the scope of the
> schema parser. 

I don't think so.  It seems that the XML Schema specification requires
that the parser address it.


XML Schema specification says (part 1, section 4.2.1):

     Note: The above is carefully worded so that multiple <include>ing
     of the same schema document will not constitute a violation of clause
     2 of Schema Properties Correct (§3.15.6), but applications are allowed,
     indeed encouraged, to avoid <include>ing the same schema document more
     than once to forestall the necessity of establishing identity component
     by component.

Although that is only a note, it clearly refers to the concept of the
"same schema document."


It's okay if the parser reads the document multiple times.  However, it
must not erroneously think that a single declaration is actually an
illegal multiple declaration (e.g., <xsd:element name="In1"/> in the
test case in my earlier message).  JaxMeXS currently does think there's
a duplicate declaration when there is not (assuming I correctly ported
the change to 0.3.1).


> The parser will never be able to know, that
> 
>   "file://.../a.xsd"
> 
> and
> 
>   "a.xsd"
> 
> are the same file. 

Why not?

It can certainly know whether the URI references "file://.../a.xsd" and
"a.xsd" resolve to the same (non-relative) URI.  The first is already a
URI, so that's the (non-relative) URI it resolves to.  The second is a
relative reference, so it has to be resolved against the appropriate base
URI (typically, the URI of the document containing the include directive)
to get a (non-relative) URI.

Some part of the parser already has to resolve URI references in order
to read in the referenced document.  (That already seems to be handled
in method XSLogicalParser.getInputSource(...).)

In fact, I used your URI resolution in getInputSource() to get
resolved, non-relative URIs and used those resolved URIs (and not the
original URI reference given in the include directive) with the
includedSchemas Set.  That seems to work correctly.


The question is what definition of "same document" the XML Schema
specification uses or assumes.  I can't check right now, but it
seems it has to be either a direct string comparison of resolved
URIs, or something very close.

(Specifically, I'm sure it would not be a string comparison of
unresolved URI references, and I'm sure it's nothing that requires
network access (e.g., checking whether two DNS host names map to
the same IP address).)
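
As an illustration of "something very close" to a string comparison,
the Java sketch below compares resolved and normalized java.net.URI
values. Whether the specification intends exactly this is not settled
in this thread; the class name SameDocument and the example.org URIs
are made up.

```java
import java.net.URI;

class SameDocument {
    // Treats two references as the same document when their resolved,
    // normalized URIs are equal. No network or file access is involved.
    static boolean sameDocument(URI base1, String ref1, URI base2, String ref2) {
        URI u1 = base1.resolve(ref1).normalize();
        URI u2 = base2.resolve(ref2).normalize();
        return u1.equals(u2);
    }

    public static void main(String[] args) {
        // Example URIs only.
        URI b = URI.create("http://example.org/schemas/B.xsd");
        URI c = URI.create("http://example.org/schemas/sub/C.xsd");
        System.out.println(sameDocument(b, "A.xsd", c, "../A.xsd")); // prints true
    }
}
```

java.net.URI.equals compares the URI components without resolving host
names, unlike java.net.URL.equals, which may perform name resolution.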




> These things are exactly what an EntityResolver is
> good for.

That's a different level.  I'm not talking about resolution from the
name of a resource to a copy of its data.  I'm talking about resolution
from a possibly-relative name to a non-relative name.



Daniel







Re: recursive includes hang JaxMeXS

Posted by Jochen Wiedmann <jo...@gmail.com>.

Hi, Daniel,

I am sorry, but the problem below is clearly beyond the scope of the
schema parser. The parser will never be able to know, that

  "file://.../a.xsd"

and

  "a.xsd"

are the same file. These things are exactly what an EntityResolver is
good for. Again, see the testRecursiveXsInclude() method for an example,
how the EntityResolver can be used.

Sorry,

Jochen


Daniel Barclay wrote:
> In addition to the problem below (in the quoted part), the parser
> doesn't seem to be resolving URI references to (non-relative) URIs
> before comparing them.
> 
> I tried handling the problem below by adding the URI of the originally
> parsed document to the set of documents to skip.  (When A is parsed,
> A includes B, and where B includes A, the parser should see that A has
> already been read and should be skipped.)
> 
> However, the URI reference used to find A.xsd originally was different
> from the URI reference used to try to include A.xsd from B.xsd.  The
> first URI reference was the (non-relative) URI "file:///.../A.xsd"; the
> second URI reference was the relative reference "A.xsd".  Since those URI
> references were not the same, the parser didn't recognize that the document
> had already been read.
> 
> It seems that when checking an include directive, the parser should
> resolve the given URI reference (a URI or a relative reference) against
> the appropriate base URL, so that if the given URI reference is a relative
> reference, it will be resolved into a full (non-relative) URI.
> 
> Daniel
> 
>> Jochen Wiedmann wrote:
>>
>>>> It seems JaxMeXS can't handle schema files that recursively include
>>>> each other (e.g., A.xsd includes B.xsd, B.xsd includes A.xsd).
>>>
>>>
>>>
>>>
>>> You are right. The check was implemented for xs:import, but not for
>>> xs:include. Fixed in the 0.4 branch and in HEAD.
>>
>>
>>
>> Does the fix just prevent the endless recursion, or does it also handle
>> inclusion correctly?
>>
>> I tried manually back-porting the changes to 0.3.1.  When I ran my
>> original test case, it did prevent endless recursion.
>>
>> However, when I add a declaration to _one_ of the schema files, I get
>> an error about duplicate definitions.
>>
>> Specifically, when I add a declaration to the file at which parsing
>> starts, I get the error.  When I add the declaration to a file
>> that is reached only by inclusion, I don't get the error.
>>
>> - A.xsd includes B.xsd
>> - B.xsd includes A.xsd
>> - A.xsd defines an element E
>> - starting parsing at A.xsd yields an error about E being a duplicate
>> - starting parsing at B.xsd does not yield such an error
>>
>> Assuming I didn't back-port the change wrong, it seems that the
>> parser forgets to include the starting file in the set of files
>> for which it should skip includes (because they have already been
>> read).
>>
>>
>>
>> Here is a pair of test files:
>>
>> -----------------------------------------------------------------------
>> RecursiveIndirectly1.xsd:
>> ----------
>> <xsd:schema
>>   xmlns:xsd="http://www.w3.org/2001/XMLSchema"
>>   >
>>   <xsd:include schemaLocation="RecursiveIndirectly2.xsd" />
>>       <xsd:element name="In1"/>
>>
>> </xsd:schema>
>> -----------------------------------------------------------------------
>> RecursiveIndirectly2.xsd:
>> ----------
>> <xsd:schema
>>   xmlns:xsd="http://www.w3.org/2001/XMLSchema"
>>   >
>>   <xsd:include schemaLocation="RecursiveIndirectly1.xsd" />
>>
>>   <!--xsd:element name="In2"/-->
>>
>> </xsd:schema>
>> --------------------------------------------------------------------
>>
>>
>> Daniel
>>
>>
>>
> 
> 




Re: recursive includes hang JaxMeXS

Posted by Daniel Barclay <da...@fgm.com>.
In addition to the problem below (in the quoted part), the parser
doesn't seem to be resolving URI references to (non-relative) URIs
before comparing them.

I tried handling the problem below by adding the URI of the originally
parsed document to the set of documents to skip.  (When A is parsed,
A includes B, and where B includes A, the parser should see that A has
already been read and should be skipped.)

However, the URI reference used to find A.xsd originally was different
from the URI reference used to try to include A.xsd from B.xsd.  The
first URI reference was the (non-relative) URI "file:///.../A.xsd"; the
second URI reference was the relative reference "A.xsd".  Since those URI
references were not the same, the parser didn't recognize that the document
had already been read.

It seems that when checking an include directive, the parser should
resolve the given URI reference (a URI or a relative reference) against
the appropriate base URI, so that if the given URI reference is a relative
reference, it will be resolved into a full (non-relative) URI.

Daniel

> Jochen Wiedmann wrote:
> 
>>> It seems JaxMeXS can't handle schema files that recursively include
>>> each other (e.g., A.xsd includes B.xsd, B.xsd includes A.xsd).
>>
>>
>>
>> You are right. The check was implemented for xs:import, but not for
>> xs:include. Fixed in the 0.4 branch and in HEAD.
> 
> 
> Does the fix just prevent the endless recursion, or does it also handle
> inclusion correctly?
> 
> I tried manually back-porting the changes to 0.3.1.  When I ran my
> original test case, it did prevent endless recursion.
> 
> However, when I add a declaration to _one_ of the schema files, I get
> an error about duplicate definitions.
> 
> Specifically, when I add a declaration to the file at which parsing
> starts, I get the error.  When I add the declaration to a file
> that is reached only by inclusion, I don't get the error.
> 
> - A.xsd includes B.xsd
> - B.xsd includes A.xsd
> - A.xsd defines an element E
> - starting parsing at A.xsd yields an error about E being a duplicate
> - starting parsing at B.xsd does not yield such an error
> 
> Assuming I didn't back-port the change wrong, it seems that the
> parser forgets to include the starting file in the set of files
> for which it should skip includes (because they have already been
> read).
> 
> 
> 
> Here is a pair of test files:
> 
> -----------------------------------------------------------------------
> RecursiveIndirectly1.xsd:
> ----------
> <xsd:schema
>   xmlns:xsd="http://www.w3.org/2001/XMLSchema"
>   >
>   <xsd:include schemaLocation="RecursiveIndirectly2.xsd" />
>     
>   <xsd:element name="In1"/>
> 
> </xsd:schema>
> -----------------------------------------------------------------------
> RecursiveIndirectly2.xsd:
> ----------
> <xsd:schema
>   xmlns:xsd="http://www.w3.org/2001/XMLSchema"
>   >
>   <xsd:include schemaLocation="RecursiveIndirectly1.xsd" />
> 
>   <!--xsd:element name="In2"/-->
> 
> </xsd:schema>
> --------------------------------------------------------------------
> 
> 
> Daniel
> 
> 
> 




Re: recursive includes hang JaxMeXS

Posted by Jochen Wiedmann <jo...@gmail.com>.
Daniel Barclay wrote:

> It seems JaxMeXS can't handle schema files that recursively include
> each other (e.g., A.xsd includes B.xsd, B.xsd includes A.xsd).

You are right. The check was implemented for xs:import, but not for
xs:include. Fixed in the 0.4 branch and in HEAD.

