Posted to c-dev@xerces.apache.org by Elisha Berns <e....@computer.org> on 2005/10/31 04:30:45 UTC

Xerces issues handling recursive schema includes

Hi,

I'm trying to determine both what Xerces does when it encounters
recursive schema includes and what to do about it because it causes some
problems.

It appears that the XercesC schema parser creates multiple XSxxx type
objects for the same type if the schema files are included recursively.
In addition it would appear that the load time for a schema is much,
much slower in the presence of recursive includes.

I get one 'proper' globally defined type object but multiple duplicates
when the type appears as a contained type (in a complexType definition).
The only way I know this now is because I get different pointer values
for the XSxxx object when this situation arises, even though they end up
pointing to the same type.

Does anybody know firsthand whether there is any internal mechanism to
prevent this from happening (apparently not), and what can be done, at
present, to prevent this duplication from occurring?

It has occurred to me that it might be a good idea to create a new type
of parser warning specifically regarding the issue of 'recursive
includes'.  This of course only makes sense if there is a strong
consensus that this is a classic anti-pattern of XML Schema development
and should be avoided at all costs.  I can see more or less how to
implement it outside of Xerces by constructing a dependency graph of the
schema files and testing for back-edges.  So my question about this side
of things is whether there is any desire to make this test a built in
part of the parser to make the parser smarter about these things?
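
For what it's worth, the external check described above can be sketched
in a few lines of standard C++. This is a minimal sketch, not anything
Xerces provides: it assumes the include relationships have already been
extracted into a map from each schema file to the files it includes, and
the names IncludeGraph and findRecursiveIncludes are hypothetical. A
depth-first search marks each file as "in progress" while its includes
are being walked; an include that points back at an in-progress file is
a back-edge, i.e. a recursive include.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical representation: each schema file maps to the files it includes.
using IncludeGraph = std::map<std::string, std::vector<std::string>>;
using Edge = std::pair<std::string, std::string>;  // (including file, included file)

// Unvisited must be the first enumerator: std::map::operator[] value-initializes
// missing entries to it.
enum class Mark { Unvisited, InProgress, Done };

// Depth-first walk; an edge to a file that is still InProgress is a back-edge.
static void walkIncludes(const IncludeGraph& graph, const std::string& file,
                         std::map<std::string, Mark>& marks,
                         std::vector<Edge>& backEdges) {
    marks[file] = Mark::InProgress;
    auto it = graph.find(file);
    if (it != graph.end()) {
        for (const std::string& included : it->second) {
            Mark m = marks[included];
            if (m == Mark::InProgress) {
                backEdges.emplace_back(file, included);  // recursive include
            } else if (m == Mark::Unvisited) {
                walkIncludes(graph, included, marks, backEdges);
            }
        }
    }
    marks[file] = Mark::Done;
}

// Returns every include that closes a cycle; empty if there is no recursion.
std::vector<Edge> findRecursiveIncludes(const IncludeGraph& graph) {
    std::map<std::string, Mark> marks;
    std::vector<Edge> backEdges;
    for (const auto& entry : graph) {
        if (marks[entry.first] == Mark::Unvisited) {
            walkIncludes(graph, entry.first, marks, backEdges);
        }
    }
    return backEdges;
}
```

If a.xsd includes b.xsd and c.xsd, b.xsd includes c.xsd, and c.xsd
includes a.xsd, the function reports the single edge (c.xsd, a.xsd);
without that last include the result is empty.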

Thanks for some feedback here.

Elisha Berns
e.berns@computer.org
tel. (310) 556 - 8332
fax (310) 556 - 2839




---------------------------------------------------------------------
To unsubscribe, e-mail: c-dev-unsubscribe@xerces.apache.org
For additional commands, e-mail: c-dev-help@xerces.apache.org

Re: Reducing the size of Xerces, static build

Posted by Alberto Massari <am...@datadirect.com>.

Hi Victor,
if you are using Visual C++ .NET 2003, the latest 
Xerces 2.7 distribution has a 'Static build' 
configuration; if you are using MinGW, you should 
be able to get a static library simply by invoking gcc over the *.o files.

Alberto

At 11.30 31/10/2005 -0600, Victor Broto wrote:
>We are releasing a new application making use of 
>Xerces, but still our executable size is too big.
>
>We are trying to ship a final version as small 
>as possible (we'd like to get a compressed size 
><5 MB) and, besides other dependencies, we have 
>one on xerces-c_2_5_0.dll, that is about 4.5 MB.
>
>Can somebody figure out a simple way to reduce the dll size?
>
>Also we are considering to use a static build, 
>that would remove our dependency on the .dll, 
>although increasing the exe size. However, after 
>browsing the documentation and the lists, I 
>haven't been able to know how to build Xerces 
>statically, so we cannot test whether removing 
>the .dll is worthy in that sense.
>
>Is there any way to get a static build of Xerces?
>
>Thanks,
>
>Victor

Reducing the size of Xerces, static build

Posted by Victor Broto <vb...@gmail.com>.

We are releasing a new application that uses Xerces, but our
executable is still too big.

We are trying to ship a final version that is as small as possible (we'd
like a compressed size <5 MB) and, among other dependencies, we have
one on xerces-c_2_5_0.dll, which is about 4.5 MB.

Can somebody suggest a simple way to reduce the DLL size?

We are also considering a static build, which would remove our
dependency on the .dll, although it would increase the exe size. However,
after browsing the documentation and the lists, I haven't been able to
find out how to build Xerces statically, so we cannot test whether
removing the .dll is worthwhile.

Is there any way to get a static build of Xerces?

Thanks,

Victor


RE: Xerces issues handling recursive schema includes

Posted by Elisha Berns <e....@computer.org>.

Thanks for clearing this up for me, and thanks also for the other item
you responded to a few weeks ago.

Elisha

RE: Xerces issues handling recursive schema includes

Posted by Alberto Massari <am...@datadirect.com>.

Hi Elisha,

At 19.39 31/10/2005 -0800, Elisha Berns wrote:
>Neil,
>
>I made a naïve implementation of an EntityResolver that only uses the
>absolute paths of the SystemIds it receives, but this doesn't work for
>the following reasons:
>
>The main schema file includes ~20 other schema files which are located
>in other directories using relative paths and each one of those 20 files
>includes ~20 files (which can be included multiple times) also using
>relative paths.  So if I use XMLPlatformUtils::weavePaths() using the
>base path from the main schema file being parsed with all of those
>relative paths in the other included schema files, the results are
>invalid paths.
>
>The issue is that the EntityResolver needs to know what base path to use
>when it gets a SystemId in order to correctly resolve it to an absolute
>path. And the base path keeps changing as the SAX2XMLReader parses
>through the paths it finds in schemaLocation attributes.
>
>Is there any way to get this information (the correct base path to use
>per relative path) without having to pre-parse all the schema files for
>their schemaLocation attributes?  Surely there must be some simpler way
>to prevent the parser from mistaking two or more relative SystemIds as
>different SystemIds?

To overcome this limitation there is an XMLEntityResolver interface that
you should register using SAX2XMLReaderImpl::setXMLEntityResolver (you may
have to cast your SAX2XMLReader to the implementation class). In your
XMLEntityResolver-derived class you should implement
resolveEntity(XMLResourceIdentifier*), resolving the entity using the
getSystemId() and getBaseURI() accessors.
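
The resolution step suggested above can be modeled without Xerces. The
following is a minimal sketch under stated assumptions: resolveSystemId
and SeenSchemas are hypothetical names, plain std::string values stand in
for what getSystemId() and getBaseURI() would return, and only absolute
'/'-separated base paths are handled (no URI schemes, no Windows drive
letters). The idea is that each relative system ID is resolved against
the document that referenced it, and the normalized absolute result is
used as the one identifier for that file.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Resolve systemId against the document that referenced it (baseUri), then
// collapse "." and ".." segments so that every spelling of the same file
// yields the same string. Assumes baseUri is an absolute '/'-separated path.
std::string resolveSystemId(const std::string& baseUri, const std::string& systemId) {
    std::string joined;
    if (!systemId.empty() && systemId[0] == '/') {
        joined = systemId;  // already absolute
    } else {
        // Keep the directory part of the base, including the trailing '/'.
        std::string::size_type slash = baseUri.rfind('/');
        std::string dir = (slash == std::string::npos)
                              ? std::string()
                              : baseUri.substr(0, slash + 1);
        joined = dir + systemId;
    }

    std::vector<std::string> parts;
    std::istringstream in(joined);
    std::string segment;
    while (std::getline(in, segment, '/')) {
        if (segment.empty() || segment == ".") continue;
        if (segment == "..") {
            if (!parts.empty()) parts.pop_back();
        } else {
            parts.push_back(segment);
        }
    }

    std::string result;
    for (const std::string& p : parts) result += "/" + p;
    return result.empty() ? std::string("/") : result;
}

// Hypothetical stand-in for a "has this schema been loaded already?" check,
// keyed on the normalized absolute identifier.
class SeenSchemas {
public:
    // Returns true the first time a file is requested, false on repeats.
    bool firstVisit(const std::string& baseUri, const std::string& systemId) {
        return seen_.insert(resolveSystemId(baseUri, systemId)).second;
    }

private:
    std::set<std::string> seen_;
};
```

With this, shared/base.xsd referenced from /schemas/main.xsd and
../shared/base.xsd referenced from /schemas/types/common.xsd both resolve
to /schemas/shared/base.xsd, so the second request is recognized as a
repeat. Resolving ../shared/base.xsd against /schemas/main.xsd instead
gives /shared/base.xsd, which is the kind of invalid path that results
from using one fixed base for every include.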

Hope this helps,
Alberto

RE: Xerces issues handling recursive schema includes

Posted by Elisha Berns <e....@computer.org>.

Neil,

I made a naïve implementation of an EntityResolver that only uses the
absolute paths of the SystemIds it receives, but this doesn't work for
the following reason:

The main schema file includes ~20 other schema files, located in other
directories, using relative paths.  Each of those 20 files in turn
includes ~20 files (some of which are included multiple times), also
using relative paths.  So if I call XMLPlatformUtils::weavePaths() with
the base path of the main schema file and the relative paths found in
the other included schema files, the results are invalid paths.

The issue is that the EntityResolver needs to know what base path to use
when it gets a SystemId in order to correctly resolve it to an absolute
path. And the base path keeps changing as the SAX2XMLReader parses
through the paths it finds in schemaLocation attributes.

Is there any way to get this information (the correct base path to use
per relative path) without having to pre-parse all the schema files for
their schemaLocation attributes?  Surely there must be some simpler way
to prevent the parser from mistaking two or more relative SystemIds as
different SystemIds?
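
To make the failure concrete, here is a minimal sketch of resolving a
relative SystemId against the document that contains it rather than
against the main schema file.  It is plain C++17 and does not call
Xerces; resolveAgainstIncluder is a hypothetical helper, and it assumes
local file paths only (not URLs).  How the including document's path
reaches the resolver is left open here.  The XMLEntityResolver
interface, whose XMLResourceIdentifier argument exposes getBaseURI(),
may supply it, but that should be checked against the Xerces-C++
version in use.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Resolve `relativeId` (as written in a schemaLocation attribute) against
// the path of the schema document that contains it, not the root schema.
// Purely lexical: no file-system access, so "a/../b" collapses to "b".
// Assumption: local file paths only; URLs are not handled.
std::string resolveAgainstIncluder(const std::string& includerPath,
                                   const std::string& relativeId)
{
    fs::path target(relativeId);
    if (target.is_relative())
        target = fs::path(includerPath).parent_path() / target;
    return target.lexically_normal().generic_string();
}
```

With this, "../common/b.xsd" included from /schemas/types/a.xsd and from
/schemas/other/c.xsd both come out as /schemas/common/b.xsd, whereas
resolving the same string against /schemas/main.xsd gives /common/b.xsd,
which is the kind of invalid path described above.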

Thanks,

Elisha


RE: Xerces issues handling recursive schema includes

Posted by Elisha Berns <e....@computer.org>.

OK,

Well, ask a stupid question, get a good answer :)

Elisha

> -----Original Message-----
> From: Neil Graham [mailto:neilg@ca.ibm.com]
> Sent: Monday, October 31, 2005 12:54 PM
> To: c-dev@xerces.apache.org
> Subject: RE: Xerces issues handling recursive schema includes

RE: Xerces issues handling recursive schema includes

Posted by Neil Graham <ne...@ca.ibm.com>.

Hi Elisha,




"Elisha Berns" <e....@computer.org> wrote on 10/31/2005 02:52:49 PM:

> Hi Neil,
> 
> Thanks for the reply.  Just one question about this.  The schema in
> question includes tens of other schema files using relative URIs (they
> all exist in a large directory structure).  So by what you write
> 
> "that system identifier fields are always set to the same value"
> 
> presumably you mean that any SystemId will have the same exact URI and
> that the test for similarity is to first resolve the absolute path name
> for an InputSource?
> 
> But how would you do this if a schema is retrieved over the web and it
> includes other schema files with relative paths?

Well, if you can't figure out when you've encountered a document for the 
second time, you can be assured the parser won't be able to figure it out 
either.  :) 

The means of using an EntityResolver to map documents to unique system 
URIs does assume that the application knows, or knows how to find out, 
enough about the graph of schemas to be able to tell when it's encountered 
something before.  Since the application is more special-purpose than the 
parser, this is usually an acceptable assumption.  Your mileage may vary,
but there's not much the parser can do at its level to help you if this 
gets you nowhere at all.
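
As a rough illustration of what "knowing enough about the graph of
schemas" could look like on the application side, the sketch below
walks an include graph depth-first and reports back-edges, the test
suggested at the start of this thread.  It is plain C++ and does not
call Xerces.  IncludeGraph and findCircularIncludes are hypothetical
names, and the graph is assumed to be keyed by IDs the application has
already made unique per document.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical application-side description of the schema graph:
// unique document ID -> IDs of the documents it includes.
using IncludeGraph = std::map<std::string, std::vector<std::string>>;
using Edge = std::pair<std::string, std::string>;

enum class Mark { InProgress, Done };

// Depth-first walk.  An edge to a document that is still InProgress is a
// back-edge, i.e. a circular include.
static void visit(const IncludeGraph& graph, const std::string& doc,
                  std::map<std::string, Mark>& marks,
                  std::vector<Edge>& backEdges)
{
    marks[doc] = Mark::InProgress;
    auto it = graph.find(doc);
    if (it != graph.end()) {
        for (const std::string& next : it->second) {
            auto seen = marks.find(next);
            if (seen == marks.end())
                visit(graph, next, marks, backEdges);
            else if (seen->second == Mark::InProgress)
                backEdges.emplace_back(doc, next);
            // Mark::Done: reached earlier by another route; nothing to do.
        }
    }
    marks[doc] = Mark::Done;
}

// Returns every (includer, included) pair that closes a cycle, starting
// the walk from `root`.
std::vector<Edge> findCircularIncludes(const IncludeGraph& graph,
                                       const std::string& root)
{
    std::map<std::string, Mark> marks;
    std::vector<Edge> backEdges;
    visit(graph, root, marks, backEdges);
    return backEdges;
}
```

A document reached a second time through a different route (a diamond)
is not reported, since it is already Done; only an include that points
back into the chain currently being walked counts as circular.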

Cheers!
Neil
Neil Graham
Manager, C++ Compiler Front-End and Runtime Development
IBM Toronto Lab
Phone:  905-413-3519, T/L 969-3519
E-mail:  neilg@ca.ibm.com



RE: Xerces issues handling recursive schema includes

Posted by Elisha Berns <e....@computer.org>.

Hi Neil,

Thanks for the reply.  Just one question about this.  The schema in
question includes tens of other schema files using relative URIs (they
all exist in a large directory structure).  So by what you write

"that system identifier fields are always set to the same value"

presumably you mean that any SystemId will have the same exact URI and
that the test for similarity is to first resolve the absolute path name
for an InputSource?

But how would you do this if a schema is retrieved over the web and it
includes other schema files with relative paths?

Thanks,

Elisha

> -----Original Message-----
> From: Neil Graham [mailto:neilg@ca.ibm.com]
> Sent: Monday, October 31, 2005 11:08 AM
> To: c-dev@xerces.apache.org
> Subject: Re: Xerces issues handling recursive schema includes

Re: Xerces issues handling recursive schema includes

Posted by Neil Graham <ne...@ca.ibm.com>.

Hi Elisha,

Recursive, or circular, includes are supposed to be handled properly by a 
schema parser.  While I'm not really active anymore on the code base, this 
question does come up periodically, usually in the context of a set of 
schemas that get loaded purely via schemaLocation hints, or via a user's 
EntityResolver which doesn't set system identifiers on the InputSources it 
returns to the parser.  The usual way to get around this is to register a 
custom EntityResolver instance, and take good care that system identifier 
fields are always set to the same value when an InputSource is returned. 
It's best if this is absolute, but I think a relative URI should work too. 
The reason this is important is that the parser uses system identifiers
internally to figure out whether it's processed a schema document before.
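
A toy model of why the identifier has to be spelled the same way every
time is sketched below.  ProcessedDocuments is a hypothetical stand-in,
not the Xerces implementation, and it assumes the bookkeeping amounts
to comparing identifier strings exactly.

```cpp
#include <set>
#include <string>

// Stand-in for the parser's bookkeeping (an assumption, not Xerces code):
// documents are recognised only by exact comparison of identifier strings.
class ProcessedDocuments {
public:
    // Returns true the first time an identifier is seen, false on repeats.
    bool firstVisit(const std::string& systemId)
    {
        return seen_.insert(systemId).second;
    }

private:
    std::set<std::string> seen_;
};
```

Under that assumption, "types/a.xsd" and "../schemas/types/a.xsd" count
as two first visits even if they name the same file, while a second
"types/a.xsd" is recognised as a repeat.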

Cheers,
Neil
Neil Graham
Manager, C++ Compiler Front-End and Runtime Development
IBM Toronto Lab
Phone:  905-413-3519, T/L 969-3519
E-mail:  neilg@ca.ibm.com