Posted to j-users@xerces.apache.org by Wulf Berschin <be...@dosco.de> on 2009/10/01 13:26:41 UTC

Re: Doctype declarations in fragments

Hi Michael,

I'm trying to follow your approach by creating an InputStream 
wrapper class which removes the doctype declaration. AFAIK that means:

- write a custom EntityResolver which resolves file entities

- write a doctype-removing reader which creates a FileInputStream for 
the resolved file entity, analyzes the header (BOM, XML declaration) to 
set up an InputStreamReader with the correct encoding, and then skips 
any doctype declaration that follows.
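
Roughly, I imagine the two steps like this. It is only a minimal sketch: 
the class names DoctypeStripper and DoctypeStrippingResolver are made up, 
it assumes the fragments are UTF-8 and end in ".xml" instead of really 
analyzing the BOM and the encoding declaration, it reads the whole file 
into memory rather than filtering a stream, and it does not cope with 
comments containing quotes inside an internal subset.

```java
import java.io.IOException;
import java.io.StringReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical helper: removes a DOCTYPE declaration from the prolog of a fragment. */
final class DoctypeStripper {

    /** Returns xml without its leading DOCTYPE, or xml unchanged if there is none. */
    static String strip(String xml) {
        int i = 0;
        while (i < xml.length()) {
            if (Character.isWhitespace(xml.charAt(i))) {
                i++;
            } else if (xml.startsWith("<?", i)) {        // XML/text declaration or PI
                int end = xml.indexOf("?>", i);
                if (end < 0) return xml;
                i = end + 2;
            } else if (xml.startsWith("<!--", i)) {      // comment in the prolog
                int end = xml.indexOf("-->", i);
                if (end < 0) return xml;
                i = end + 3;
            } else if (xml.startsWith("<!DOCTYPE", i)) {
                int end = endOfDoctype(xml, i);
                return end < 0 ? xml : xml.substring(0, i) + xml.substring(end);
            } else {
                return xml;                              // content starts, no DOCTYPE
            }
        }
        return xml;
    }

    /** Index just past the '>' that closes the DOCTYPE starting at start, or -1. */
    private static int endOfDoctype(String xml, int start) {
        int depth = 0;     // greater than 0 while inside the internal subset [...]
        char quote = 0;    // current quote character, 0 when outside a literal
        for (int i = start; i < xml.length(); i++) {
            char c = xml.charAt(i);
            if (quote != 0) {
                if (c == quote) quote = 0;
            } else if (c == '"' || c == '\'') {
                quote = c;
            } else if (c == '[') {
                depth++;
            } else if (c == ']') {
                depth--;
            } else if (c == '>' && depth == 0) {
                return i + 1;
            }
        }
        return -1;
    }
}

/** Hypothetical resolver: serves file entities with their DOCTYPE removed. */
final class DoctypeStrippingResolver implements EntityResolver {
    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws IOException {
        // Assumption: fragments are absolute file: URIs ending in ".xml";
        // everything else (e.g. the DTD itself) is left to the parser.
        if (systemId == null || !systemId.startsWith("file:") || !systemId.endsWith(".xml")) {
            return null;
        }
        // Assumption: UTF-8. A real reader would inspect the BOM and encoding declaration.
        byte[] bytes = Files.readAllBytes(Paths.get(URI.create(systemId)));
        String xml = new String(bytes, StandardCharsets.UTF_8);
        if (xml.startsWith("\uFEFF")) xml = xml.substring(1);   // drop a decoded BOM
        InputSource source = new InputSource(new StringReader(DoctypeStripper.strip(xml)));
        source.setSystemId(systemId);   // keep the base URI for relative references
        return source;
    }
}
```

The resolver would then be registered on the XMLReader via 
setEntityResolver(new DoctypeStrippingResolver()) before parsing the master.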

Is that correct? To avoid using XNI as described in my last 
mail, I would have to duplicate parser functionality, hmmm.

Or is there a less invasive way to hook into the parser?

Thank you for your help!

Wulf



Michael Glavassevich wrote:
> Hi Wulf,
> 
> Wulf Berschin <be...@dosco.de> wrote on 09/16/2009 02:48:52 AM:
> 
>  > Hi,
>  >
>  > For ease of editing we have a doctype declaration in each (file)
>  > fragment. When I parse the full master (resolving the fragments), Xerces
>  > throws a fatal error (Doctype not allowed in content) and goes into an
>  > endless loop when the continue-after-fatal-error switch is set.
>  >
>  > How can I make Xerces ignore doctype declarations occurring in content
>  > (alternatively, in the header of file entities)?
> 
> You can't. Xerces (or any conformant XML parser for that matter) will 
> not ignore or skip over any malformed / misplaced constructs in the 
> document. Parsers are required to report the fatal error. The 
> "continue-after-fatal-error" feature which allows Xerces to keep going 
> is unreliable and can lead to a catastrophic failure (e.g. NPE, infinite 
> loop, stack overflow, out of memory, etc...) if you turn it on. It's to 
> be used with extreme caution and should never be enabled in a finished 
> component / product.
> 
> You either need to remove these DOCTYPEs from the files or filter them 
> out at a lower level (e.g. a wrapper InputStream which doesn't return 
> the DOCTYPE from read()).
> 
>  > Wulf
>  >
>  > ---------------------------------------------------------------------
>  > To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
>  > For additional commands, e-mail: j-users-help@xerces.apache.org
> 
> Thanks.
> 
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: mrglavas@ca.ibm.com
> E-mail: mrglavas@apache.org
> 





Re: Doctype declarations in fragments

Posted by Wulf Berschin <be...@dosco.de>.
Hi Michael,

I think our scenario is not uncommon: a modular DTD, documents with 
several hundred pages (XML size 2MB), and possibly many authors working on 
the same document, which is stored and can be retrieved as fragments either 
in the DMS (where we could solve this problem at the file level) or in the 
file system. The latter is where the problem arises: we have a master 
document which pulls in its fragments via file entities.

Since the document is so big, the author usually works on fragments 
(that's why the fragments have a doctype too). Only for reference 
checking does he need to open the master. BTW: our XML editor, Epic, has 
always leniently ignored the fragment doctypes. (Optionally, it uses 
proprietary comments instead of real doctypes for fragments.)

We will use the XNI configuration mentioned earlier with a customized 
scanForDoctypeHook. Please consider adopting this configuration in 
Xerces.

Greetings

Wulf

Michael Glavassevich wrote:
> Hi Wulf,
> 
> I didn't say it would be easy, just that you're on shaky ground if 
> your solution involves hooking into or extending Xerces' internals.
> 
> There might be other ways to deal with this, for example using XInclude 
> instead of entity references and/or removing the DOCTYPEs from the files 
> and programmatically inserting them when appropriate through 
> EntityResolver2.getExternalSubset() [1], though I don't know much about 
> your scenario and how much flexibility you have with changing the data.
> 
> Thanks.
> 
> [1] 
> http://xerces.apache.org/xerces2-j/javadocs/api/org/xml/sax/ext/EntityResolver2.html
> 
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: mrglavas@ca.ibm.com
> E-mail: mrglavas@apache.org
> 


Re: Doctype declarations in fragments

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Hi Wulf,

I didn't say it would be easy, just that you're on shaky ground if your
solution involves hooking into or extending Xerces' internals.

There might be other ways to deal with this, for example using XInclude
instead of entity references and/or removing the DOCTYPEs from the files
and programmatically inserting them when appropriate through
EntityResolver2.getExternalSubset() [1], though I don't know much about
your scenario and how much flexibility you have with changing the data.
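
To illustrate the second suggestion, here is a rough sketch, assuming the
DOCTYPEs have been removed from the fragment files and the DTD location is
known to the application. The names DefaultDtdResolver and FragmentParsing
are made up for this example, and the XInclude switch only matters if the
master is changed to use xi:include instead of entity references.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.EntityResolver2;

/** Hypothetical resolver: supplies a DTD for documents that declare no external subset. */
final class DefaultDtdResolver implements EntityResolver2 {
    private final String dtdSystemId;   // assumption: URI of the modular DTD

    DefaultDtdResolver(String dtdSystemId) {
        this.dtdSystemId = dtdSystemId;
    }

    @Override
    public InputSource getExternalSubset(String name, String baseURI) {
        // Invoked by the parser for documents that do not define an external subset.
        return new InputSource(dtdSystemId);
    }

    @Override
    public InputSource resolveEntity(String name, String publicId,
                                     String baseURI, String systemId) {
        return null;   // fall back to the parser's default resolution
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        return null;   // fall back to the parser's default resolution
    }
}

/** Hypothetical wiring of the resolver into a SAX parser. */
final class FragmentParsing {
    static XMLReader newReader(String dtdSystemId) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setXIncludeAware(true);   // only relevant when xi:include is used
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setEntityResolver(new DefaultDtdResolver(dtdSystemId));
        return reader;
    }
}
```

A fragment without a DOCTYPE could then be parsed on its own with
FragmentParsing.newReader(dtdUri).parse(fragmentUri), where both URIs are
whatever your application uses.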

Thanks.

[1]
http://xerces.apache.org/xerces2-j/javadocs/api/org/xml/sax/ext/EntityResolver2.html

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org
