Posted to j-users@xerces.apache.org by Wulf Berschin <be...@dosco.de> on 2009/09/16 08:48:52 UTC

Doctype declarations in fragments

Hi,

For ease of editing we have a DOCTYPE declaration in each (file) 
fragment. When I parse the full master (with the fragments resolved), Xerces 
throws a fatal error (DOCTYPE not allowed in content) and goes into an 
endless loop when the continue-after-fatal-error feature is set.

How can I make Xerces ignore DOCTYPE declarations occurring in content 
(or rather, in the header of file entities)?

Wulf


---------------------------------------------------------------------
To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-users-help@xerces.apache.org
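
The failure described above can be reproduced without any fragment files. The following is a minimal sketch, assuming the JAXP SAX parser that ships with the JDK rather than a standalone Xerces build. An internal entity (the made-up name `chapter1`) stands in for the external file entity, and `FragmentDoctypeRepro` is a hypothetical helper, not part of Xerces.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class FragmentDoctypeRepro {
    private FragmentDoctypeRepro() {}

    /** Builds a master document whose entity expands to the given fragment text. */
    static String master(String fragment) {
        // Assumption: the fragment contains no single quote, '%' or '&',
        // so it can be embedded as an entity value literal unchanged.
        return "<!DOCTYPE book [<!ENTITY chapter1 '" + fragment + "'>]>"
             + "<book>&chapter1;</book>";
    }

    /** Returns true if the parser rejects the document with a SAX error. */
    static boolean failsToParse(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            try {
                // DefaultHandler rethrows fatal errors as SAXParseException.
                parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
                return false;
            } catch (SAXException e) {
                return true;
            }
        } catch (Exception e) {
            // Parser setup or I/O problems are not what this sketch is about.
            throw new IllegalStateException(e);
        }
    }
}
```

Calling `failsToParse(master("<!DOCTYPE chapter SYSTEM \"chapter.dtd\"><chapter/>"))` exercises the case from the question, while `master("<chapter/>")` is the same document without the misplaced declaration.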

RE: Doctype declarations in fragments

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

FYI... you need to already be browsing resolved issues [1] for Xerces-J for
this link [2] to work.

[1]
https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&pid=10520&status=5
[2]
https://issues.apache.org/jira/secure/IssueNavigator.jspa?sorter/field=updated&sorter/order=DESC

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

RE: Doctype declarations in fragments

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

Gary Gregory <GG...@seagullsoftware.com> wrote on 09/16/2009 02:24:35 PM:

> For the curious:
>
> It has been, almost to the day, *two* years since the release of 2.9.1,
> so the "if one of us modifies or removes..." is a big if :) as far as
> changes showing up in a new release. Unless the longer it has been, the
> more likely a new release is...

More likely that we've made lots of changes / additions, including many
modifications to the internals.

> Are we to expect a release? Since 2.9.1 (9/14/2007):

The developers have been talking about a December release, though we
haven't voted on that. We've been busy implementing XML Schema 1.1 among
other things.

> 19 issues have been marked "Resolved", of those, 13 are marked "Fixed":
>
> https://issues.apache.org/jira/sr/jira.issueviews:searchrequest-printable/temp/SearchRequest.html?pid=10520&resolution=1&customfield_12310221%3Aafter=14%2FSep%2F07&sorter/field=issuekey&sorter/order=DESC&tempMax=1000

There is something wrong with your query. >75 JIRA issues [1] have been
resolved since Xerces 2.9.1.

> There are 206 issues reported "Unscheduled".
>
> Luckily for us, we have not been bitten by any bugs, which is a
> testament to Xerces. OTOH, it's been a real pain not having XPath 2.0
> and XSLT 2.0 in Xalan.
>
> Gary Gregory
> Senior Software Engineer
> Seagull Software
> ggregory@seagullsoftware.com
> www.seagullsoftware.com

Thanks.

[1]
https://issues.apache.org/jira/secure/IssueNavigator.jspa?sorter/field=updated&sorter/order=DESC


Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

RE: Doctype declarations in fragments

Posted by Gary Gregory <GG...@seagullsoftware.com>.

For the curious:

It has been, almost to the day, *two* years since the release of 2.9.1, so the "if one of us modifies or removes..." is a big if :) as far as changes showing up in a new release. Unless the longer it has been, the more likely a new release is...

Are we to expect a release? Since 2.9.1 (9/14/2007):

19 issues have been marked "Resolved", of those, 13 are marked "Fixed":

https://issues.apache.org/jira/sr/jira.issueviews:searchrequest-printable/temp/SearchRequest.html?pid=10520&resolution=1&customfield_12310221%3Aafter=14%2FSep%2F07&sorter/field=issuekey&sorter/order=DESC&tempMax=1000

There are 206 issues reported "Unscheduled".

Luckily for us, we have not been bitten by any bugs, which is a testament to Xerces. OTOH, it's been a real pain not having XPath 2.0 and XSLT 2.0 in Xalan.

Gary Gregory
Senior Software Engineer
Seagull Software
ggregory@seagullsoftware.com
www.seagullsoftware.com

Re: Doctype declarations in fragments

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

Hi Wulf,

Haven't looked at the specifics of what you did but I'm never fond of any
solution that involves extending or hooking into Xerces' internals. I'm not
referring to XNI itself (which is a stable Xerces API) but rather the
internal implementation classes you've chosen to use / extend. Your code
could break at any time in the future if one of us modifies or removes any
of those classes / methods.

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

Re: Doctype declarations in fragments

Posted by Wulf Berschin <be...@dosco.de>.

Hi Michael,

thank you for your response. I'll try your approach tomorrow...

Meanwhile I was playing a little bit with XNI and the Parser 
Configuration and finally got Xerces to ignore this, indeed misplaced, 
construct. What I did is:

---

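import org.apache.xerces.parsers.AbstractSAXParser;

/** SAX parser that uses the DOCTYPE-skipping configuration below. */
public class FragmentDoctypeSkipParser
     extends AbstractSAXParser {

     public FragmentDoctypeSkipParser() {
         super(new FragmentDoctypeSkipConfiguration());
     }
}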

---

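import org.apache.xerces.parsers.StandardParserConfiguration;
import org.apache.xerces.xni.parser.XMLDocumentScanner;

/** Parser configuration that swaps in the DOCTYPE-skipping document scanner. */
public class FragmentDoctypeSkipConfiguration extends
     StandardParserConfiguration
{
   protected XMLDocumentScanner createDocumentScanner()
   {
     return new FragmentDoctypeSkipScannerImpl();
   }
}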

---

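import java.io.IOException;

import org.apache.xerces.util.XMLChar;
import org.apache.xerces.util.XMLStringBuffer;
import org.apache.xerces.xni.XMLString;
import org.apache.xerces.xni.XNIException;

/** Document scanner that skips a DOCTYPE declaration found in content instead of reporting it. */
public class FragmentDoctypeSkipScannerImpl extends
     org.apache.xerces.impl.XMLDocumentScannerImpl
{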

   /** Creates a content dispatcher. */
   protected Dispatcher createContentDispatcher()
   {
     return new FDSContentDispatcher();
   }

   protected class FDSContentDispatcher extends ContentDispatcher
   {
     protected boolean scanForDoctypeHook() throws IOException, XNIException
     {
       XMLString xString = new XMLString();
       if (fEntityScanner.skipString("DOCTYPE")) {


         // spaces
         if (!fEntityScanner.skipSpaces()) {
           reportFatalError(
               "MSG_SPACE_REQUIRED_BEFORE_ROOT_ELEMENT_TYPE_IN_DOCTYPEDECL",
               null);
         }
         // root element name
         String doctypeName = fEntityScanner.scanName();

         String publicId = null;

         // external id
         if (fEntityScanner.skipSpaces()) {
           // scanExternalID(dtIds, false);

           if (fEntityScanner.skipString("PUBLIC")) {
             if (!fEntityScanner.skipSpaces()) {
               reportFatalError("SpaceRequiredAfterPUBLIC", null);
             }
             scanPubidLiteral(xString);
             publicId = xString.toString();

           }

           if (publicId != null || fEntityScanner.skipString("SYSTEM")) {
             if (publicId == null && !fEntityScanner.skipSpaces()) {
               reportFatalError("SpaceRequiredAfterSYSTEM", null);
             }
             fEntityScanner.skipSpaces();
             int quote = fEntityScanner.peekChar();
             if (quote != '\'' && quote != '"') {
               reportFatalError("QuoteRequiredInSystemID", null);
             }
             fEntityScanner.scanChar();
             if (fEntityScanner.scanLiteral(quote, xString) != quote) {
               XMLStringBuffer xsb = new XMLStringBuffer();
               xsb.clear();
               do {
                 xsb.append(xString);
                 int c = fEntityScanner.peekChar();
                 if (XMLChar.isMarkup(c) || c == ']') {
                   xsb.append((char) fEntityScanner.scanChar());
                 }
               } while (fEntityScanner.scanLiteral(quote, xString) != quote);
               xsb.append(xString);
               xString = xsb;
             }
             if (!fEntityScanner.skipChar(quote)) {
               reportFatalError("SystemIDUnterminated", null);
             }
           }
           fEntityScanner.skipSpaces();
         }

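         if (fEntityScanner.skipChar('[')) {
           // has internal subset: naive skip to the next ']' (a ']' inside
           // a quoted literal or a comment would end the skip too early)
           int c;
           while ((c = fEntityScanner.scanChar()) != ']') {
             if (c == -1) {
               // defensive: do not spin if the scanner ever reports -1
               reportFatalError("DoctypedeclUnterminated",
                   new Object[] { doctypeName });
               break;
             }
           }

         }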
         fEntityScanner.skipSpaces();
         if (!fEntityScanner.skipChar('>')) {
           reportFatalError("DoctypedeclUnterminated",
               new Object[] { doctypeName });
         }
         fEntityScanner.skipSpaces();
         setScannerState(SCANNER_STATE_CONTENT);

         // undo SCANNER_STATE_START_OF_MARKUP: {    fMarkupDepth++;

         fMarkupDepth--;
         return true;
       }
       return false;

     }
   }
}

I tested this code and it works as intended.

What do you think of this approach? Is scanForDoctypeHook() 
implemented correctly?

It seems to me that it might be easier to use this parser later for our 
XSL conversions too.

Greetings from Heidelberg

Wulf



Michael Glavassevich wrote:
> Hi Wulf,
> 
> Wulf Berschin <be...@dosco.de> wrote on 09/16/2009 02:48:52 AM:
> 
>  > Hi,
>  >
>  > For ease of editing we have a DOCTYPE declaration in each (file)
>  > fragment. When I parse the full master (with the fragments resolved), Xerces
>  > throws a fatal error (DOCTYPE not allowed in content) and goes into an
>  > endless loop when the continue-after-fatal-error feature is set.
>  >
>  > How can I make Xerces ignore DOCTYPE declarations occurring in content
>  > (or rather, in the header of file entities)?
> 
> You can't. Xerces (or any conformant XML parser for that matter) will 
> not ignore or skip over any malformed / misplaced constructs in the 
> document. Parsers are required to report the fatal error. The 
> "continue-after-fatal-error" feature which allows Xerces to keep going 
> is unreliable and can lead to a catastrophic failure (e.g. NPE, infinite 
> loop, stack overflow, out of memory, etc...) if you turn it on. It's to 
> be used with extreme caution and should never be enabled in a finished 
> component / product.
> 
> You either need to remove these DOCTYPEs from the files or filter them 
> out at a lower level (e.g. a wrapper InputStream which doesn't return 
> the DOCTYPE from read()).
> 
> Thanks.
> 
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: mrglavas@ca.ibm.com
> E-mail: mrglavas@apache.org
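
The lower-level filtering suggested in the quoted reply (dropping the DOCTYPE before the parser ever sees it) could look roughly like the following. This is a minimal sketch under several assumptions: the fragment has already been decoded to a String, the declaration sits in the prolog (after an optional XML declaration, comments or processing instructions), and the internal subset, if any, contains no comments with stray quotes. `DoctypeStripper` is a hypothetical name, not a Xerces class.

```java
final class DoctypeStripper {
    private DoctypeStripper() {}

    /** Returns the text with a DOCTYPE declaration in the prolog removed, if present. */
    static String strip(String xml) {
        // Skip a leading byte order mark that survived decoding.
        int i = (!xml.isEmpty() && xml.charAt(0) == '\uFEFF') ? 1 : 0;
        while (i < xml.length()) {
            if (Character.isWhitespace(xml.charAt(i))) {
                i++;
            } else if (xml.startsWith("<?", i)) {
                i = skipPast(xml, "?>", i);      // XML declaration or PI
            } else if (xml.startsWith("<!--", i)) {
                i = skipPast(xml, "-->", i);     // comment
            } else if (xml.startsWith("<!DOCTYPE", i)) {
                int end = endOfDoctype(xml, i);
                // Leave unterminated declarations for the parser to report.
                return end < 0 ? xml : xml.substring(0, i) + xml.substring(end);
            } else {
                break;                           // first element or text: prolog is over
            }
        }
        return xml;
    }

    private static int skipPast(String s, String terminator, int from) {
        int at = s.indexOf(terminator, from);
        return at < 0 ? s.length() : at + terminator.length();
    }

    /** Index just after the '>' that closes the declaration, or -1 if unterminated. */
    private static int endOfDoctype(String s, int from) {
        char quote = 0;   // active quote character inside a literal, 0 if none
        int depth = 0;    // greater than 0 while inside the [ ... ] internal subset
        for (int i = from; i < s.length(); i++) {
            char c = s.charAt(i);
            if (quote != 0) {
                if (c == quote) quote = 0;
            } else if (c == '"' || c == '\'') {
                quote = c;
            } else if (c == '[') {
                depth++;
            } else if (c == ']') {
                depth--;
            } else if (c == '>' && depth == 0) {
                return i + 1;
            }
        }
        return -1;
    }
}
```

An EntityResolver could then hand the cleaned text to the parser as `new InputSource(new StringReader(DoctypeStripper.strip(text)))`. Because only the prolog is scanned, text that merely looks like a DOCTYPE inside element content or a CDATA section is left alone.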

Re: Doctype declarations in fragments

Posted by Wulf Berschin <be...@dosco.de>.

Hi Michael,

I think our scenario is not uncommon: a modular DTD, documents with 
several hundred pages (XML size 2 MB), possibly many authors working on 
the same document, which is stored and can be retrieved as fragments either 
in the DMS (where we could solve this problem at the file level) or in the 
file system. This is where the problem arises: we have a master document 
that pulls in its fragments via file entities.

Since the document is so big, the author usually works on fragments 
(that's why the fragments have a DOCTYPE too). Only for reference 
checking does he need to open the master. BTW: our XML editor, Epic, has 
always largely ignored the fragment DOCTYPEs. (Optionally it uses 
proprietary comments instead of real DOCTYPEs for fragments.)

We will use the XNI configuration mentioned earlier with a customized 
scanForDoctypeHook. Please consider adopting this configuration in 
Xerces.

Greetings

Wulf

Re: Doctype declarations in fragments

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

Hi Wulf,

I didn't say it would be easy, just that you're on shaky ground if your
solution involves hooking into or extending Xerces' internals.

There might be other ways to deal with this, for example using XInclude
instead of entity references and/or removing the DOCTYPEs from the files
and programmatically inserting them when appropriate through
EntityResolver2.getExternalSubset() [1], though I don't know much about
your scenario and how much flexibility you have with changing the data.

Thanks.

[1]
http://xerces.apache.org/xerces2-j/javadocs/api/org/xml/sax/ext/EntityResolver2.html

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org
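
The EntityResolver2 idea mentioned above (fragments stored without a DOCTYPE, with the DTD supplied programmatically when a fragment is parsed on its own) might be sketched as follows. This only shows the resolver side. `ModularDtdResolver`, the DTD location and the root element names are hypothetical, and whether `getExternalSubset` is actually consulted depends on the parser supporting the SAX2 extension.

```java
import java.util.Arrays;
import java.util.List;
import org.xml.sax.InputSource;
import org.xml.sax.ext.DefaultHandler2;

final class ModularDtdResolver extends DefaultHandler2 {
    private final String dtdSystemId;      // hypothetical location of the shared DTD
    private final List<String> rootNames;  // root elements that should get the DTD

    ModularDtdResolver(String dtdSystemId, String... rootNames) {
        this.dtdSystemId = dtdSystemId;
        this.rootNames = Arrays.asList(rootNames);
    }

    /**
     * Asked by EntityResolver2-aware parsers for documents that carry no
     * DOCTYPE of their own; returning null means "no external subset".
     */
    @Override
    public InputSource getExternalSubset(String name, String baseURI) {
        return rootNames.contains(name) ? new InputSource(dtdSystemId) : null;
    }
}
```

The resolver would be registered with `XMLReader.setEntityResolver(...)`. SAX defines the feature `http://xml.org/sax/features/use-entity-resolver2` to control whether the extended interface is used.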

Wulf Berschin <be...@dosco.de> wrote on 10/01/2009 07:26:41 AM:

> Hi Michael,
>
> I'm just trying to follow your approach and create an InputStream
> wrapper class which removes the DOCTYPE declaration. AFAIK that means:
>
> - write a custom EntityResolver which resolves file entities
>
> - write a DOCTYPE-removing reader which creates a FileInputStream for
> the resolved file entity, analyzes the header (BOM, XML declaration) to
> set up an InputStreamReader with the correct encoding, and then skips a
> DOCTYPE declaration if one follows.
>
> Is that correct? To avoid using XNI as described in my last
> mail I would have to duplicate parser functionality, hmmm.
>
> Or is there a less invasive way to hook into the parser?
>
> Thank you for your help!
>
> Wulf
>
>
>
> Michael Glavassevich wrote:
> > Hi Wulf,
> >
> > Wulf Berschin <be...@dosco.de> wrote on 09/16/2009 02:48:52 AM:
> >
> >  > Hi,
> >  >
> >  > for ease of editing we have a doctype declaration in each (file)
> >  > fragment. When I parse the full master (with resolving fragments) Xerces
> >  > throws a fatal error (Doctype not allowed in content) and goes in an
> >  > endless loop when setting this continue-after-fatal-error switch.
> >  >
> >  > How can I make Xerces ignore doctype declarations occurring in content
> >  > (alt. in the header of file entities)?
> >
> > You can't. Xerces (or any conformant XML parser for that matter) will
> > not ignore or skip over any malformed / misplaced constructs in the
> > document. Parsers are required to report the fatal error. The
> > "continue-after-fatal-error" feature which allows Xerces to keep going
> > is unreliable and can lead to a catastrophic failure (e.g. NPE, infinite
> > loop, stack overflow, out of memory, etc...) if you turn it on. It's to
> > be used with extreme caution and should never be enabled in a finished
> > component / product.
> >
> > You either need to remove these DOCTYPEs from the files or filter them
> > out at a lower level (e.g. a wrapper InputStream which doesn't return
> > the DOCTYPE from read()).
> >
> >  > Wulf
> >  >
> >  >
> >  > ---------------------------------------------------------------------
> >  > To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
> >  > For additional commands, e-mail: j-users-help@xerces.apache.org
> >
> > Thanks.
> >
> > Michael Glavassevich
> > XML Parser Development
> > IBM Toronto Lab
> > E-mail: mrglavas@ca.ibm.com
> > E-mail: mrglavas@apache.org
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
> For additional commands, e-mail: j-users-help@xerces.apache.org

Re: Doctype declarations in fragments

Posted by Wulf Berschin <be...@dosco.de>.

Hi Michael,

I'm trying to follow your approach by creating an InputStream
wrapper class which removes the doctype declaration. AFAIK that means:

- write our own EntityResolver which resolves file entities

- write a doctype-removing reader which creates a FileInputStream for
the resolved file entity, analyzes the header (BOM, XML declaration) to
set up an InputStreamReader with the correct encoding, and then skips a
doctype declaration if one follows.
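
The two steps above can be sketched roughly as follows. This is an
illustration under assumptions, not a finished implementation: the class
name DoctypeStrippingResolver is hypothetical, an in-memory map stands in
for the files on disk, and encoding detection (BOM, XML declaration) is
left out because the content is already available as characters. The
DOCTYPE is only removed when it appears before the first element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical resolver that hands the parser fragment text without its DOCTYPE.
final class DoctypeStrippingResolver implements EntityResolver {

    // Assumption for this sketch: file name -> file content, instead of real files.
    private final Map<String, String> files;

    DoctypeStrippingResolver(Map<String, String> files) { this.files = files; }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String name = systemId.substring(systemId.lastIndexOf('/') + 1);
        String content = files.get(name);
        if (content == null) return null; // unknown entity: default resolution
        InputSource src = new InputSource(new StringReader(stripDoctype(content)));
        src.setSystemId(systemId);
        return src;
    }

    // Removes a DOCTYPE declaration found in the prolog; other text is untouched.
    static String stripDoctype(String xml) {
        int i = xml.startsWith("\uFEFF") ? 1 : 0; // skip a BOM character if present
        while (i < xml.length()) {
            if (Character.isWhitespace(xml.charAt(i))) i++;
            else if (xml.startsWith("<?", i)) i = skipPast(xml, i + 2, "?>");
            else if (xml.startsWith("<!--", i)) i = skipPast(xml, i + 4, "-->");
            else if (xml.startsWith("<!DOCTYPE", i))
                return xml.substring(0, i) + xml.substring(endOfDoctype(xml, i + 9));
            else break; // first element or text reached: no DOCTYPE in the prolog
        }
        return xml;
    }

    private static int skipPast(String s, int from, String end) {
        int k = s.indexOf(end, from);
        return k < 0 ? s.length() : k + end.length();
    }

    // Finds the index just after the '>' that closes the DOCTYPE, honoring
    // quoted literals and an internal subset in square brackets.
    private static int endOfDoctype(String s, int from) {
        int depth = 0;
        int i = from;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '"' || c == '\'') {
                int close = s.indexOf(c, i + 1);
                i = close < 0 ? s.length() : close + 1;
            } else if (depth > 0 && s.startsWith("<!--", i)) i = skipPast(s, i + 4, "-->");
            else if (depth > 0 && s.startsWith("<?", i)) i = skipPast(s, i + 2, "?>");
            else if (c == '[') { depth++; i++; }
            else if (c == ']') { depth--; i++; }
            else if (c == '>' && depth == 0) return i + 1;
            else i++;
        }
        return s.length();
    }

    // Parses the master document and returns the element names in document order.
    static List<String> elementNames(String master, Map<String, String> files) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(new DoctypeStrippingResolver(files));
        List<String> names = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                names.add(qName);
            }
        });
        InputSource in = new InputSource(new StringReader(master));
        in.setSystemId("file:///book/master.xml"); // hypothetical base for relative system IDs
        reader.parse(in);
        return names;
    }
}
```

With real files the resolver would open the stream itself, and the encoding
question raised above comes back; the sketch does not answer it.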

Is that correct? To avoid using XNI as described in my last
mail I would have to duplicate parser functionality, hmmm.

Or is there a less invasive way to hook into the parser?

Thank you for your help!

Wulf



Michael Glavassevich wrote:
> Hi Wulf,
> 
> Wulf Berschin <be...@dosco.de> wrote on 09/16/2009 02:48:52 AM:
> 
>  > Hi,
>  >
>  > for ease of editing we have a doctype declaration in each (file)
>  > fragment. When I parse the full master (with resolving fragments) Xerces
>  > throws a fatal error (Doctype not allowed in content) and goes in an
>  > endless loop when setting this continue-after-fatal-error switch.
>  >
>  > How can I make Xerces ignore doctype declarations occurring in content
>  > (alt. in the header of file entities)?
> 
> You can't. Xerces (or any conformant XML parser for that matter) will 
> not ignore or skip over any malformed / misplaced constructs in the 
> document. Parsers are required to report the fatal error. The 
> "continue-after-fatal-error" feature which allows Xerces to keep going 
> is unreliable and can lead to a catastrophic failure (e.g. NPE, infinite 
> loop, stack overflow, out of memory, etc...) if you turn it on. It's to 
> be used with extreme caution and should never be enabled in a finished 
> component / product.
> 
> You either need to remove these DOCTYPEs from the files or filter them 
> out at a lower level (e.g. a wrapper InputStream which doesn't return 
> the DOCTYPE from read()).
> 
>  > Wulf
>  >
>  > ---------------------------------------------------------------------
>  > To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
>  > For additional commands, e-mail: j-users-help@xerces.apache.org
> 
> Thanks.
> 
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: mrglavas@ca.ibm.com
> E-mail: mrglavas@apache.org
> 




Re: Doctype declarations in fragments

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

Hi Wulf,

Wulf Berschin <be...@dosco.de> wrote on 09/16/2009 02:48:52 AM:

> Hi,
>
> for ease of editing we have a doctype declaration in each (file)
> fragment. When I parse the full master (with resolving fragments) Xerces
> throws a fatal error (Doctype not allowed in content) and goes in an
> endless loop when setting this continue-after-fatal-error switch.
>
> How can I make Xerces ignore doctype declarations occurring in content
> (alt. in the header of file entities)?

You can't. Xerces (or any conformant XML parser for that matter) will not
ignore or skip over any malformed / misplaced constructs in the document.
Parsers are required to report the fatal error. The
"continue-after-fatal-error" feature which allows Xerces to keep going is
unreliable and can lead to a catastrophic failure (e.g. NPE, infinite loop,
stack overflow, out of memory, etc...) if you turn it on. It's to be used
with extreme caution and should never be enabled in a finished component /
product.
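
As an illustration of the behavior described above, here is a small sketch
that leaves "continue-after-fatal-error" off and simply reports the first
fatal error. The class and method names are hypothetical; it uses the
standard JAXP/SAX API, and DefaultHandler rethrows fatal errors as
SAXParseException.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: checks a piece of XML text for a fatal error.
final class FatalErrorCheck {

    // Returns null if the text parses cleanly, otherwise a description
    // of the first fatal error reported by the parser.
    static String firstFatalError(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            // Parsing stops here; nothing after the error is processed.
            return "line " + e.getLineNumber() + ": " + e.getMessage();
        }
    }
}
```

A DOCTYPE that ends up inside element content, which is what an included
fragment with its own DOCTYPE amounts to, makes this method return an
error description instead of null.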

You either need to remove these DOCTYPEs from the files or filter them out
at a lower level (e.g. a wrapper InputStream which doesn't return the
DOCTYPE from read()).

> Wulf
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
> For additional commands, e-mail: j-users-help@xerces.apache.org

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org