Posted to general@xerces.apache.org by Sandy Walsh <sw...@infocast-corp.com> on 2000/02/25 20:28:57 UTC
Trouble parsing HTML4.DTD
Please, need help. I'm sure this is a common request, but I'm obviously
missing something.
I'm looking for a simple way to parse HTML and am running into two issues.
First, I can't find a DTD for HTML that Xerces likes, and second, even
without validation it won't parse the most basic HTML (attached
Readme.html).
I tacked the XML header onto the Readme.html file, but it doesn't
work either way.
Anyone attempt this before?
-Sandy
--- reproduction steps below ---
Using XERCES-C_1_1_0_D05-WIN32 under msdev v6 sp1, Windows 98
Running the DOMCount.exe example (both the distributed exe and a recompiled
debug build), I get an assert failure in DOMCOUNT.EXE at DBGDEL.CPP line 47:
_BLOCK_TYPE_IS_VALID(pHead->nBlockUse)
First of all, the parser seems to puke on the following expressions:
+ Any comments inside a declaration ... for example
<!ENTITY % ContentType "CDATA"
-- media type, as per [RFC2045]
-->
Seems to kill it on the -- media type ... -- portion, same with any of
the header comments in the file.
The loosehtml4.xml file is attached ... see if it happens for you.
The non-debug version of SAXCount does the same.
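
A note on the comment failure described above: the "-- ... --" form inside a declaration is SGML comment syntax, and XML only accepts comments as standalone <!-- ... --> constructs, so an XML parser is expected to reject it. Assuming loosehtml4.xml was derived from the SGML HTML 4 DTD, the sketch below shows one way to locate such inline comments before handing the file to a parser. The function name findInlineSgmlComments is hypothetical and not part of Xerces; this is a rough scanner under those assumptions, not a full DTD parser.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not a Xerces API): returns the offsets of SGML-style
// "-- ... --" comments that occur inside a markup declaration such as
// <!ENTITY ...> or <!ELEMENT ...>. Standalone <!-- ... --> comments and
// quoted literals are skipped, since "--" is harmless there.
std::vector<std::size_t> findInlineSgmlComments(const std::string& dtd) {
    std::vector<std::size_t> hits;
    std::size_t i = 0;
    while (i < dtd.size()) {
        if (dtd.compare(i, 4, "<!--") == 0) {
            // Standalone comment: legal in XML, skip past its end.
            std::size_t end = dtd.find("-->", i + 4);
            if (end == std::string::npos) break;
            i = end + 3;
        } else if (dtd.compare(i, 2, "<!") == 0) {
            // Markup declaration: scan up to the closing '>'.
            std::size_t j = i + 2;
            char quote = 0;          // active quote character, 0 if none
            bool inComment = false;  // inside an inline "-- ... --" comment
            while (j < dtd.size()) {
                const bool dashes = dtd.compare(j, 2, "--") == 0;
                if (inComment) {
                    if (dashes) { inComment = false; j += 2; } else { ++j; }
                } else if (quote != 0) {
                    if (dtd[j] == quote) quote = 0;
                    ++j;
                } else if (dtd[j] == '"' || dtd[j] == '\'') {
                    quote = dtd[j];
                    ++j;
                } else if (dashes) {
                    hits.push_back(j);  // start of an inline SGML comment
                    inComment = true;
                    j += 2;
                } else if (dtd[j] == '>') {
                    break;              // end of this declaration
                } else {
                    ++j;
                }
            }
            i = j + 1;
        } else {
            ++i;
        }
    }
    return hits;
}
```

Called on the text of the ContentType declaration quoted above, this should report a single offset pointing at the first "--". Moving each reported comment out into its own <!-- ... --> block (or deleting it) is one possible workaround, though an SGML DTD may use other constructs that XML also rejects.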