Posted to general@xerces.apache.org by Sandy Walsh <sw...@infocast-corp.com> on 2000/02/25 20:28:57 UTC

Trouble parsing HTML4.DTD

Please, need help. I'm sure this is a common request, but I'm obviously
missing something.

I'm looking for a simple way to parse HTML and am running into two issues.
First, I can't find a DTD for HTML that Xerces likes, and second, even
with validation turned off it won't parse the most basic HTML (attached
Readme.html).

I tacked the XML header onto the Readme.html file, but it doesn't work
with or without it.

Anyone attempt this before?

-Sandy

--- reproduction steps below ---

Using XERCES-C_1_1_0_D05-WIN32 under msdev v6 sp1, Windows 98

Trying the DOMCount.exe example (both the distributed exe and a
recompiled debug build)

I get an assert failure in DOMCOUNT.EXE at DBGDEL.CPP line 47:

_BLOCK_TYPE_IS_VALID(pHead->nBlockUse)

First of all, the parser seems to puke on the following expressions:
+ Any comments, for example:

<!ENTITY % ContentType "CDATA"
    -- media type, as per [RFC2045]
    -->

It seems to die on the -- media type ... -- portion, and the same
happens with any of the header comments in the file.
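
To make the failing construct concrete: the HTML 4 DTD is written in SGML,
which allows a "-- ... --" comment inside a declaration such as <!ENTITY ...>,
whereas XML 1.0 only allows <!-- ... --> comments outside other markup. If
that is what trips the parser, one workaround might be to strip the
in-declaration comments before feeding the DTD to Xerces. The sketch below is
only an illustration under that assumption; stripDeclComments and selfCheck
are hypothetical helpers, not part of Xerces, and removing these comments does
not by itself turn the SGML DTD into a valid XML DTD.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a Xerces API): removes SGML-style "-- ... --"
// comments found inside markup declarations such as <!ENTITY ...>.
// Standalone <!-- ... --> comments and quoted literals are copied unchanged.
std::string stripDeclComments(const std::string& in)
{
    std::string out;
    const std::size_t n = in.size();
    std::size_t i = 0;
    while (i < n) {
        // Standalone comment: copy verbatim up to and including "-->".
        if (in.compare(i, 4, "<!--") == 0) {
            std::size_t end = in.find("-->", i + 4);
            end = (end == std::string::npos) ? n : end + 3;
            out.append(in, i, end - i);
            i = end;
            continue;
        }
        // Markup declaration: walk to the closing '>' and drop "-- ... --".
        if (in.compare(i, 2, "<!") == 0) {
            out.append("<!");
            i += 2;
            while (i < n && in[i] != '>') {
                const char c = in[i];
                if (c == '"' || c == '\'') {
                    // Quoted literal: copy verbatim, including both quotes.
                    std::size_t close = in.find(c, i + 1);
                    close = (close == std::string::npos) ? n : close + 1;
                    out.append(in, i, close - i);
                    i = close;
                } else if (in.compare(i, 2, "--") == 0) {
                    // In-declaration comment: skip through the closing "--".
                    const std::size_t close = in.find("--", i + 2);
                    i = (close == std::string::npos) ? n : close + 2;
                } else {
                    out.push_back(c);
                    ++i;
                }
            }
            continue; // the closing '>' is copied by the next iteration
        }
        out.push_back(in[i]);
        ++i;
    }
    return out;
}

// Small self-check using the declaration quoted above.
void selfCheck()
{
    const std::string decl =
        "<!ENTITY % ContentType \"CDATA\"\n"
        "    -- media type, as per [RFC2045]\n"
        "    -->";
    assert(stripDeclComments(decl) ==
           "<!ENTITY % ContentType \"CDATA\"\n    >");
}
```

Calling selfCheck() runs the helper on the ContentType declaration and checks
that only the "-- media type ... --" part is dropped.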

The loosehtml4.xml file is attached ... see if it happens for you.

The non-debug version of SAXCount does the same.