Posted to j-dev@xerces.apache.org by Dan Bennett <da...@interati.co.uk> on 2000/07/06 20:00:04 UTC
RE: Using Xerces to parse HTML
To put it simply, HTML is defined using SGML. SGML is not XML; for instance,
empty elements in SGML do not have a '/' before the '>' (such as the META tag
in your sample). Thus, an XML parser cannot parse HTML.
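To make the point concrete, here is a minimal sketch, assuming the JAXP SAX parser bundled with a current JDK rather than the Xerces/HTMLBuilder setup from the original question. The class name MetaTagDemo and the helper isWellFormed are made up for illustration. It feeds the parser the sample document twice, once with the META tag written HTML-style and once with an XML-style '/>'.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of Xerces or the original post.
public class MetaTagDemo {

    // Returns true if an XML parser accepts the markup, false if it
    // reports a well-formedness error.
    static boolean isWellFormed(String markup) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            // DefaultHandler rethrows fatal errors such as a mismatched end tag.
            return false;
        } catch (Exception e) {
            // Parser setup or I/O problem, unrelated to the markup itself.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String meta = "<meta http-equiv=\"Content-Type\" "
                    + "content=\"text/html; charset=iso-8859-1\"";
        String prefix = "<html><head><title>Liors document</title>";
        String suffix = "</head></html>";

        // HTML style: META is never closed, so </head> does not match it.
        System.out.println(isWellFormed(prefix + meta + ">" + suffix));

        // XML style: the empty element is closed with "/>".
        System.out.println(isWellFormed(prefix + meta + "/>" + suffix));
    }
}
```

Only the second variant should be accepted, which is why an unmodified HTML page usually trips up an XML parser.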
Have you considered using the parser that ships with JDK 1.2 (e.g.
javax.swing.text.html.parser.DocumentParser)?
Cheers
Dan Bennett
www.interati.co.uk
-----Original Message-----
From: lior@ecommony.com [mailto:lior@ecommony.com]
Sent: 06 July 2000 19:41
To: xerces-j-dev@xml.apache.org
Subject: Using Xerces to parse HTML
I'm trying to parse an HTML file using the Xerces SAX parser, with
HTMLBuilder as the document handler, and I'm getting weird errors (see below).
Has anybody done this? Can you tell me what I'm doing wrong?
=======================================================================
HTML:
<html>
<head>
<title>Liors document</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head> ...
Error message:
org.xml.sax.SAXException: HTM008 State error: mismatch in closing tag name title
title
======================================================================
thanks,
Lior Shapira
Software Engineer
eCommony Inc.
lior@ecommony.com
+972 (52) 438414
http://www.ecommony.com
RE: Using Xerces to parse HTML
Posted by Dan Bennett <da...@interati.co.uk>.
Eric,
In answer to your questions:
a) See http://forum.java.sun.com/forum?13@@.eec5e2c/0
b) 3.2 I believe.
c) I think so
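For readers without access to the linked forum post, here is a minimal sketch of one way to drive the Swing HTML parser. It assumes the public entry point javax.swing.text.html.parser.ParserDelegator, which wraps DocumentParser with the JDK's default DTD, and a callback that merely records events. The class name SwingHtmlParseDemo and the helper parseEvents are made up for illustration and do not come from the thread.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class; not part of the JDK or the original post.
public class SwingHtmlParseDemo {

    // Parses the HTML and returns the callback events as strings,
    // in the order the parser reported them.
    static List<String> parseEvents(String html) {
        final List<String> events = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                events.add("start:" + t);
            }
            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                events.add("end:" + t);
            }
            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                events.add("simple:" + t); // empty elements such as META
            }
            @Override
            public void handleText(char[] data, int pos) {
                events.add("text:" + new String(data));
            }
        };
        try {
            // true = ignore any charset declared in a META tag
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Liors document</title>"
            + "<meta http-equiv=\"Content-Type\" "
            + "content=\"text/html; charset=iso-8859-1\">"
            + "</head><body><p>Hello</p></body></html>";
        for (String event : parseEvents(html)) {
            System.out.println(event);
        }
    }
}
```

The callback is event-based, much like a SAX handler, so building a tree from it would mean collecting these events yourself. The unclosed META tag should arrive through handleSimpleTag instead of causing an error.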
This is a bit off topic for the Xerces group. If you need more help, try the
Java Developer Connection at
http://developer.java.sun.com/developer/index.html
Regards
Dan Bennett
www.interati.co.uk
-----Original Message-----
From: Eric SCHAEFFER [mailto:eschaeffer@posterconseil.com]
Sent: 07 July 2000 09:42
To: xerces-j-dev@xml.apache.org
Subject: Re: Using Xerces to parse HTML
How do you use it (I don't understand the javadoc)? What HTML version
does it support? Can you build a DOM (or SAX) representation of the
document?
Thanks
Eric.
Re: Using Xerces to parse HTML
Posted by Eric SCHAEFFER <es...@posterconseil.com>.
How do you use it (I don't understand the javadoc)? What HTML version does it support? Can you build a DOM (or SAX) representation of the document?
Thanks
Eric.