Posted to j-dev@xerces.apache.org by Dan Bennett <da...@interati.co.uk> on 2000/07/06 20:00:04 UTC

RE: Using Xerces to parse HTML

To put it simply, HTML is defined using SGML, and SGML is not XML. For
instance, empty elements in SGML do not have a '/' before the '>' (such as
the META tag in your sample). Thus, an XML parser cannot parse HTML unless
the document also happens to be well-formed XML.
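
To illustrate the point about the META tag, here is a minimal sketch that
feeds two variants of the sample to the JDK's default SAX parser: one with
the META tag left open, as in HTML, and one with it closed XML-style using
'/>'. The class and method names (WellFormedCheck, isWellFormed) are made up
for this example, and it uses the JAXP parser bundled with current JDKs
rather than the Xerces/HTMLBuilder setup from the original question.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Xerces: checks XML well-formedness only.
class WellFormedCheck {

    // Returns true if the default SAX parser accepts the text as well-formed XML.
    static boolean isWellFormed(String text) {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler()); // DefaultHandler rethrows fatal errors
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the input
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String head = "<html><head><title>Liors document</title>";
        String meta = "<meta http-equiv=\"Content-Type\""
                    + " content=\"text/html; charset=iso-8859-1\"";
        String tail = "</head></html>";

        // HTML style: META is never closed, so </head> does not match.
        System.out.println("open META:   " + isWellFormed(head + meta + ">" + tail));
        // XML style: the empty element is closed with "/>".
        System.out.println("closed META: " + isWellFormed(head + meta + "/>" + tail));
    }
}
```

The only difference between the two inputs is the '/' before the '>' of the
META tag.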

Have you considered using the parser that ships with JDK 1.2 (e.g.
javax.swing.text.html.parser.DocumentParser)?
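
As a rough sketch of how the Swing parser can be driven: ParserDelegator (in
the same package) wraps DocumentParser and reports what it finds through an
HTMLEditorKit.ParserCallback. The class and method names below
(SwingHtmlEvents, events) are invented for illustration, and the event
strings are just one way to make the callbacks visible.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class: records the callbacks fired by the Swing HTML parser.
class SwingHtmlEvents {

    static List<String> events(String html) {
        final List<String> out = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                out.add("start " + tag);
            }
            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                out.add("end " + tag);
            }
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                out.add("empty " + tag); // empty elements such as META
            }
            @Override
            public void handleText(char[] data, int pos) {
                out.add("text " + new String(data));
            }
        };
        try {
            // true = ignore the charset declared in META instead of
            // throwing ChangedCharSetException.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Liors document</title>"
            + "<meta http-equiv=\"Content-Type\""
            + " content=\"text/html; charset=iso-8859-1\"></head></html>";
        for (String event : events(html)) {
            System.out.println(event);
        }
    }
}
```

The unclosed META tag from the sample is reported through handleSimpleTag
rather than causing an error.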

Cheers

Dan Bennett
www.interati.co.uk

  -----Original Message-----
  From: lior@ecommony.com [mailto:lior@ecommony.com]
  Sent: 06 July 2000 19:41
  To: xerces-j-dev@xml.apache.org
  Subject: Using Xerces to parse HTML


  I'm trying to parse an HTML file using the Xerces SAX parser, with
HTMLBuilder as the document handler, and I'm getting weird errors (see
below). Has anybody done this? Can you tell me what I'm doing wrong?

  =======================================================================
  HTML:
  <html>
  <head>
  <title>Liors document</title>
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
  </head> ...

  Error message:
  org.xml.sax.SAXException: HTM008 State error: mismatch in closing tag name title
  title
  ======================================================================

  thanks,
  Lior Shapira
  Software Engineer
  eCommony Inc.
  lior@ecommony.com
  +972 (52) 438414
  http://www.ecommony.com


RE: Using Xerces to parse HTML

Posted by Dan Bennett <da...@interati.co.uk>.
Eric,

In answer to your questions:

a) See http://forum.java.sun.com/forum?13@@.eec5e2c/0
b) HTML 3.2, I believe.
c) I think so.
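
On point (c), one possible approach, sketched here under assumptions rather
than taken from the linked forum post, is to create an empty
org.w3c.dom.Document and fill it from the Swing parser's callbacks. The names
SwingHtmlToDom and toDom are hypothetical. The sketch assumes the parser
reports a single root element and that tag and attribute names are valid XML
names; it skips comments and does nothing special for unknown tags.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Enumeration;
import javax.swing.text.AttributeSet;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical sketch: builds a W3C DOM tree from Swing HTML parser callbacks.
class SwingHtmlToDom {

    static Document toDom(String html) {
        try {
            final Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            final Deque<Node> open = new ArrayDeque<>(); // currently open nodes
            open.push(doc);

            HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
                // Creates an element, copies its attributes, attaches it to the open node.
                private Element element(HTML.Tag tag, AttributeSet attrs) {
                    Element e = doc.createElement(tag.toString());
                    Enumeration<?> names = attrs.getAttributeNames();
                    while (names.hasMoreElements()) {
                        Object name = names.nextElement();
                        e.setAttribute(name.toString(),
                                       String.valueOf(attrs.getAttribute(name)));
                    }
                    open.peek().appendChild(e);
                    return e;
                }
                @Override
                public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                    open.push(element(tag, attrs));
                }
                @Override
                public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                    element(tag, attrs); // empty element: nothing to keep open
                }
                @Override
                public void handleEndTag(HTML.Tag tag, int pos) {
                    if (open.size() > 1) {
                        open.pop();
                    }
                }
                @Override
                public void handleText(char[] data, int pos) {
                    if (open.peek() != doc) { // a Document cannot hold text directly
                        open.peek().appendChild(doc.createTextNode(new String(data)));
                    }
                }
            };
            new ParserDelegator().parse(new StringReader(html), callback, true);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling SwingHtmlToDom.toDom(...) on the sample from the original question
gives a Document whose elements can then be inspected with the usual DOM
methods such as getElementsByTagName.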

This is a bit off topic for the Xerces group. If you need more help, try the
Java Developer Connection at
http://developer.java.sun.com/developer/index.html

Regards

Dan Bennett
www.interati.co.uk

  -----Original Message-----
  From: Eric SCHAEFFER [mailto:eschaeffer@posterconseil.com]
  Sent: 07 July 2000 09:42
  To: xerces-j-dev@xml.apache.org
  Subject: Re: Using Xerces to parse HTML


  How do you use it (I don't understand the javadoc)? What HTML version
does it support? Can you build a DOM (or SAX) representation of the
document?

  Thanks,
  Eric.



Re: Using Xerces to parse HTML

Posted by Eric SCHAEFFER <es...@posterconseil.com>.
How do you use it (I don't understand the javadoc)? What HTML version does
it support? Can you build a DOM (or SAX) representation of the document?

Thanks,
Eric.
