Posted to general@xml.apache.org by Andy Clark <an...@apache.org> on 2002/02/09 04:15:36 UTC

[ANNOUNCE] Xerces HTML Parser

For a long time users have asked if Xerces can parse HTML files. 
But since most HTML documents are not well-formed XML documents, 
it is generally not possible to use a conforming XML parser to 
read HTML documents. 

However, the Xerces Native Interface (XNI) that is the foundation 
of the Xerces2 implementation defines a framework that allows 
different kinds of parsers to be constructed by connecting a
pipeline of parser components. Therefore, as long as a component 
can be written that generates the appropriate XNI "events", then
it can be used to emit SAX events, build DOM trees, or anything
else that you can think of.

So, as a fun little exercise, I have written a basic HTML parser 
using XNI. It consists of two components: an HTML scanner that
scans HTML files and generates XNI events, and a tag balancer.
The tag balancer cleans up the events produced by the scanner,
balancing mismatched tags and adding tags where necessary. And
it does all of this in a streaming manner to minimize the amount
of memory required.
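
To make the tag balancing idea concrete, here is a minimal sketch in
Java. It is not the NekoHTML source: the class name, the method names
and the short list of empty elements are all hypothetical, and the
output is collected in a list only so the result is easy to inspect
(a streaming balancer would forward each event to the next component
instead). It keeps a stack of open elements, closes anything left
open inside an element that ends, drops stray end tags, and closes
whatever remains at the end of the document.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a tag balancer; not the NekoHTML code. */
class TagBalancerSketch {
    // Elements that never take an end tag (a small illustrative subset).
    private static final Set<String> EMPTY =
        new HashSet<>(Arrays.asList("br", "hr", "img", "meta", "input"));

    private final Deque<String> open = new ArrayDeque<>(); // innermost first
    private final List<String> out = new ArrayList<>();    // balanced events

    void startElement(String name) {
        out.add("<" + name + ">");
        if (!EMPTY.contains(name)) {
            open.push(name);
        }
    }

    void characters(String text) {
        out.add(text);
    }

    void endElement(String name) {
        if (!open.contains(name)) {
            return; // stray end tag with no matching start tag: drop it
        }
        // Close every element still open inside the one being closed,
        // then the element itself.
        String top;
        do {
            top = open.pop();
            out.add("</" + top + ">");
        } while (!top.equals(name));
    }

    String endDocument() {
        while (!open.isEmpty()) {
            out.add("</" + open.pop() + ">");
        }
        return String.join("", out);
    }

    /** Feeds the events for: <b><i>text</b><p>more</span> */
    static String demo() {
        TagBalancerSketch b = new TagBalancerSketch();
        b.startElement("b");
        b.startElement("i");
        b.characters("text");
        b.endElement("b");    // <i> is still open and gets closed first
        b.startElement("p");
        b.characters("more");
        b.endElement("span"); // never opened, so it is ignored
        return b.endDocument();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // <b><i>text</i></b><p>more</p>
    }
}
```

Only the stack of currently open elements has to be remembered, which
is why this kind of clean-up can be done without holding the whole
document in memory.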

Since I wrote the HTML parser as an example of using XNI and
because the code is considered alpha quality (but it seems to
work quite well, actually!), I am posting the code with a very
limited license. Even though it contains the complete source
code for the HTML parser, the license only allows the user to
experiment but gives no right to actually use the code in a 
product.

If the source isn't "free" or "open", why release it at all?
I want to get an idea of what people think of the code first.
Then, if there's enough interest, I would like to either donate
the code to the Xerces-J project or make it available elsewhere
under a true open source license.

So, if you've been looking for a way to parse HTML documents
please try out the HTML parser and let me know what you think. 
There should be enough information in the documentation to get 
you started. Check out the "NekoHTML" project listed on my
Apache web site: http://www.apache.org/~andyc/

Have fun!

-- 
Andy Clark * andyc@apache.org

---------------------------------------------------------------------
In case of troubles, e-mail:     webmaster@xml.apache.org
To unsubscribe, e-mail:          general-unsubscribe@xml.apache.org
For additional commands, e-mail: general-help@xml.apache.org

Re: [ANNOUNCE] Xerces HTML Parser

Posted by Andy Clark <an...@apache.org>.

It was bugging me that the first version of the NekoHTML parser 
could only handle the character encoding "Cp1252" (which is the
basic Windows encoding), so I updated the code to be able to
automatically handle UTF-8 (w/ BOM) and UTF-16. In addition,
it can detect the presence of a <meta http-equiv='content-type'
content='text/html; charset=XXX'> tag and scan the remaining
document using charset "XXX", assuming that Java has an
appropriate decoder available.
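
As a rough illustration of the two detection steps described above,
the following Java sketch checks the first bytes of a document for a
byte order mark and pulls the charset name out of a <meta> content
value. It is an assumption about how such checks could be written,
not the NekoHTML code; the class name, method names and the regular
expression are hypothetical. Returning null means "nothing detected",
and the caller would keep whatever default it already uses.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of encoding detection; not the NekoHTML code. */
class EncodingSniffSketch {
    /** Returns the encoding implied by a byte order mark, or null if none. */
    static String detectBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }

    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*([\\w.:+-]+)", Pattern.CASE_INSENSITIVE);

    /**
     * Extracts the charset from a content value such as
     * "text/html; charset=XXX". Returns null if there is no charset
     * parameter or Java has no decoder for it.
     */
    static String charsetFromContent(String content) {
        Matcher m = CHARSET.matcher(content);
        if (!m.find()) {
            return null;
        }
        String name = m.group(1);
        try {
            return Charset.isSupported(name) ? name : null;
        } catch (IllegalArgumentException e) {
            return null; // not even a legal charset name
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<'};
        System.out.println(detectBom(withBom));
        System.out.println(charsetFromContent("text/html; charset=ISO-8859-1"));
    }
}
```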

You can download the latest code from the following URL:

  http://www.apache.org/~andyc/

I am very interested in hearing from people to see if the 
code is useful and if they think it should be a standard part 
of Xerces-J. 

Solving the problem of changing the character decoder in the
middle of the stream when the <meta> tag is detected was
rather interesting. If you want to know the technical 
details, read on...

The code isn't that complicated but it turned out to be not
as straightforward as I thought. First, the Java decoders have
a nasty habit of reading 8K of bytes even when asked for as
little as a single character! This is annoying, at best: you
can't swap in a new decoder because the original decoder has
already consumed more bytes than it should.
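
The read-ahead is easy to observe with a small experiment. The Java
sketch below wraps a byte stream in a counting filter (a hypothetical
helper written for this example), asks java.io.InputStreamReader for
a single character, and reports how many bytes were pulled from the
underlying stream. The exact number depends on the Java
implementation, so the code makes no claim about it beyond printing
it.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Demonstrates decoder read-ahead; the helper class is hypothetical. */
class ReadAheadDemo {
    /** Counts the bytes that have been read from the wrapped stream. */
    static class CountingInputStream extends FilterInputStream {
        long count;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b != -1) {
                count++;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                count += n;
            }
            return n;
        }
    }

    /** Reads one character and returns how many bytes that consumed. */
    static long bytesConsumedForOneChar() {
        byte[] data = new byte[64 * 1024];
        Arrays.fill(data, (byte) 'a');
        CountingInputStream in =
            new CountingInputStream(new ByteArrayInputStream(data));
        try {
            Reader reader = new InputStreamReader(in, StandardCharsets.ISO_8859_1);
            reader.read(); // ask for exactly one character
            return in.count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("bytes consumed: " + bytesConsumedForOneChar());
    }
}
```

Once those extra bytes are gone from the underlying stream, a second
decoder created on the same stream would start somewhere past the
point where the first one logically stopped.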

Then, even if the Java decoders were written to only consume
as many bytes as needed to return the requested characters,
there's still a problem caused by buffering. Since I buffer
a block of characters to improve performance, this again
consumes bytes *past* the <meta> tag, which destroys any
chance of changing the decoder mid-stream.

So to solve this problem, I wrote a "playback" input stream
which buffers all of the bytes read from the underlying input
stream. If the scanner detects a <meta> tag that changes
the encoding, then the stream is played back from the start.
And if the <body> tag is found (or a tag whose parent should
be the <body> tag), then the buffer is cleared. So at worst,
just the beginning of the document is buffered, which isn't
too bad.
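
A playback stream along these lines could look like the Java sketch
below. This is a reconstruction from the description above, not the
actual NekoHTML class: the class name and the playback() and clear()
method names are assumptions. It records every byte handed out until
clear() is called, and playback() makes the next reads start again
from the first recorded byte. Methods such as skip() and mark() are
left as inherited, so the sketch is only reliable when callers use
read().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a "playback" input stream. */
class PlaybackInputStreamSketch extends FilterInputStream {
    // Bytes read so far; null once clear() has been called.
    private ByteArrayOutputStream recorded = new ByteArrayOutputStream();
    private byte[] replay; // bytes being played back, or null
    private int replayPos;

    PlaybackInputStreamSketch(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        if (replay != null) {
            if (replayPos < replay.length) {
                return replay[replayPos++] & 0xFF;
            }
            replay = null; // playback finished, go back to the real stream
        }
        int b = in.read();
        if (b != -1 && recorded != null) {
            recorded.write(b);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (replay != null) {
            if (replayPos < replay.length) {
                int n = Math.min(len, replay.length - replayPos);
                System.arraycopy(replay, replayPos, buf, off, n);
                replayPos += n;
                return n;
            }
            replay = null;
        }
        int n = in.read(buf, off, len);
        if (n > 0 && recorded != null) {
            recorded.write(buf, off, n);
        }
        return n;
    }

    /** Restarts reading from the first recorded byte. */
    void playback() {
        if (recorded == null) {
            throw new IllegalStateException("buffer already cleared");
        }
        replay = recorded.toByteArray();
        replayPos = 0;
    }

    /** Stops recording, e.g. once the <body> tag has been seen. */
    void clear() {
        recorded = null;
    }

    static String demo() {
        byte[] data = "abcdefghij".getBytes(StandardCharsets.US_ASCII);
        PlaybackInputStreamSketch in =
            new PlaybackInputStreamSketch(new ByteArrayInputStream(data));
        try {
            in.read(new byte[4], 0, 4); // a decoder reads ahead
            in.playback();              // the scanner asks for a restart
            StringBuilder sb = new StringBuilder();
            for (int b; (b = in.read()) != -1; ) {
                sb.append((char) b);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // abcdefghij
    }
}
```

After playback() a new decoder can be wrapped around the same stream
object and will see the document from its first byte, no matter how
far the old decoder had read ahead.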

You may notice that if the stream is played back, then the
parser will scan document contents that it has already 
seen. This was simple enough to fix, though. When the
character encoding is changed, I note how many elements I
have already seen. Then, when the stream is re-scanned, I
ignore the events until the number of elements is back to
where I was when I detected the <meta> tag.
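
The element counting trick can be sketched in a few lines of Java.
Again this is an illustration under assumptions, not the NekoHTML
code: the class and method names are hypothetical, and only
start-element events are handled (character data and end-element
events seen during the re-scan would need the same kind of check).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of skipping already-reported elements on re-scan. */
class RescanFilterSketch {
    private int elementCount; // elements seen in the current pass
    private int skipUntil;    // elements already reported before the restart

    /** Called when the encoding changes and the stream is played back. */
    void restart() {
        skipUntil = elementCount;
        elementCount = 0;
    }

    /** Returns true if this start-element event should be passed on. */
    boolean startElement(String name) {
        elementCount++;
        return elementCount > skipUntil;
    }

    static List<String> demo() {
        RescanFilterSketch filter = new RescanFilterSketch();
        List<String> forwarded = new ArrayList<>();
        // First pass: the scanner gets as far as the <meta> tag.
        for (String name : new String[] {"html", "head", "meta"}) {
            if (filter.startElement(name)) {
                forwarded.add(name);
            }
        }
        filter.restart(); // encoding changed, stream is played back
        // Second pass: the same elements come by again, then new ones.
        for (String name : new String[] {"html", "head", "meta", "title", "body"}) {
            if (filter.startElement(name)) {
                forwarded.add(name);
            }
        }
        return forwarded;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [html, head, meta, title, body]
    }
}
```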

So there's got to be an easier way to change the decoder
of the stream than to go through all of this trouble,
right? Not unless I want to re-write every known character
decoder. So I'm stuck with this kind of a solution. But
it seems to work very well.

-- 
Andy Clark * andyc@apache.org

Re: [ANNOUNCE] Xerces HTML Parser

Posted by Andy Clark <an...@apache.org>.

I've posted a bug fix release for the latest version of the
NekoHTML parser. This release fixes the following bugs:

  * Attributes were being removed from all elements in the
    SAX parser because the HTML parser configuration didn't
    have a symbol table. (I still don't use a symbol table
    but adding a symbol table works around the problem.)
  * The NekoHTML parser couldn't be used with Xalan 2.3.0 as 
    a source for transformations. (Thanks to Fred Yankowski
    for pointing this out and helping me work through it.)

Both of these bugs have been cleaned up in the Xerces
codebase so the next release of Xerces will solve these
problems. You can download the latest code from:

  http://www.apache.org/~andyc/

-- 
Andy Clark * andyc@apache.org