Posted to java-user@lucene.apache.org by Michael Giles <mg...@visionstudio.com> on 2003/09/18 22:50:53 UTC
HTML Parsing problems...
I know, I know, the HTML Parser in the demo is just that (i.e. a demo), but
I also know that it is updated from time to time and performs much better
than the other ones that I have tested. Frustratingly, the very first page
I tried to parse failed
(http://www.theregister.co.uk/content/54/32593.html).
It seems to be choking on tags that are being written inside of JavaScript
code (e.g. document.write('</scr' + 'ipt>');). Obviously, the simple
solution (that I am using with another parser) is to just ignore everything
inside of <script> tags. It appears that the parser is ignoring text
inside script tags, but it seems like it needs to be a bit smarter (or
maybe dumber) about how it deals with this (so it doesn't get confused by
such occurrences). I see a bug has been filed regarding trouble parsing
JavaScript; has anyone given it thought?
Outside of the HTML parsing, all is well (and outside of a few pages, the
parser is a champ).
Thanks!
-Mike
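The failure is easy to see: a tokenizer that rescans for tags inside <script> trips over the '</scr' in the string literal. A minimal sketch of the "dumber" approach suggested above — once a <script> opens, skip everything until a literal closing tag — could look like this (class and method names are illustrative, not the Lucene demo parser):

```java
// Illustrative sketch, not the Lucene demo parser: once a <script> tag
// opens, ignore every character until the literal "</script...>", so
// markup emitted from document.write() cannot confuse the tokenizer.
class ScriptStripper {
    public static String strip(String html) {
        StringBuffer out = new StringBuffer();
        String lower = html.toLowerCase();
        int i = 0;
        while (i < html.length()) {
            int open = lower.indexOf("<script", i);
            if (open < 0) {                // no more script blocks
                out.append(html.substring(i));
                break;
            }
            out.append(html.substring(i, open));
            int close = lower.indexOf("</script", open);
            if (close < 0) break;          // unterminated script: drop the tail
            i = lower.indexOf('>', close);
            if (i < 0) break;
            i++;                           // resume just past "</script>"
        }
        return out.toString();
    }
}
```

Run against the document.write example, the script body (closing-tag fragment and all) simply disappears, which is what an indexer wants.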
Re: HTML Parsing problems...
Posted by Michael Giles <mg...@visionstudio.com>.
Tatu,
Thanks for the reply. See below for comments.
> > just ignore everything inside of <script> tags. It appears that the parser
> > is ignoring text inside script tags, but it seems like it needs to be a bit
> > smarter (or maybe dumber) about how it deals with this (so it doesn't get
>
>I would guess that often ignoring stuff in <script> (for indexing purposes)
>makes sense; exception being if someone wants to create HTML site creation
>IDE (like specifically wants to search for stuff in javascript sections?).
>Nonetheless HTML parser has to be able to handle these I think.
Fortunately, the sole purpose of the parser that ships with Lucene is
indexing HTML documents. As such, I see no reason to worry about
functionality for other use cases (e.g. IDE development). There are plenty
of other parsers out there that try to be complete. It would be great if
this one was optimized for the task at hand (and thus can ignore text
inside <script> tags).
> > confused by such occurrences). I see a bug has been filed regarding
> > trouble parsing JavaScript, has anyone given it thought?
>
>If anyone would be interested I could give the source code and/or (if I have
>time) to implement efficient fault-tolerant indexer.
>Like I said this also works equally well for well-formed XML, but that's
>nothing special.
I'd definitely be interested to see what you did. My application needs to
index "public" documents as users submit requests (eventually 1000's per
day), so I don't have control over the HTML (i.e. it needs to be fault
tolerant) and it needs to be efficient. Parsing a big page (e.g.
http://www.mysql.com/documentation/mysql/bychapter/manual_Reference.html)
is another good way to stress the basic parsers (some are frighteningly CPU
intensive).
Even though I think a solid HTML parser that is optimized for the task of
indexing is actually quite important to Lucene, we can take any further
discussions off-line as they are probably not deemed relevant to the Lucene
list.
-Mike
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: HTML Parsing problems...
Posted by Peter Becker <pb...@dstc.edu.au>.
Tatu Saloranta wrote:
>On Thursday 18 September 2003 14:50, Michael Giles wrote:
>
>
>>I know, I know, the HTML Parser in the demo is just that (i.e. a demo), but
>>I also know that it is updated from time to time and performs much better
>>than the other ones that I have tested. Frustratingly, the very first page
>>I tried to parse failed
>>(http://www.theregister.co.uk/content/54/32593.html). It seems to be choking on tags that are being
>>written inside of JavaScript code (i.e. document.write('</scr' + 'ipt>');.
>>Obviously, the simple solution (that I am using with another parser) is to
>>just ignore everything inside of <script> tags. It appears that the parser
>>is ignoring text inside script tags, but it seems like it needs to be a bit
>>smarter (or maybe dumber) about how it deals with this (so it doesn't get
>>
>>
>
>I would guess that often ignoring stuff in <script> (for indexing purposes)
>makes sense; exception being if someone wants to create HTML site creation
>IDE (like specifically wants to search for stuff in javascript sections?).
>Nonetheless HTML parser has to be able to handle these I think.
>
>
>
>>confused by such occurrences). I see a bug has been filed regarding
>>trouble parsing JavaScript, has anyone given it thought?
>>
>>
>
>I implemented a rather robust (X[HT])ML parser ("QnD") that was able to work
>through many of such issues (<script> tag, unquoted single '&' and '<' chars,
>in attr values and elements, simplistic approach to optional end tags). Since
>it was dead-optimized for speed (anything fully in memory in a char array,
>optimizing based on that) I thought it might be useful for indexing (even
>more so than for its original purpose which was to be very fast utility for
>filtering [adding and/or removing stuff] of HTML pages).
>
>If anyone would be interested I could give the source code and/or (if I have
>time) to implement efficient fault-tolerant indexer.
>Like I said this also works equally well for well-formed XML, but that's
>nothing special.
>
We had reasonably good experiences with this simple bit of code, using
Swing's HTML parser:
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/source/org/tockit/docco/documenthandler/HtmlDocumentHandler.java?rev=1.4&content-type=text/vnd.viewcvs-markup
We haven't tested it much, but it does grok a local copy of the link given.
Here is our XML parsing code:
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/source/org/tockit/docco/documenthandler/XmlDocumentHandler.java?rev=1.6&content-type=text/vnd.viewcvs-markup
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/source/org/tockit/docco/documenthandler/SaxTextContentParser.java?rev=1.3&content-type=text/vnd.viewcvs-markup
The XML bit is not too good yet. E.g. it chokes on large XML easily
since it reads all content into memory at once.
Some of the code will require JDK 1.4, though. The XML parsing relies on
JAXP; I don't know about the HTMLEditorKit.
JTidy seems to be another option.
Peter
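For reference, the Swing-based approach in the linked HtmlDocumentHandler can be sketched in a few lines with the JDK's own HTMLEditorKit.ParserCallback (the class below is an illustrative reduction, not the Docco code):

```java
import java.io.IOException;
import java.io.Reader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Illustrative reduction of the Swing-parser approach: collect the text
// chunks the parser hands to handleText() and separate them with spaces.
class SwingTextExtractor {
    public static String extract(Reader html) throws IOException {
        final StringBuffer text = new StringBuffer();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            public void handleText(char[] data, int pos) {
                text.append(data).append(' '); // space keeps adjacent words apart
            }
        };
        // 'true' tells the parser to ignore any charset declared in the page
        new ParserDelegator().parse(html, callback, true);
        return text.toString().trim();
    }
}
```

The Swing parser is lenient about bad HTML, though as noted above it hasn't been stressed much.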
Re: HTML Parsing problems...
Posted by Tatu Saloranta <ta...@hypermall.net>.
On Thursday 18 September 2003 14:50, Michael Giles wrote:
> I know, I know, the HTML Parser in the demo is just that (i.e. a demo), but
> I also know that it is updated from time to time and performs much better
> than the other ones that I have tested. Frustratingly, the very first page
> I tried to parse failed
> (http://www.theregister.co.uk/content/54/32593.html). It seems to be choking on tags that are being
> written inside of JavaScript code (i.e. document.write('</scr' + 'ipt>');.
> Obviously, the simple solution (that I am using with another parser) is to
> just ignore everything inside of <script> tags. It appears that the parser
> is ignoring text inside script tags, but it seems like it needs to be a bit
> smarter (or maybe dumber) about how it deals with this (so it doesn't get
I would guess that ignoring stuff in <script> (for indexing purposes) usually
makes sense; the exception being if someone wants to create an HTML site
creation IDE (and specifically wants to search for stuff in javascript
sections?). Nonetheless the HTML parser has to be able to handle these, I think.
> confused by such occurrences). I see a bug has been filed regarding
> trouble parsing JavaScript, has anyone given it thought?
I implemented a rather robust (X[HT])ML parser ("QnD") that was able to work
through many such issues (<script> tags, unquoted '&' and '<' chars in
attribute values and elements, a simplistic approach to optional end tags). Since
it was dead-optimized for speed (anything fully in memory in a char array,
optimizing based on that) I thought it might be useful for indexing (even
more so than for its original purpose which was to be very fast utility for
filtering [adding and/or removing stuff] of HTML pages).
If anyone is interested I could share the source code and/or (if I have
time) implement an efficient fault-tolerant indexer.
Like I said, this also works equally well for well-formed XML, but that's
nothing special.
-+ Tatu +-
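The QnD source isn't included here, but one of the recovery rules Tatu mentions — tolerating bare '&' and '<' characters in text — can be sketched roughly as below. This is a guess at the technique, not his code:

```java
// A guess at one QnD-style recovery rule, not Tatu's code: a '<' is only
// treated as markup when the next character could start a tag (a letter,
// '/' or '!'); anything else, like a bare '<' or '&', stays literal text.
class LenientScanner {
    public static String textOf(char[] doc) {
        StringBuffer text = new StringBuffer();
        int i = 0;
        while (i < doc.length) {
            char c = doc[i];
            if (c == '<' && i + 1 < doc.length
                    && (Character.isLetter(doc[i + 1])
                        || doc[i + 1] == '/' || doc[i + 1] == '!')) {
                while (i < doc.length && doc[i] != '>') i++; // skip the tag
                i++;                                         // past '>'
                text.append(' ');
            } else {
                text.append(c); // bare '<', '&', '>' kept as literal text
                i++;
            }
        }
        return text.toString().trim();
    }
}
```

Working over a char[] already in memory is also what makes the speed claim plausible: no per-character stream reads and no intermediate strings until the end.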
Re: HTML Parsing problems...
Posted by Michael Giles <mg...@visionstudio.com>.
Yeah, I was using HTMLParser for a few days until I tried to parse a 400K
document and it spun at 100% CPU for a very long time. It is tolerant of
bad HTML, but does not appear to scale. TagSoup processed the same
document in a second or less at <25% CPU.
-Mike
At 02:42 PM 9/22/2003 +0200, you wrote:
>TagSoup is great - however, it is not maintained nor developed (the same
>could be said about JTidy as well, but TagSoup's history is much
>shorter...). I'm using HTMLParser (http://htmlparser.sourceforge.net) for
>my application, and it also works very well, even for ill-formed input.
>It's also very actively developed.
>
>--
>Best regards,
>Andrzej Bialecki
>
>-------------------------------------------------
>Software Architect, System Integration Specialist
>CEN/ISSS EC Workshop, ECIMF project chair
>EU FP6 E-Commerce Expert/Evaluator
>-------------------------------------------------
Re: HTML Parsing problems...
Posted by Andrzej Bialecki <ab...@getopt.org>.
Michael Giles wrote:
> Erik,
>
> Probably a good idea to swap something else in, although Neko introduces
> a dependency on Xerces. I didn't play with Neko because I am currently
> using a different XML parser and didn't want to deal with the conflicts
> (and also find dependencies on specific parsers annoying). However,
> yesterday I downloaded
> TagSoup(http://mercury.ccil.org/~cowan/XML/tagsoup/) and it is great!
> It is small and fast and so far has parsed every page I've thrown at
> it. I wrote a SAX ContentHandler that only grabs the text and does a
> few other little things (like inserting spaces, removing tabs/line
> feeds, grabbing title) and it seems to be a perfect fit for the job. It
> requires the SAX framework, but is parser independent. The only tweak I
> made to the TagSoup code was to add an "else" to deal with a bug where
> it was consuming ";" after entities that it did not deal with.
TagSoup is great - however, it is neither maintained nor developed (the same
could be said about JTidy as well, but TagSoup's history is much
shorter...). I'm using HTMLParser (http://htmlparser.sourceforge.net)
for my application, and it also works very well, even for ill-formed
input. It's also very actively developed.
--
Best regards,
Andrzej Bialecki
-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)
Re: HTML Parsing problems...
Posted by Michael Giles <mg...@visionstudio.com>.
Erik,
Probably a good idea to swap something else in, although Neko introduces a
dependency on Xerces. I didn't play with Neko because I am currently using
a different XML parser and didn't want to deal with the conflicts (and also
find dependencies on specific parsers annoying). However, yesterday I
downloaded TagSoup (http://mercury.ccil.org/~cowan/XML/tagsoup/) and it is
great! It is small and fast and so far has parsed every page I've thrown
at it. I wrote a SAX ContentHandler that only grabs the text and does a
few other little things (like inserting spaces, removing tabs/line feeds,
grabbing title) and it seems to be a perfect fit for the job. It requires
the SAX framework, but is parser independent. The only tweak I made to the
TagSoup code was to add an "else" to deal with a bug where it was consuming
";" after entities that it did not deal with.
If Neko is potentially headed into the Apache fold, that probably makes
sense. But if you are interested in my TagSoup ContentHandler for testing
it out, just let me know.
-Mike
At 08:08 PM 9/19/2003 -0400, you wrote:
>I'm going to swap in the neko HTML parser for the demo refactorings I'm
>doing. I would be all for replacing the demo HTML parser with this.
>
>If you look at the Ant <index> task in the sandbox, you'll see that I
>used JTidy for it and it works well, but I've heard that neko is faster
>and better so I'll give it a try.
>
> Erik
>
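Mike's handler isn't attached, but a text-only ContentHandler along the lines he describes might look like the sketch below (illustrative names, not his code). Because it is parser independent, the same handler plugs into TagSoup, NekoHTML, or — as in this self-contained demo — the JDK's SAX parser on well-formed XHTML:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch of a text-only SAX handler: collect character data,
// normalize tabs and line feeds to spaces, and remember the <title>.
class TextGrabber extends DefaultHandler {
    private final StringBuffer text = new StringBuffer();
    private final StringBuffer title = new StringBuffer();
    private boolean inTitle = false;

    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("title".equalsIgnoreCase(qName) || "title".equalsIgnoreCase(local)) inTitle = true;
    }

    public void endElement(String uri, String local, String qName) {
        if ("title".equalsIgnoreCase(qName) || "title".equalsIgnoreCase(local)) inTitle = false;
    }

    public void characters(char[] ch, int start, int length) {
        String chunk = new String(ch, start, length).replaceAll("[\\t\\r\\n]+", " ");
        (inTitle ? title : text).append(chunk).append(' ');
    }

    public String getText()  { return text.toString().trim(); }
    public String getTitle() { return title.toString().trim(); }

    // Demo wiring with the JDK parser; swapping in TagSoup's XMLReader
    // is the one-line change that buys tolerance of ill-formed HTML.
    public static TextGrabber parse(String xhtml) throws Exception {
        TextGrabber handler = new TextGrabber();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler;
    }
}
```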
Re: HTML Parsing problems...
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
I'm going to swap in the neko HTML parser for the demo refactorings I'm
doing. I would be all for replacing the demo HTML parser with this.
If you look at the Ant <index> task in the sandbox, you'll see that I
used JTidy for it and it works well, but I've heard that neko is faster
and better so I'll give it a try.
Erik
On Thursday, September 18, 2003, at 04:50 PM, Michael Giles wrote:
> I know, I know, the HTML Parser in the demo is just that (i.e. a
> demo), but I also know that it is updated from time to time and
> performs much better than the other ones that I have tested.
> Frustratingly, the very first page I tried to parse failed
> (http://www.theregister.co.uk/content/54/32593.html). It seems to be choking
> on tags that are being written inside of JavaScript code (i.e.
> document.write('</scr' + 'ipt>');. Obviously, the simple solution
> (that I am using with another parser) is to just ignore everything
> inside of <script> tags. It appears that the parser is ignoring text
> inside script tags, but it seems like it needs to be a bit smarter (or
> maybe dumber) about how it deals with this (so it doesn't get confused
> by such occurrences). I see a bug has been filed regarding trouble
> parsing JavaScript, has anyone given it thought?
>
> Outside of the HTML parsing, all is well (and outside of a few pages,
> the parser is a champ).
>
> Thanks!
> -Mike
>