Posted to java-user@lucene.apache.org by Michael Giles <mg...@visionstudio.com> on 2003/09/18 22:50:53 UTC

HTML Parsing problems...

I know, I know, the HTML Parser in the demo is just that (i.e. a demo), but 
I also know that it is updated from time to time and performs much better 
than the other ones that I have tested.  Frustratingly, the very first page 
I tried to parse failed 
(http://www.theregister.co.uk/content/54/32593.html). 
It seems to be choking on tags that are written inside JavaScript 
code (e.g. document.write('</scr' + 'ipt>');).  Obviously, the simple 
solution (that I am using with another parser) is to just ignore everything 
inside <script> tags.  It appears that the parser is ignoring text 
inside script tags, but it seems like it needs to be a bit smarter (or 
maybe dumber) about how it deals with this, so it doesn't get confused by 
such occurrences.  I see a bug has been filed regarding trouble parsing 
JavaScript; has anyone given it thought?

Outside of the HTML parsing, all is well (and apart from a few pages, the 
parser is a champ).

Thanks!
-Mike  

Re: HTML Parsing problems...

Posted by Michael Giles <mg...@visionstudio.com>.
Tatu,

Thanks for the reply.  See below for comments.

> > just ignore everything inside <script> tags.  It appears that the parser
> > is ignoring text inside script tags, but it seems like it needs to be a bit
> > smarter (or maybe dumber) about how it deals with this (so it doesn't get
>
>I would guess that ignoring stuff in <script> (for indexing purposes) often
>makes sense; the exception being someone who wants to create an HTML site
>creation IDE (and specifically wants to search for stuff in JavaScript
>sections?).  Nonetheless the HTML parser has to be able to handle these, I think.

Fortunately, the sole purpose of the parser that ships with Lucene is 
indexing HTML documents.  As such, I see no reason to worry about 
functionality for other use cases (e.g. IDE development).  There are plenty 
of other parsers out there that try to be complete.  It would be great if 
this one were optimized for the task at hand (and could thus ignore text 
inside <script> tags).
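
As a sketch of what that optimization could look like (the class and 
details here are illustrative, not actual Lucene code), a SAX-style handler 
can simply suppress character events while inside <script> or <style>:

  import org.xml.sax.Attributes;
  import org.xml.sax.helpers.DefaultHandler;

  public class IndexingTextHandler extends DefaultHandler {
      private final StringBuilder text = new StringBuilder();
      private int skipDepth = 0; // > 0 while inside <script> or <style>

      private static boolean skippable(String name) {
          return "script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name);
      }
      public void startElement(String uri, String local, String qName, Attributes atts) {
          // depending on the parser, either the local name or qName is populated
          if (skippable(local) || skippable(qName)) skipDepth++;
      }
      public void endElement(String uri, String local, String qName) {
          if ((skippable(local) || skippable(qName)) && skipDepth > 0) skipDepth--;
      }
      public void characters(char[] ch, int start, int len) {
          if (skipDepth == 0) text.append(ch, start, len).append(' ');
      }
      public String getText() { return text.toString(); }
  }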

> > confused by such occurrences).  I see a bug has been filed regarding
> > trouble parsing JavaScript; has anyone given it thought?
>
>If anyone is interested I could share the source code and/or (if I have
>time) implement an efficient fault-tolerant indexer.
>Like I said, this also works equally well for well-formed XML, but that's
>nothing special.

I'd definitely be interested to see what you did.  My application needs to 
index "public" documents as users submit requests (eventually thousands per 
day), so I don't have control over the HTML (i.e. the parser needs to be 
fault tolerant) and it needs to be efficient.  Parsing a big page (e.g. 
http://www.mysql.com/documentation/mysql/bychapter/manual_Reference.html) 
is another good way to stress the basic parsers (some are frighteningly CPU 
intensive).
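
Even a crude harness like the one below is enough to catch the CPU-hungry 
ones; extractText() is just a stand-in for whichever parser is under test, 
not a real API:

  import java.io.*;

  public class ParseTimer {
      public static void main(String[] args) throws IOException {
          // slurp one large, locally saved page
          Reader in = new BufferedReader(new FileReader(args[0]));
          StringWriter buf = new StringWriter();
          for (int c; (c = in.read()) != -1; ) buf.write(c);
          in.close();

          long t0 = System.currentTimeMillis();
          String text = extractText(buf.toString()); // swap in the parser under test
          long ms = System.currentTimeMillis() - t0;
          System.out.println(text.length() + " chars extracted in " + ms + " ms");
      }
      static String extractText(String html) { return html; /* placeholder */ }
  }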

Even though I think a solid HTML parser that is optimized for the task of 
indexing is actually quite important to Lucene, we can take any further 
discussions off-line as they are probably not deemed relevant to the Lucene 
list.

-Mike





Re: HTML Parsing problems...

Posted by Peter Becker <pb...@dstc.edu.au>.
Tatu Saloranta wrote:

>On Thursday 18 September 2003 14:50, Michael Giles wrote:
>
>>I know, I know, the HTML Parser in the demo is just that (i.e. a demo), but
>>I also know that it is updated from time to time and performs much better
>>than the other ones that I have tested.  Frustratingly, the very first page
>>I tried to parse failed
>>(http://www.theregister.co.uk/content/54/32593.html).  It seems to be
>>choking on tags that are written inside JavaScript code
>>(e.g. document.write('</scr' + 'ipt>');).  Obviously, the simple solution
>>(that I am using with another parser) is to just ignore everything inside
>><script> tags.  It appears that the parser is ignoring text inside script
>>tags, but it seems like it needs to be a bit smarter (or maybe dumber)
>>about how it deals with this (so it doesn't get
>
>I would guess that ignoring stuff in <script> (for indexing purposes) often
>makes sense; the exception being someone who wants to create an HTML site
>creation IDE (and specifically wants to search for stuff in JavaScript
>sections?).  Nonetheless the HTML parser has to be able to handle these, I
>think.
>
>>confused by such occurrences).  I see a bug has been filed regarding
>>trouble parsing JavaScript; has anyone given it thought?
>
>I implemented a rather robust (X[HT])ML parser ("QnD") that was able to work
>through many such issues (<script> tags, unquoted '&' and '<' characters in
>attribute values and elements, a simplistic approach to optional end tags).
>Since it was heavily optimized for speed (everything fully in memory in a
>char array, with optimizations based on that), I thought it might be useful
>for indexing, even more so than for its original purpose, which was to be a
>very fast utility for filtering (adding and/or removing content in) HTML
>pages.
>
>If anyone is interested I could share the source code and/or (if I have
>time) implement an efficient fault-tolerant indexer.
>Like I said, this also works equally well for well-formed XML, but that's
>nothing special.
>
We had reasonably good experiences with this simple bit of code, using 
Swing's HTML parser:

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/source/org/tockit/docco/documenthandler/HtmlDocumentHandler.java?rev=1.4&content-type=text/vnd.viewcvs-markup

We haven't tested it much, but it does grok a local copy of the link given.
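
For anyone who doesn't want to chase the CVS link, the core of the approach 
looks roughly like this (a compressed sketch, not the actual 
HtmlDocumentHandler code):

  import java.io.FileReader;
  import javax.swing.text.html.HTMLEditorKit;
  import javax.swing.text.html.parser.ParserDelegator;

  public class SwingHtmlText {
      public static void main(String[] args) throws Exception {
          final StringBuilder text = new StringBuilder();
          HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
              public void handleText(char[] data, int pos) {
                  // a fuller version would also skip text inside <script>,
                  // e.g. by tracking handleStartTag/handleEndTag
                  text.append(data).append(' ');
              }
          };
          // third argument: ignore the charset declared in the document
          new ParserDelegator().parse(new FileReader(args[0]), callback, true);
          System.out.println(text);
      }
  }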

Here is our XML parsing code:

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/source/org/tockit/docco/documenthandler/XmlDocumentHandler.java?rev=1.6&content-type=text/vnd.viewcvs-markup
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/source/org/tockit/docco/documenthandler/SaxTextContentParser.java?rev=1.3&content-type=text/vnd.viewcvs-markup

The XML bit is not too good yet; e.g. it chokes easily on large XML 
since it reads all content into memory at once.
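
One way around that, sketched below with plain JAXP/SAX (illustrative only, 
not the Docco code), is to consume the text chunk by chunk as characters() 
fires instead of buffering the whole document:

  import javax.xml.parsers.SAXParserFactory;
  import org.xml.sax.helpers.DefaultHandler;

  public class StreamingXmlText {
      public static void main(String[] args) throws Exception {
          SAXParserFactory.newInstance().newSAXParser().parse(
              new java.io.File(args[0]),
              new DefaultHandler() {
                  public void characters(char[] ch, int start, int len) {
                      // hand each chunk straight to the indexer
                      // instead of accumulating it all in memory
                      System.out.print(new String(ch, start, len));
                  }
              });
      }
  }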

Some of the code will require JDK 1.4, though.  The XML part relies on 
JAXP; I don't know about the HTMLEditorKit.

JTidy seems to be another option.

  Peter



Re: HTML Parsing problems...

Posted by Tatu Saloranta <ta...@hypermall.net>.
On Thursday 18 September 2003 14:50, Michael Giles wrote:
> I know, I know, the HTML Parser in the demo is just that (i.e. a demo), but
> I also know that it is updated from time to time and performs much better
> than the other ones that I have tested.  Frustratingly, the very first page
> I tried to parse failed
> (http://www.theregister.co.uk/content/54/32593.html).  It seems to be
> choking on tags that are written inside JavaScript code
> (e.g. document.write('</scr' + 'ipt>');).  Obviously, the simple solution
> (that I am using with another parser) is to just ignore everything inside
> <script> tags.  It appears that the parser is ignoring text inside script
> tags, but it seems like it needs to be a bit smarter (or maybe dumber)
> about how it deals with this (so it doesn't get

I would guess that ignoring stuff in <script> (for indexing purposes) often 
makes sense; the exception being someone who wants to create an HTML site 
creation IDE (and specifically wants to search for stuff in JavaScript 
sections?).  Nonetheless the HTML parser has to be able to handle these, I think.

> confused by such occurrences).  I see a bug has been filed regarding
> trouble parsing JavaScript; has anyone given it thought?

I implemented a rather robust (X[HT])ML parser ("QnD") that was able to work 
through many such issues (<script> tags, unquoted '&' and '<' characters in 
attribute values and elements, a simplistic approach to optional end tags). 
Since it was heavily optimized for speed (everything fully in memory in a 
char array, with optimizations based on that), I thought it might be useful 
for indexing, even more so than for its original purpose, which was to be a 
very fast utility for filtering (adding and/or removing content in) HTML pages.
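
Since I haven't posted QnD itself, here is only a rough illustration of 
that style of scan: the whole document in a single char[], one cursor, and 
forgiving handling of tags that never close (entities and script handling 
omitted for brevity):

  public class CharArrayScanSketch {
      public static String extractText(char[] doc) {
          StringBuilder out = new StringBuilder(doc.length);
          int i = 0;
          while (i < doc.length) {
              if (doc[i] == '<') {
                  int close = indexOf(doc, '>', i + 1);
                  // stray '<' with no closing '>': keep the rest as text
                  if (close < 0) { out.append(doc, i, doc.length - i); break; }
                  i = close + 1;   // skip the whole tag
              } else {
                  out.append(doc[i++]);
              }
          }
          return out.toString();
      }
      private static int indexOf(char[] a, char c, int from) {
          for (int i = from; i < a.length; i++) if (a[i] == c) return i;
          return -1;
      }
      public static void main(String[] args) {
          // prints: a bold claim
          System.out.println(extractText("a <b>bold</b> claim".toCharArray()));
      }
  }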

If anyone is interested I could share the source code and/or (if I have 
time) implement an efficient fault-tolerant indexer. 
Like I said, this also works equally well for well-formed XML, but that's 
nothing special.

-+ Tatu +-


Re: HTML Parsing problems...

Posted by Michael Giles <mg...@visionstudio.com>.
Yeah, I was using HTMLParser for a few days until I tried to parse a 400K 
document and it spun at 100% CPU for a very long time.  It is tolerant of 
bad HTML, but does not appear to scale.  TagSoup processed the same 
document in a second or less at <25% CPU.

-Mike

At 02:42 PM 9/22/2003 +0200, you wrote:

>TagSoup is great - however, it is neither maintained nor actively developed 
>(the same could be said about JTidy as well, but TagSoup's history is much 
>shorter...). I'm using HTMLParser (http://htmlparser.sourceforge.net) for 
>my application, and it also works very well, even on ill-formed input. 
>It's also very actively developed.
>
>--
>Best regards,
>Andrzej Bialecki
>
>-------------------------------------------------
>Software Architect, System Integration Specialist
>CEN/ISSS EC Workshop, ECIMF project chair
>EU FP6 E-Commerce Expert/Evaluator
>-------------------------------------------------




Re: HTML Parsing problems...

Posted by Andrzej Bialecki <ab...@getopt.org>.
Michael Giles wrote:
> Erik,
> 
> Probably a good idea to swap something else in, although Neko introduces 
> a dependency on Xerces.  I didn't play with Neko because I am currently 
> using a different XML parser and didn't want to deal with the conflicts 
> (and also find dependencies on specific parsers annoying).  However, 
> yesterday I downloaded 
> TagSoup (http://mercury.ccil.org/~cowan/XML/tagsoup/) and it is great!  
> It is small and fast and so far has parsed every page I've thrown at 
> it.  I wrote a SAX ContentHandler that only grabs the text and does a 
> few other little things (like inserting spaces, removing tabs/line 
> feeds, and grabbing the title), and it seems to be a perfect fit for the 
> job.  It requires the SAX framework, but is parser independent.  The only 
> tweak I made to the TagSoup code was to add an "else" to fix a bug where 
> it was consuming ";" after entities that it did not recognize.

TagSoup is great - however, it is neither maintained nor actively developed 
(the same could be said about JTidy as well, but TagSoup's history is much 
shorter...). I'm using HTMLParser (http://htmlparser.sourceforge.net) 
for my application, and it also works very well, even on ill-formed 
input. It's also very actively developed.
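
Basic text extraction with it looks roughly like this (quoting its 1.x API 
from memory, so treat the exact names as approximate):

  import org.htmlparser.Parser;
  import org.htmlparser.visitors.TextExtractingVisitor;

  public class HtmlParserText {
      public static void main(String[] args) throws Exception {
          Parser parser = new Parser(args[0]); // URL or file name
          TextExtractingVisitor visitor = new TextExtractingVisitor();
          parser.visitAllNodesWith(visitor);   // walk the node tree
          System.out.println(visitor.getExtractedText());
      }
  }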

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)






Re: HTML Parsing problems...

Posted by Michael Giles <mg...@visionstudio.com>.
Erik,

Probably a good idea to swap something else in, although Neko introduces a 
dependency on Xerces.  I didn't play with Neko because I am currently using 
a different XML parser and didn't want to deal with the conflicts (and also 
find dependencies on specific parsers annoying).  However, yesterday I 
downloaded TagSoup (http://mercury.ccil.org/~cowan/XML/tagsoup/) and it is 
great!  It is small and fast and so far has parsed every page I've thrown 
at it.  I wrote a SAX ContentHandler that only grabs the text and does a 
few other little things (like inserting spaces, removing tabs/line feeds, 
and grabbing the title), and it seems to be a perfect fit for the job.  It 
requires the SAX framework, but is parser independent.  The only tweak I 
made to the TagSoup code was to add an "else" to fix a bug where it was 
consuming ";" after entities that it did not recognize.

If Neko is potentially headed into the Apache fold, that probably makes 
sense.  But if you are interested in my TagSoup ContentHandler for testing 
it out, just let me know.

-Mike

At 08:08 PM 9/19/2003 -0400, you wrote:
>I'm going to swap in the neko HTML parser for the demo refactorings I'm
>doing.  I would be all for replacing the demo HTML parser with this.
>
>If you look at the Ant <index> task in the sandbox, you'll see that I
>used JTidy for it and it works well, but I've heard that neko is faster
>and better so I'll give it a try.
>
>         Erik
>




Re: HTML Parsing problems...

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
I'm going to swap in the neko HTML parser for the demo refactorings I'm  
doing.  I would be all for replacing the demo HTML parser with this.

If you look at the Ant <index> task in the sandbox, you'll see that I  
used JTidy for it and it works well, but I've heard that neko is faster  
and better so I'll give it a try.
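
For anyone who wants to try neko ahead of that, usage is roughly the 
following (an approximate sketch; note that it does pull in Xerces, as 
mentioned elsewhere in the thread):

  import org.cyberneko.html.parsers.DOMParser;
  import org.w3c.dom.Node;
  import org.xml.sax.InputSource;

  public class NekoTextDemo {
      public static void main(String[] args) throws Exception {
          DOMParser parser = new DOMParser(); // NekoHTML's Xerces-based parser
          parser.parse(new InputSource(args[0]));
          StringBuilder out = new StringBuilder();
          walk(parser.getDocument(), out);
          System.out.println(out);
      }
      // collect text nodes, skipping script/style subtrees
      static void walk(Node n, StringBuilder out) {
          String name = n.getNodeName();
          if ("script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name)) return;
          if (n.getNodeType() == Node.TEXT_NODE) out.append(n.getNodeValue()).append(' ');
          for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) walk(c, out);
      }
  }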

	Erik


On Thursday, September 18, 2003, at 04:50  PM, Michael Giles wrote:

> I know, I know, the HTML Parser in the demo is just that (i.e. a
> demo), but I also know that it is updated from time to time and
> performs much better than the other ones that I have tested.
> Frustratingly, the very first page I tried to parse failed
> (http://www.theregister.co.uk/content/54/32593.html).  It seems to be
> choking on tags that are written inside JavaScript code
> (e.g. document.write('</scr' + 'ipt>');).  Obviously, the simple
> solution (that I am using with another parser) is to just ignore
> everything inside <script> tags.  It appears that the parser is
> ignoring text inside script tags, but it seems like it needs to be a
> bit smarter (or maybe dumber) about how it deals with this (so it
> doesn't get confused by such occurrences).  I see a bug has been filed
> regarding trouble parsing JavaScript; has anyone given it thought?
>
> Outside of the HTML parsing, all is well (and apart from a few pages,
> the parser is a champ).
>
> Thanks!
> -Mike

