Posted to java-user@lucene.apache.org by "Stone, Timothy" <ts...@cityofhbg.com> on 2002/09/06 20:38:46 UTC

Demo provided HTML parser bug (was RE: Newbie quizzes further...)

List Fellows:

Lacking any knowledge of JavaCC, I solicited help in hacking the
HTMLParser.jj included in the demo. I withdraw that solicitation, for two
reasons: 1) I'm using other ideas gleaned from the list archives, 2) I'm not
prepared to dive into the world of compiler compilers. The mere sound of it
is intimidating.

So the bug. (If the bug is not worth fixing in the provided HTMLParser, drop
another one in, like Quiotix's; I did.)

Summary:
The current HTMLParser fails to correctly handle HTML decimal entities.

<title>MyWebsite&#8212;Home Page</title>
<p>My website&#8217;s address is...</p>

The following is produced after indexing the HTML and performing a query:

MyWebsite?Home Page
My website?s address is...
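For anyone who wants to work around this before the text reaches the indexer, here is a minimal sketch of decoding decimal entities in plain Java. It is not the demo HTMLParser's code; the class and method names (EntityDecoder, decodeDecimalEntities) are hypothetical, and it handles only the decimal &#NNNN; form, not hex or named entities. It also does no filtering of control characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or its demo.
final class EntityDecoder {
    // Matches decimal references such as &#8212;
    // (1 to 7 digits keeps Integer.parseInt safely in range).
    private static final Pattern DECIMAL_ENTITY = Pattern.compile("&#(\\d{1,7});");

    static String decodeDecimalEntities(String text) {
        Matcher m = DECIMAL_ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            // Leave references outside the Unicode range untouched.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            // quoteReplacement guards against '$' and '\' in the replacement.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling EntityDecoder.decodeDecimalEntities on the title text above should yield the em dash character (U+2014) in place of &#8212;, provided the rest of the pipeline keeps the text as Unicode.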

Another problem is manifest in the following oddity:

Given the following *source*; **note the use of the ampersand entity**

<title>MyWebsite&amp;#8212;Home Page</title> 
<p>My website&amp;#8217;s address is...</p>

This produces the output (where two dashes represent an em dash)

MyWebsite--Home Page
My website's address is...

And the source of the *results* appears correctly, even if the source
document that was indexed is incorrect! Some kind of entity replacement is
occurring here.

<title>MyWebsite&#8212;Home Page</title>
<p>My website&#8217;s address is...</p>

(I ran across the latter oddity courtesy of Adobe GoLive's annoying syntax
rewriter.)
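One possible explanation, and it is only a guess on my part, is that the parser decodes &amp; in a single pass and leaves the rest alone, so the text handed back is a literal &#8212; which the browser then renders. The sketch below illustrates that guess with a hypothetical stand-in (SinglePassDemo, decodeAmpOnly); it is not the demo parser's actual logic.

```java
// Hypothetical stand-in for a parser that decodes only the &amp; entity.
final class SinglePassDemo {
    static String decodeAmpOnly(String text) {
        // A single pass: the '&' produced here is not re-examined,
        // so "&amp;#8212;" ends up as the literal text "&#8212;".
        return text.replace("&amp;", "&");
    }
}
```

If that is what happens, the stored text contains a well-formed decimal entity, which would explain why the results page looks right even though the indexed document was wrong.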

Now, some might be asking, and rightly so, why hasn't this been seen before?
I know a search in the archives didn't turn anything up. It's likely because
the use of decimal entities is misunderstood by the HTML community at large.
For instance, some, quite possibly a whole lot, use &#151; for an em
dash; this is incorrect, as the whole range &#127; to &#159; is invalid.
Second, many may use named entities. A named entity, e.g. &mdash;, is fine,
but decimal encoding provides more consistent behavior cross-platform.
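To spot the invalid &#127; to &#159; references in existing pages, something like the following sketch might do. The class and method names (SuspectEntityFinder, findSuspectReferences) are made up for illustration, and it only looks at decimal references.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical checker: lists decimal references in the 127..159 range.
final class SuspectEntityFinder {
    private static final Pattern DECIMAL_ENTITY = Pattern.compile("&#(\\d{1,7});");

    static List<String> findSuspectReferences(String html) {
        List<String> suspects = new ArrayList<>();
        Matcher m = DECIMAL_ENTITY.matcher(html);
        while (m.find()) {
            int value = Integer.parseInt(m.group(1));
            // 127 through 159 are control codes, not printable characters.
            if (value >= 127 && value <= 159) {
                suspects.add(m.group());
            }
        }
        return suspects;
    }
}
```

Run against a page that mixes &#151; and &#8212;, it should report only the former, which can then be replaced by hand with the proper reference.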

For more on this, read "The Trouble with EM 'n EN and Other Shady
Characters" at A List Apart (www.alistapart.com/stories/emen/) 

Yours in Lucene.
Tim


