Posted to solr-user@lucene.apache.org by Jeffrey Baker <jw...@gmail.com> on 2009/02/09 20:59:54 UTC

Improving the highlighter output for use in html

The default highlighter output is bogus if you're trying to use the
snippets in a web browser.  With the default <em></em> delimiters, the
temptation is to just stick the snippets in an innerHTML property, but
the problem is that other HTML special characters (< > and &) are not
escaped.  For example, a highlight snippet might look like this:

"<em>this</em> & that"

... which, when assigned to innerHTML, will throw an exception (in
Firefox at least).  The other option is to just assign the snippets to
an HTML text node, in which case your user sees a literal
"<em>this</em>" on the screen, which isn't desirable either.

There is no simple solution to this problem because the simple
formatter lacks an escape function.  Even if I set the pre/post to
[[[!!!111ONE]]], it would still fail when searching for this message.
The more satisfactory solution would seem to be to escape special
chars in the simple formatter, at least optionally, perhaps via a
parameter such as hl.simple.escapeHTML=true.  Another option would be
to use a structure for highlight output instead of a bare string.
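
To make the escaping idea concrete, here is a rough, self-contained
sketch.  It is not Solr or Lucene code: the class EscapedSnippet and
its methods are hypothetical names, and it assumes the match offsets
are already known, sorted and non-overlapping.  The point is only that
the text gets escaped while the pre/post delimiters do not.

```java
import java.util.List;

// Hypothetical helper, not part of Solr or Lucene: builds an HTML-safe
// snippet from raw text plus the character ranges of the matched terms.
class EscapedSnippet {

    // Escapes the characters that are special in HTML text and attribute values.
    static String escapeHtml(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    // matches: assumed sorted, non-overlapping {start, end} offsets into
    // text, with end exclusive. pre/post are emitted verbatim (unescaped).
    static String highlight(String text, List<int[]> matches, String pre, String post) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (int[] m : matches) {
            // Escape the text between the previous match and this one.
            out.append(escapeHtml(text.substring(pos, m[0])));
            // Wrap the escaped match in the delimiters.
            out.append(pre).append(escapeHtml(text.substring(m[0], m[1]))).append(post);
            pos = m[1];
        }
        // Escape whatever follows the last match.
        out.append(escapeHtml(text.substring(pos)));
        return out.toString();
    }
}
```

With the text "this & that" and a single match covering "this", the
result is "<em>this</em> &amp; that", which should be safe to assign
to innerHTML.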

Note that a resolution to SOLR-175 would also solve this.

Anybody have a way to work around this?

-jwb

Re: Improving the highlighter output for use in html

Posted by Jeffrey Baker <jw...@gmail.com>.
On Mon, Feb 9, 2009 at 2:59 PM, Jeffrey Baker <jw...@gmail.com> wrote:
> The default highlighter output is bogus if you're trying to use the
> snippets in a web browser.  With the default <em></em> delimiters, the
> temptation is to just stick the snippets in an innerHTML property, but
> the problem is that other HTML special characters (< > and &) are not
> escaped.  For example, a highlight snippet might look like this:
>
> "<em>this</em> & that"

So, there's a "SimpleHTMLEncoder" in Lucene (also included in the Solr
distribution) and that could be of use here.  When highlightTerm is
called, SimpleHTMLFormatter would run the text through
SimpleHTMLEncoder; this is naturally going to result in some extra
garbage generation.  Is there any better place to put it?
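
On the garbage question, one possible mitigation (a sketch only, not
how SimpleHTMLEncoder is actually written) is to allocate a buffer only
when the text really contains a special character, and to return the
original string otherwise.  LazyHtmlEncoder below is a hypothetical
stand-in for whatever encoder ends up being called:

```java
// Hypothetical encoder illustrating a no-allocation fast path; it is not
// the Lucene SimpleHTMLEncoder and makes no claim about its internals.
class LazyHtmlEncoder {

    static String encode(String s) {
        StringBuilder out = null; // created only when the first special char is seen
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            String rep;
            switch (c) {
                case '<': rep = "&lt;"; break;
                case '>': rep = "&gt;"; break;
                case '&': rep = "&amp;"; break;
                case '"': rep = "&quot;"; break;
                default:  rep = null;
            }
            if (rep != null) {
                if (out == null) {
                    // First special char: copy the clean prefix seen so far.
                    out = new StringBuilder(s.length() + 16);
                    out.append(s, 0, i);
                }
                out.append(rep);
            } else if (out != null) {
                out.append(c);
            }
        }
        // No special chars found: hand back the same instance, no garbage.
        return out == null ? s : out.toString();
    }
}
```

Most highlighted terms contain no special characters at all, so under
this approach the common case would create no new objects.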

-jwb