Posted to java-user@lucene.apache.org by okayndc <bo...@gmail.com> on 2012/04/05 19:34:10 UTC

HTML tags and Lucene highlighting

Hello,

I currently use Lucene version 3.0 and probably need to upgrade to a more
current version soon.
The problem I have is that when I test a search for an HTML tag (e.g.
<strong>), Lucene returns
the highlighted HTML tag, which is what I DO NOT want.  Is there a way to
"filter" HTML tags?
I have read up on HTMLStripCharFilter (packaged with Solr) and wondered if
this is the way to go?

Any help will be greatly appreciated,
Thanks

Re: HTML tags and Lucene highlighting

Posted by okayndc <bo...@gmail.com>.
I want to retain the formatted HTML in a result, but I want to ignore (or
filter out) HTML tags in a search, if that makes sense.

On Thu, Apr 5, 2012 at 3:44 PM, Steven A Rowe <sa...@syr.edu> wrote:

> okayndc,
>
> A field configured to use HTMLStripCharFilter as part of its index-time
> analyzer will strip out HTML tags before index terms are created by the
> tokenizer, so HTML tags will not be put into the index.  As a result,
> queries for HTML tags cannot match the original documents' HTML tags (in
> the field configured to use HTMLStripCharFilter, anyway).
>
> So HTMLStripCharFilter should do what you want.
>
> Steve

RE: HTML tags and Lucene highlighting

Posted by Steven A Rowe <sa...@syr.edu>.
okayndc,

A field configured to use HTMLStripCharFilter as part of its index-time analyzer will strip out HTML tags before index terms are created by the tokenizer, so HTML tags will not be put into the index.  As a result, queries for HTML tags cannot match the original documents' HTML tags (in the field configured to use HTMLStripCharFilter, anyway).

So HTMLStripCharFilter should do what you want.

Steve
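
To illustrate the point above, here is a minimal sketch of why stripping tags before tokenization keeps tag names out of the index. It is not Lucene code: the class `TagStripSketch` and its regex-based `stripTags` are hypothetical stand-ins for the char filter and tokenizer, and a real setup would wire HTMLStripCharFilter into the field's index-time analyzer instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: a crude regex stand-in for the tag-stripping step,
// NOT Lucene's HTMLStripCharFilter.
final class TagStripSketch {

    // Replace anything that looks like <...> with a space so that
    // neighbouring words stay separate after the tag is gone.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    // Lower-case the stripped text and split on non-alphanumeric runs,
    // roughly what a simple tokenizer would do after the char filter.
    static List<String> terms(String html) {
        List<String> out = new ArrayList<>();
        String stripped = stripTags(html).toLowerCase(Locale.ROOT);
        for (String t : stripped.split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String stored = "<p>Lucene is <strong>fast</strong></p>";
        // The stored value keeps its markup; only the indexed terms lose it.
        System.out.println(terms(stored)); // prints [lucene, is, fast]
    }
}
```

Because "strong" and "p" never become terms, a query for <strong> has nothing to match in that field, while the stored HTML can still be returned unchanged in the result.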



Re: HTML tags and Lucene highlighting

Posted by okayndc <bo...@gmail.com>.
Hello,

I want to ignore HTML tags within a search.  I should not be able to
search for an HTML tag (e.g. <strong>) and get back the highlighted HTML tag
(e.g. <span class="highlighted"><strong></span>) in a result set.

Thanks


On Thu, Apr 5, 2012 at 3:24 PM, Steven A Rowe <sa...@syr.edu> wrote:

> Hi okayndc,
>
> What *do* you want?
>
> Steve

RE: HTML tags and Lucene highlighting

Posted by Steven A Rowe <sa...@syr.edu>.
Hi okayndc,

What *do* you want?

Steve


Re: HTML tags and Lucene highlighting

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
(12/04/06 2:34), okayndc wrote:
> Hello,
>
> I currently use Lucene version 3.0 and probably need to upgrade to a more
> current version soon.
> The problem I have is that when I test a search for an HTML tag (e.g.
> <strong>), Lucene returns
> the highlighted HTML tag, which is what I DO NOT want.  Is there a way to
> "filter" HTML tags?
> I have read up on HTMLStripCharFilter (packaged with Solr) and wondered if
> this is the way to go?
>
> Any help will be greatly appreciated,
> Thanks
>

There is a way to encode HTML tags:

https://builds.apache.org/job/Lucene-3.x/javadoc/contrib-highlighter/org/apache/lucene/search/highlight/SimpleHTMLEncoder.html
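
As a rough sketch of what encoding means here: an encoder escapes HTML special characters in the highlighted fragment so that a matched tag is displayed as text rather than interpreted as markup. The class `EncodeSketch` below is a hypothetical, stdlib-only stand-in and not SimpleHTMLEncoder itself; with Lucene, the encoder from the link above would be handed to the highlighter rather than written by hand.

```java
// Illustration only: a hand-written escaper standing in for the role an
// encoder plays in the highlighter. Names here are hypothetical.
final class EncodeSketch {

    // Escape the characters that would otherwise be read as HTML markup.
    static String encode(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Wrap the first occurrence of term in a highlight span, encoding
    // the original text on both sides and inside the span.
    static String highlight(String text, String term) {
        int i = text.indexOf(term);
        if (i < 0) {
            return encode(text);
        }
        return encode(text.substring(0, i))
                + "<span class=\"highlighted\">" + encode(term) + "</span>"
                + encode(text.substring(i + term.length()));
    }

    public static void main(String[] args) {
        // The matched tag is shown as literal text inside the span.
        System.out.println(highlight("Use <strong> here", "<strong>"));
    }
}
```

Note that encoding changes how a matched tag is rendered; it does not by itself stop a query for <strong> from matching.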

koji
-- 
Query Log Visualizer for Apache Solr
http://soleami.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org