Posted to solr-user@lucene.apache.org by Andrew May <am...@ingenta.com> on 2006/07/28 21:48:50 UTC

Highlighting problems with HTML tagged fields

Hi,

I'm indexing some content that contains HTML markup, and this seems to throw off the 
highlighting somehow.

Example title field:

<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic 
terrane of NW Iberia

If I search for title:fabrics and turn highlighting on, the highlighted version has the 
<em> tags in the wrong place: 22 characters to the left of where they should be (i.e. the 
sum of the lengths of the tags).
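
The arithmetic can be checked with a minimal sketch. It assumes a naive regex tag stripper (the class OffsetShiftDemo and its methods are hypothetical, not Solr's HTMLStripReader) and only shows how far an offset computed on the stripped text drifts from the same term's offset in the original title:

```java
import java.util.regex.Pattern;

public class OffsetShiftDemo {
    // Naive tag pattern for illustration only; this is NOT Solr's HTMLStripReader.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Removes anything that looks like a tag. */
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    /** How far the term's offset moves left once the tags are stripped. */
    static int shift(String html, String term) {
        return html.indexOf(term) - stripTags(html).indexOf(term);
    }

    public static void main(String[] args) {
        String title = "<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of "
                + "mylonitic fabrics in a polyorogenic terrane of NW Iberia";
        // <SUP> is 5 characters and </SUP> is 6, so two pairs give 22.
        System.out.println(shift(title, "fabrics")); // prints 22
    }
}
```

If the highlighter applies the stripped-text offsets to the stored (tagged) value, the <em> tags land that many characters too early.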

Because I don't want the tags indexed I'm using a modified version of the "text" field 
type that uses the HTMLStripWhitespaceTokenizerFactory instead of the normal 
WhitespaceTokenizerFactory. I've tried using this tokenizer just when indexing, or both 
when indexing and querying, but both do the same thing.

There's no problem if I use the normal WhitespaceTokenizerFactory, but then it's possible 
to search the tags and find matches, which isn't ideal.

The closest thing I can find on the Lucene mailing list is the thread below, which 
seems to suggest that this ought to work?

http://www.gossamer-threads.com/lists/lucene/java-user/14981?search_string=HTML%20strip;#14981

Thanks,

Andrew

Re: [2] Highlighting problems with HTML tagged fields

Posted by nick19701 <to...@yahoo.com>.

Chris Hostetter wrote:
> 
> 
> patches for issues can't be applied until someone who cares about them
> writes them and contributes them for committers to consider/apply :)
> 
> 

It seems I'm one of the very few people who care about this feature :)

Unfortunately my daily languages are C++ and C#. I only know a little bit of
Java. Otherwise I'd contribute.

-- 
View this message in context: http://www.nabble.com/Highlighting-problems-with-HTML-tagged-fields-tf2017260.html#a9365098
Sent from the Solr - User mailing list archive at Nabble.com.


Re: [2] Highlighting problems with HTML tagged fields

Posted by Chris Hostetter <ho...@fucit.org>.
: The suggested fix from Mirko seems very simple. Hopefully a patch will be
: applied very soon. In the meantime, I'll use my backup solution:

Patches for issues can't be applied until someone who cares about them
writes them and contributes them for committers to consider/apply :)

-Hoss


Re: [2] Highlighting problems with HTML tagged fields

Posted by nick19701 <to...@yahoo.com>.

Chris Hostetter wrote:
> 
> 
> It is tracked in http://issues.apache.org/jira/browse/SOLR-42
> 
> ...there are currently no patches.
> 
> 

The suggested fix from Mirko seems very simple. Hopefully a patch will be
applied very soon. In the meantime, I'll use my backup solution:
http://fucoder.com/code/se-hilite/




Re: [2] Highlighting problems with HTML tagged fields

Posted by Chris Hostetter <ho...@fucit.org>.
It is tracked in http://issues.apache.org/jira/browse/SOLR-42

...there are currently no patches.


: Date: Tue, 6 Mar 2007 15:04:25 -0800 (PST)
: From: nick19701 <to...@yahoo.com>
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Re: [2] Highlighting problems with HTML tagged fields
:
:
:
: Yonik Seeley wrote:
: >
: > HTMLStripWhitespaceTokenizerFactory works in two phases...
: > HTMLStripReader removes the HTML and passes the result to
: > WhitespaceTokenizer... at that point, Tokens are generated, but the
: > offsets will correspond to the text after HTML removal, not before.
: >
: > I did it this way so that HTMLStripReader  could go before any
: > tokenizer (like StandardTokenizer).
: >
: > Can you open a JIRA bug for this?  The fix would be a special version
: > of HTMLStripReader integrated with a WhitespaceTokenizer to keep
: > offsets correct.
: >
: > -Yonik
: >
: >
: Is there a fix for this problem?
:
: My copy of Solr is dated 12/17/2006. HTMLStripWhitespaceTokenizerFactory +
: highlighting still doesn't work: the wrong items are highlighted.
:



-Hoss


Re: [2] Highlighting problems with HTML tagged fields

Posted by nick19701 <to...@yahoo.com>.

Yonik Seeley wrote:
> 
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> 
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> 
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct.
> 
> -Yonik
> 
> 
Is there a fix for this problem?

My copy of Solr is dated 12/17/2006. HTMLStripWhitespaceTokenizerFactory +
highlighting still doesn't work: the wrong items are highlighted.


Re: Highlighting problems with HTML tagged fields

Posted by Yonik Seeley <yo...@apache.org>.
On 7/28/06, Andrew May <am...@ingenta.com> wrote:
> Because I don't want the tags indexed I'm using a modified version of the "text" field
> type that uses the HTMLStripWhitespaceTokenizerFactory instead of the normal
> WhitespaceTokenizerFactory.

HTMLStripWhitespaceTokenizerFactory works in two phases...
HTMLStripReader removes the HTML and passes the result to
WhitespaceTokenizer... at that point, Tokens are generated, but the
offsets will correspond to the text after HTML removal, not before.

I did it this way so that HTMLStripReader  could go before any
tokenizer (like StandardTokenizer).

Can you open a JIRA bug for this?  The fix would be a special version
of HTMLStripReader integrated with a WhitespaceTokenizer to keep
offsets correct.
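
As a rough illustration of that idea (not the actual Solr fix, and not built on Lucene classes), the sketch below strips tags and splits on whitespace in a single pass, so that each token remembers where it sits in the original, tagged text. The class OffsetAwareStripTokenizer and its Token type are hypothetical names, and the tag handling is deliberately naive: anything between '<' and '>' is skipped, with no support for entities, comments, or scripts.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetAwareStripTokenizer {

    /** A token plus its start/end offsets in the ORIGINAL (tagged) text. */
    public static final class Token {
        public final String text;
        public final int start; // inclusive
        public final int end;   // exclusive

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Skips tags, splits on whitespace, and keeps original offsets. */
    public static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int tokenStart = -1;
        int tokenEnd = -1;
        boolean inTag = false;

        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {                       // inside <...>: drop the character
                if (c == '>') inTag = false;
                continue;
            }
            if (c == '<') {                    // tag starts: emit nothing
                inTag = true;
                continue;
            }
            if (Character.isWhitespace(c)) {   // whitespace closes the current token
                if (current.length() > 0) {
                    tokens.add(new Token(current.toString(), tokenStart, tokenEnd));
                    current.setLength(0);
                }
                continue;
            }
            if (current.length() == 0) tokenStart = i; // first kept character
            current.append(c);
            tokenEnd = i + 1;                  // offset just past the last kept character
        }
        if (current.length() > 0) {
            tokens.add(new Token(current.toString(), tokenStart, tokenEnd));
        }
        return tokens;
    }
}
```

With the title from the original post, the token "fabrics" comes back with offsets that index directly into the tagged string, so a highlighter working on the stored value would wrap the right characters. A token that spans tags (such as "40Ar/39Ar") gets a start and end that enclose the tags between its pieces.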

-Yonik