Posted to java-user@lucene.apache.org by sandeep chawla <sa...@gmail.com> on 2007/07/02 18:32:01 UTC

highlighting phrase query

Hi All,

I am developing a search tool using Lucene 2.1.

I have a requirement to highlight query words in the results, but the
Lucene highlighter in 2.1 doesn't work well for phrase queries.

For example, if I have the query string "lucene java", it highlights
not only occurrences of the phrase "lucene java" but also standalone
occurrences of "lucene" and "java" in the text.

I think this is a known problem. Is this issue solved in Lucene 2.2?
My application is almost complete, and I would really rather not switch
to Lucene 2.2.

I went through previous posts but couldn't find a solution to
this problem. There are some alternative highlighters, but they seem
to be unstable and still evolving.

I am looking for a standard and stable API for this purpose.

I'd appreciate any thoughts or guidance on this issue.

Thanks
Sandeep

-- 
SANDEEP CHAWLA
House No- 23
10th main
BTM 1st  Stage
Bangalore Mobile: 91-9986150603

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: highlighting phrase query

Posted by Mark Miller <ma...@gmail.com>.
>
> Has anyone used LUCENE-794? How stable is it? Is it widely used in
> industry?
>
>
I have used it extensively and I would say it is extremely stable. As I
said, much of its code is literally the same compiled code as the
Contrib Highlighter (it is really just a new Scorer class for the
Contrib Highlighter).

It is also the newest of the Highlighter issues in JIRA, and I would say
it is not even close to widely used in the industry.


Re: highlighting phrase query

Posted by sandeep chawla <sa...@gmail.com>.
Thanks a lot Mark,

Has anyone used LUCENE-794? How stable is it? Is it widely used in industry?

These are some of my questions :)

Thanks
Sandeep

On 03/07/07, Renaud Waldura <re...@library.ucsf.edu> wrote:
> Mark:
>
> Thanks a million for this comprehensive analysis. This is going straight to
> my manager. :)
>
> --Renaud



RE: highlighting phrase query

Posted by Renaud Waldura <re...@library.ucsf.edu>.
Mark:

Thanks a million for this comprehensive analysis. This is going straight to
my manager. :)

--Renaud
 

-----Original Message-----
From: Mark Miller [mailto:markrmiller@gmail.com] 
Sent: Monday, July 02, 2007 2:11 PM
To: java-user@lucene.apache.org
Subject: Re: highlighting phrase query

There has been a lot of Highlighter discussion lately, but just to try and
sum up the state of Highlighting in the Lucene world:

There are four Highlighter implementations that I know of. From what I can
tell, only the original Contrib Highlighter has received sustained active
development by more than one individual.

Contrib... [snip]





Re: highlighting phrase query

Posted by Mark Miller <ma...@gmail.com>.
There has been a lot of Highlighter discussion lately, but just to try 
and sum up the state of Highlighting in the Lucene world:

There are four Highlighter implementations that I know of. From what I 
can tell, only the original Contrib Highlighter has received sustained 
active development by more than one individual.

Contrib Highlighter:
The Contrib Highlighter supports the widest array of analyzers and
corner cases and has had the widest exposure. It is generally slower on
larger documents, both because you must re-analyze the text and because
it supports a wider variety of use cases: the TokenGroup handling for
token overlaps and the inspection of every term for fragmentation
contribute to a huge performance drain on large documents. This
highlighter does not support highlighting based on position, so all
terms from the query will be highlighted in the text, whether or not
they occur as part of a matching phrase. You can avoid some of the cost
of re-analyzing by using the TokenSources class to rebuild a TokenStream
from stored offsets and/or positions, but this is unlikely to be faster
unless you are using very large documents with a complex analyzer.
Getting and sorting offsets/positions is relatively slow, and for smaller
docs it is faster to just re-analyze.
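
As a rough illustration of the position-blind behaviour described above,
here is a minimal sketch in plain Java. It does not use the Lucene API:
the class TermHighlightSketch, the method highlightTerms, and the
split-on-space "analysis" are all assumptions made up for this example.

```java
import java.util.HashSet;
import java.util.Set;

final class TermHighlightSketch {

    // Hypothetical helper: wraps every token that equals any query term in <B> tags,
    // without looking at where the token sits relative to the other query terms.
    static String highlightTerms(String text, String... queryTerms) {
        Set<String> terms = new HashSet<>();
        for (String t : queryTerms) {
            terms.add(t.toLowerCase());
        }
        StringBuilder out = new StringBuilder();
        String[] tokens = text.split(" "); // stand-in for a real analyzer
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            if (terms.contains(tokens[i].toLowerCase())) {
                out.append("<B>").append(tokens[i]).append("</B>");
            } else {
                out.append(tokens[i]);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Both the standalone terms and the phrase occurrence get marked.
        System.out.println(
            highlightTerms("lucene is not java but lucene java is", "lucene", "java"));
    }
}
```

With the query terms "lucene" and "java", the standalone words are marked
just like the adjacent pair, which is the behaviour the original question
complains about.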

LUCENE-403:
I have not spent a lot of time with this approach, but it is similar to 
the Contrib Highlighter approach. It almost certainly does not cover as 
many odd corner cases as Contrib Highlighter and the framework is 
lacking, but it does add some support for proper PhraseQuery 
highlighting by implementing some custom PhraseQuery search logic. 
Because LUCENE-403 is not as rigorous as the Contrib Highlighter, it may 
well be a bit faster. The author claims that HTML tags will not be 
broken when fragmenting.

LUCENE-644:
This Highlighter approach requires that you have stored term offsets in 
the index. This Highlighter can be very fast if you are using a 
complicated analyzer since there is no need for re-analyzing the text 
(due to the stored offsets). Also, rather than scoring every term like
the Contrib Highlighter, only terms from the query are effectively
"handled". For smaller documents and simpler analyzers there is not much
speed improvement over the Contrib Highlighter (due to the time it takes
to retrieve and sort offsets), but for larger documents, especially
with more complex analyzers, this Highlighter can be extremely fast.
Again, positional highlighting for Phrase and Span queries is not
supported.

The biggest reason this implementation performs so well is that it deals
with the text in much bigger chunks. Contrib Highlighter can also avoid
re-analyzing by storing offsets and positions, but then it scores the
document and rebuilds the text one token at a time using the
performance-draining TokenGroup (which helps cover some of those corner
cases). This is very slow on very large documents.
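
The "bigger chunks" idea can be sketched roughly as follows. This is not
the LUCENE-644 code, only a plain-Java illustration under assumptions:
the class OffsetHighlightSketch and its Offset type are hypothetical
stand-ins for term offsets that would have been stored in the index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class OffsetHighlightSketch {

    // Hypothetical stand-in for one stored term offset: [start, end) in characters.
    static final class Offset {
        final int start;
        final int end;

        Offset(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Sorts the offsets of the query terms, then copies the untouched text between
    // matches as whole chunks instead of re-emitting the document token by token.
    static String highlight(String text, List<Offset> matches) {
        List<Offset> sorted = new ArrayList<>(matches);
        sorted.sort(Comparator.comparingInt((Offset o) -> o.start));
        StringBuilder out = new StringBuilder(text.length() + sorted.size() * 7);
        int cursor = 0;
        for (Offset o : sorted) {
            if (o.start < cursor) {
                continue; // overlapping offsets are simply skipped in this sketch
            }
            out.append(text, cursor, o.start); // unmodified chunk before the match
            out.append("<B>").append(text, o.start, o.end).append("</B>");
            cursor = o.end;
        }
        out.append(text, cursor, text.length()); // trailing unmodified chunk
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "lucene in action covers lucene java";
        List<Offset> offsets = new ArrayList<>();
        offsets.add(new Offset(31, 35)); // "java"
        offsets.add(new Offset(0, 6));   // first "lucene"
        offsets.add(new Offset(24, 30)); // second "lucene"
        System.out.println(highlight(text, offsets));
    }
}
```

Note that the sort is part of the cost: as mentioned above, retrieving and
sorting offsets is what eats the gain on small documents.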

LUCENE-794:
This approach extends the Contrib Highlighter to support highlighting
Span and Phrase queries. The approach used for non-position-sensitive
Query clauses is the same as in the Contrib Highlighter, and if you use
the latest CachingTokenFilter the speed is roughly the same.
Position-sensitive Query clauses are a bit slower, as a MemoryIndex is
used to retrieve the correct positions to highlight. This gives exact
highlighting without reimplementing search logic. Also, all of the use
cases and corner cases that have been solved for the Contrib Highlighter
are retained. All of the deficiencies of the Contrib Highlighter (slower
on large docs) are also retained. The majority of the code for this
comes from the Contrib Highlighter -- it uses the Contrib Highlighter
framework. This points out a plus for the Contrib Highlighter setup:
it allows for an extension like this, while LUCENE-644 could not easily
be expanded to handle position-sensitive queries.
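
To make "position-sensitive" concrete, here is a minimal plain-Java
sketch of the intended result. It is deliberately not how LUCENE-794
works: that patch gets the matching positions from a MemoryIndex, while
this sketch swaps in a naive exact-phrase scan over space-separated
tokens. The class PhraseHighlightSketch and the method highlightPhrase
are hypothetical names.

```java
final class PhraseHighlightSketch {

    // Marks tokens only where all phrase terms occur at consecutive positions, in order.
    static String highlightPhrase(String text, String... phrase) {
        String[] tokens = text.split(" "); // stand-in for a real analyzer
        boolean[] hit = new boolean[tokens.length];
        for (int i = 0; i + phrase.length <= tokens.length; i++) {
            boolean match = phrase.length > 0;
            for (int j = 0; j < phrase.length && match; j++) {
                match = tokens[i + j].equalsIgnoreCase(phrase[j]);
            }
            if (match) {
                for (int j = 0; j < phrase.length; j++) {
                    hit[i + j] = true; // remember the positions that belong to a phrase hit
                }
            }
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            if (hit[i]) {
                out.append("<B>").append(tokens[i]).append("</B>");
            } else {
                out.append(tokens[i]);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Only the adjacent "lucene java" pair is marked; the standalone terms are not.
        System.out.println(
            highlightPhrase("lucene is not java but lucene java is", "lucene", "java"));
    }
}
```

A hand-rolled scan like this only covers exact phrases; reusing the real
query logic is what lets the patch handle slop and Span queries without
reimplementing search.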


There has been some discussion of getting Lucene to identify correct 
highlights as the search is processed. I am not very optimistic that 
this will be fruitful, but those who are discussing it know more
about this than I do.

- Mark

