You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by sandeep chawla <sa...@gmail.com> on 2007/07/02 18:32:01 UTC
highlighting phrase query
Hi All,
I am developing a search tool using lucene. I am using lucene 2.1.
i have a requirement to highlight query words in the results.
.Lucene-highlighter 2.1 doesn't work well in highlighting phase query.
For example - if i have a query string "lucene Java" .It highlights
not only occurrences of "lucene java" but occurrences of lucene and
java too in the text.
I think, this is a known problem..is this issue solved in lucene 2.2.
well my application is almost complete and i really don't wanna switch
to lucene 2.2.
I was going through previous posts but i couldn't find a solution of
this problem. There r some alternate highlighter s but it seems, they
r not stable and still in evolution phase.
I am looking for a standard n stable API for this purpose..
I'd appreciate any thoughts/guidance in this issue.
Thanks
Sandeep
--
SANDEEP CHAWLA
House No- 23
10th main
BTM 1st Stage
Bangalore Mobile: 91-9986150603
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: highlighting phrase query
Posted by Mark Miller <ma...@gmail.com>.
>
> has any one used Lucene-794? how stable it it. is it widely used in
> industry.
>
>
I have used it extensively and I would say it is extremely stable. As I
said, much of the code from it is literally the same compiled code from
Contrib Highlighter (It is really just a new Scorer class for the
Contrib Highlighter).
It is also the newest of the Highlighter issues in JIRA and I would say
it is not even close to widely used in the industry.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: highlighting phrase query
Posted by sandeep chawla <sa...@gmail.com>.
Thanks a lot Mark,
has any one used Lucene-794? how stable it it. is it widely used in industry.
These are some of my questions :)
Thanks
Sandeep
On 03/07/07, Renaud Waldura <re...@library.ucsf.edu> wrote:
> Mark:
>
> Thanks a million for this comprehensive analysis. This is going straight to
> my manager. :)
>
> --Renaud
>
>
> -----Original Message-----
> From: Mark Miller [mailto:markrmiller@gmail.com]
> Sent: Monday, July 02, 2007 2:11 PM
> To: java-user@lucene.apache.org
> Subject: Re: highlighting phrase query
>
> There has been a lot of Highlighter discussion lately, but just to try and
> sum up the state of Highlighting in the Lucene world:
>
> There are four Highlighter implementations that I know of. From what I can
> tell, only the original Contrib Highlighter has received sustained active
> development by more than one individual.
>
> Contrib... [snip]
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
--
SANDEEP CHAWLA
House No- 23
10th main
BTM 1st Stage
Bangalore Mobile: 91-9986150603
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
RE: highlighting phrase query
Posted by Renaud Waldura <re...@library.ucsf.edu>.
Mark:
Thanks a million for this comprehensive analysis. This is going straight to
my manager. :)
--Renaud
-----Original Message-----
From: Mark Miller [mailto:markrmiller@gmail.com]
Sent: Monday, July 02, 2007 2:11 PM
To: java-user@lucene.apache.org
Subject: Re: highlighting phrase query
There has been a lot of Highlighter discussion lately, but just to try and
sum up the state of Highlighting in the Lucene world:
There are four Highlighter implementations that I know of. From what I can
tell, only the original Contrib Highlighter has received sustained active
development by more than one individual.
Contrib... [snip]
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: highlighting phrase query
Posted by Mark Miller <ma...@gmail.com>.
There has been a lot of Highlighter discussion lately, but just to try
and sum up the state of Highlighting in the Lucene world:
There are four Highlighter implementations that I know of. From what I
can tell, only the original Contrib Highlighter has received sustained
active development by more than one individual.
Contrib Highlighter:
The Contrib Highlighter supports the widest array of analyzers and
corner cases and has had the widest exposure. It is generally slower on
larger documents due to the requirement that you re-analyze text and to
support a wider variety of use cases -- the TokenGroup for token
overlaps and inspecting every term for Fragmentation contribute to a
huge performance drain on large documents. This highlighter does not
support highlighting based on position and all terms from the query will
be highlighted in the text. You can avoid some of the cost of
re-analyzing by using the TokenSources class to rebuild a TokenStream
using stored offsets and/or positions, but this is unlikely to be faster
unless you are using very large documents with a complex analyzer.
Getting and sorting offsets/positions is relatively slow and for smaller
docs it is faster to just re-analyze.
LUCENE-403:
I have not spent a lot of time with this approach, but it is similar to
the Contrib Highlighter approach. It almost certainly does not cover as
many odd corner cases as Contrib Highlighter and the framework is
lacking, but it does add some support for proper PhraseQuery
highlighting by implementing some custom PhraseQuery search logic.
Because LUCENE-403 is not as rigorous as the Contrib Highlighter, it may
well be a bit faster. The author claims that HTML tags will not be
broken when fragmenting.
LUCENE-644:
This Highlighter approach requires that you have stored term offsets in
the index. This Highlighter can be very fast if you are using a
complicated analyzer since there is no need for re-analyzing the text
(due to the stored offsets). Also, rather then scoring every term like
the Contrib Highlighter, only terms from the query are effectively
"handled". For smaller documents and simpler analyzers there is not much
speed improvement over the Contrib Highlighter (due to the time it takes
to retrieve and sort offsets), but for larger documents , especially
with more complex analyzers, this Highlighter can be extremely fast.
Again, positional highlighting for Phrase and Span queries is not
supported.
The biggest reason this implementation performs so well is that it deals
with the text in much bigger chunks. Contrib Highlighter can also avoid
re-analyzing by storing offsets and positions, but then it scores the
document and rebuilds the text one token at a time using the performance
draining TokenGroup (which helps cover some of those corner cases). This
is very slow on very large documents.
LUCENE-794:
This approach extends the Contrib Highlighter to support Highlighting
Span and Phrase queries. The approach used for non position sensitive
Query clauses is the same as the Contrib Highlighter, and if you use the
latest CachingTokenFilter the speed is roughly about the same. Position
sensitive Query clauses are a bit slower as a MemoryIndex is used to
retrieve the correct positions to Highlight. This gives exact
highlighting without reimplementing search logic. Also, all of the use
cases and corner cases that have been solved for the Contrib Highlighter
are retained. All of the deficiencies of the Contrib Highlighter (slower
on large docs) are also retained. The majority of the code for this
comes from the Contrib Highlighter -- it uses the Contrib Highlighter
framework. Which points out a plus for the Contrib Highlighter setup --
it allows for an extension like this, while LUCENE-644 could not easily
be expanded to handle position sensitive queries.
There has been some discussion of getting Lucene to identify correct
highlights as the search is processed. I am not very optimistic that
this will be fruitful, but those that are discussing it know more more
about this than I do.
- Mark
sandeep chawla wrote:
> Hi All,
>
> I am developing a search tool using lucene. I am using lucene 2.1.
>
> i have a requirement to highlight query words in the results.
> .Lucene-highlighter 2.1 doesn't work well in highlighting phase query.
>
> For example - if i have a query string "lucene Java" .It highlights
> not only occurrences of "lucene java" but occurrences of lucene and
> java too in the text.
>
> I think, this is a known problem..is this issue solved in lucene 2.2.
> well my application is almost complete and i really don't wanna switch
> to lucene 2.2.
>
> I was going through previous posts but i couldn't find a solution of
> this problem. There r some alternate highlighter s but it seems, they
> r not stable and still in evolution phase.
>
> I am looking for a standard n stable API for this purpose..
>
> I'd appreciate any thoughts/guidance in this issue.
>
> Thanks
> Sandeep
>
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org