Posted to dev@lucene.apache.org by Doug Cutting <cu...@apache.org> on 2005/05/06 21:04:25 UTC
multi-field highlighting
There's a post over at SearchEngineWatch theorizing about how Google
produces summaries.
http://forums.searchenginewatch.com/showthread.php?threadid=5448
Lucene's current highlighter doesn't easily support multi-fields, nor
does it take phrasal matching into account. It might be useful to have
a highlighter API that takes a Document and summarizes all of its fields,
incorporating their boosts in fragment scores. Thoughts?
Doug
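
A rough sketch of what such an API might compute, in plain Java with no Lucene dependency. Everything here is an assumption made for illustration: the names MultiFieldSummarizer, Fragment and bestFragments, the whitespace tokenization, the fixed-size word windows, and the scoring rule (distinct query terms in the fragment times the field boost). It is not an existing Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names exist in Lucene.
final class MultiFieldSummarizer {

    /** A candidate fragment of one field, with its boost-weighted score. */
    static final class Fragment {
        final String field;
        final String text;
        final float score;

        Fragment(String field, String text, float score) {
            this.field = field;
            this.text = text;
            this.score = score;
        }
    }

    /**
     * Splits each field into windows of `size` words, scores each window as
     * (distinct query terms present) * (field boost), and returns up to
     * `max` of the highest-scoring windows across all fields.
     * Fields without an entry in `boosts` get a boost of 1.0.
     */
    static List<Fragment> bestFragments(Map<String, String> fields,
                                        Map<String, Float> boosts,
                                        Set<String> queryTerms,
                                        int size, int max) {
        List<Fragment> candidates = new ArrayList<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            float boost = boosts.getOrDefault(e.getKey(), 1.0f);
            String[] words = e.getValue().trim().split("\\s+");
            for (int start = 0; start < words.length; start += size) {
                int end = Math.min(start + size, words.length);
                Set<String> hits = new HashSet<>();
                for (int i = start; i < end; i++) {
                    String w = words[i].toLowerCase(Locale.ROOT);
                    if (queryTerms.contains(w)) hits.add(w);
                }
                // Windows with no query term are never summary candidates.
                if (!hits.isEmpty()) {
                    String text = String.join(" ", Arrays.copyOfRange(words, start, end));
                    candidates.add(new Fragment(e.getKey(), text, hits.size() * boost));
                }
            }
        }
        // Highest score first; ties keep their original field order.
        candidates.sort(Comparator.comparingDouble((Fragment f) -> f.score).reversed());
        return candidates.subList(0, Math.min(max, candidates.size()));
    }
}
```

Query terms are expected in lower case. Because the boost multiplies the fragment score, a single hit in a heavily boosted field such as a title can outrank a fragment with more hits in the body.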
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: multi-field highlighting
Posted by markharw00d <ma...@yahoo.co.uk>.
Doug Cutting wrote:
> Shouldn't the search code already take care of that?
No, the search may return documents that happen to contain "Doug
Cutting" and Google - the current highlighter implementation uses all
query terms (ignoring any AND/OR() operators) and looks for matches.
Ideally "Doug Cutting" shouldn't be highlighted in the document "Doug
Cutting loves google" when I searched for ("Doug Cutting" AND lucene) OR
google.
This is a nice-to-have and I suspect this is not an issue people feel
strongly about. We could continue to ignore the complexities of
representing the results of such boolean logic - most queries don't use
it anyway.
> The query should thus be compared to each potential highlight
> fragment. This evaluation is different than the whole-document
> evaluation performed by search. If no fragments match the entire
> query, then fragments should be selected which, considered together,
> match the entire query.
Is this based on the approach (I think you suggested before now) to chop
the doc into fragment-sized docs held in a RAM directory and then query
it to get the best fragments? I think it would prove difficult to
identify the combination of fragments that ultimately satisfied a query
which contained complex boolean logic.
My original idea for an approach was to let the queries initially
generate a "heat map" which scored every token in the document. Any
boolean queries which failed to be satisfied completely (e.g. the Doug AND
lucene example) would not generate a score for their tokens. Phrase
queries would only score the token occurrences in the document where all
tokens were grouped.
The highlighter would then use the heat map to pick the best "runs" of
tokens.
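
A minimal sketch of the heat map idea in plain Java, assuming a tiny hand-built query tree rather than Lucene's Query classes. The names HeatMapSketch, Node, phrase, and, or and hottestRun are hypothetical, as are the choices to give each matched token a score of 1 and to pick a fixed-width run.

```java
// Hypothetical sketch of the "heat map" idea; not Lucene code.
final class HeatMapSketch {

    /** A query node that assigns a score to every token position. */
    interface Node {
        float[] heat(String[] tokens);
    }

    /** Scores only occurrences where all words appear consecutively (one word = a term). */
    static Node phrase(String... words) {
        return tokens -> {
            float[] heat = new float[tokens.length];
            for (int i = 0; i + words.length <= tokens.length; i++) {
                boolean match = true;
                for (int j = 0; j < words.length && match; j++) {
                    match = tokens[i + j].equalsIgnoreCase(words[j]);
                }
                if (match) {
                    for (int j = 0; j < words.length; j++) heat[i + j] = 1f;
                }
            }
            return heat;
        };
    }

    /** AND: contributes nothing unless every clause matched somewhere. */
    static Node and(Node... clauses) {
        return tokens -> {
            float[] total = new float[tokens.length];
            for (Node clause : clauses) {
                float[] h = clause.heat(tokens);
                if (isCold(h)) return new float[tokens.length];
                add(total, h);
            }
            return total;
        };
    }

    /** OR: sums whatever its clauses contribute. */
    static Node or(Node... clauses) {
        return tokens -> {
            float[] total = new float[tokens.length];
            for (Node clause : clauses) add(total, clause.heat(tokens));
            return total;
        };
    }

    private static boolean isCold(float[] heat) {
        for (float h : heat) if (h > 0f) return false;
        return true;
    }

    private static void add(float[] into, float[] from) {
        for (int i = 0; i < into.length; i++) into[i] += from[i];
    }

    /** Returns the start index of the hottest window of `width` tokens (0 if none fits). */
    static int hottestRun(float[] heat, int width) {
        int best = 0;
        float bestSum = -1f;
        for (int start = 0; start + width <= heat.length; start++) {
            float sum = 0f;
            for (int i = start; i < start + width; i++) sum += heat[i];
            if (sum > bestSum) { bestSum = sum; best = start; }
        }
        return best;
    }
}
```

For the query ("Doug Cutting" AND lucene) OR google against the tokens of "Doug Cutting loves google", the AND clause stays cold because lucene is absent, so only the google token carries any heat and the hottest two-token run is "loves google".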
Re: multi-field highlighting
Posted by Doug Cutting <cu...@apache.org>.
markharw00d wrote:
> Before we leap into adding code into the highlighter though I think it's
> worth considering what we are trying to fix here in a more general sense.
> As a basic principle I think highlighting should attempt to show the
> user what the search engine saw as important in the document.
> With that principle in mind I should really make sure that if I search for:
> ("Doug Cutting" AND lucene) OR google
>
> I shouldn't highlight "Doug Cutting" in a matching document that has
> google but not lucene.
Shouldn't the search code already take care of that? That said, for a
document that contains both "Doug Cutting loves Lucene" and "Doug
Cutting loves Google", ideally a highlighter should prefer "Doug Cutting
loves Google". The query should thus be compared to each potential
highlight fragment. This evaluation is different than the
whole-document evaluation performed by search. If no fragments match
the entire query, then fragments should be selected which, considered
together, match the entire query.
Doug
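
One way the selection described above could be sketched, in plain Java. This is only an illustration under stated assumptions: the query is modelled as a java.util.function.Predicate over the set of lower-cased words in a piece of text (so phrases and slop are ignored), and the names FragmentSelector, termsOf and select are hypothetical. The fallback uses a greedy cover, which is a simple heuristic and not necessarily the smallest possible set of fragments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch; the query is reduced to a predicate over a bag of words.
final class FragmentSelector {

    /** Lower-cased words of a piece of text. */
    static Set<String> termsOf(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\W+")));
    }

    /**
     * Returns a single fragment that satisfies the whole query if one exists.
     * Otherwise greedily adds the fragment contributing the most uncovered
     * query terms until the chosen fragments, taken together, satisfy it.
     */
    static List<String> select(List<String> fragments,
                               Set<String> queryTerms,
                               Predicate<Set<String>> query) {
        // Pass 1: evaluate the query against each fragment on its own.
        for (String f : fragments) {
            if (query.test(termsOf(f))) return Collections.singletonList(f);
        }
        // Pass 2: build a combination of fragments.
        List<String> chosen = new ArrayList<>();
        Set<String> covered = new HashSet<>();
        while (!query.test(covered)) {
            String best = null;
            int bestGain = 0;
            for (String f : fragments) {
                Set<String> gain = termsOf(f);
                gain.retainAll(queryTerms);
                gain.removeAll(covered);
                if (gain.size() > bestGain) {
                    bestGain = gain.size();
                    best = f;
                }
            }
            if (best == null) break; // no fragment adds a new query term
            chosen.add(best);
            Set<String> added = termsOf(best);
            added.retainAll(queryTerms);
            covered.addAll(added);
        }
        return chosen;
    }
}
```

The loop always terminates because each round must cover at least one new query term. If the fragments can never satisfy the query, the method returns the best partial cover it found.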
Re: multi-field highlighting
Posted by markharw00d <ma...@yahoo.co.uk>.
Phrase highlighting (and spans) would certainly be useful, as would
multi-field.
Before we leap into adding code into the highlighter though I think it's
worth considering what we are trying to fix here in a more general sense.
As a basic principle I think highlighting should attempt to show the
user what the search engine saw as important in the document.
With that principle in mind I should really make sure that if I search for:
("Doug Cutting" AND lucene) OR google
I shouldn't highlight "Doug Cutting" in a matching document that has
google but not lucene.
If we are going to try to be true to representing the query logic in our
display we end up having to re-implement a lot of the query logic in
the highlighter, e.g. taking account of slop factors etc.
We could avoid over-complicating the highlighter in this way if the
different queries could provide information of use in highlighting - a
variant of the "explain" function that would describe not only the
scoring but the sections of the document to which these scores relate.
Does this approach sound feasible?
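
To make the idea concrete, here is a sketch of what such an "explain" variant might return, in plain Java. Lucene's existing Explanation reports a value, a description and nested details. The class below is a hypothetical stand-in that additionally records a field name and character offsets. The names RegionExplanation, match, combine, regions and markUp are assumptions, and markUp assumes the recorded regions do not overlap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: an explanation tree whose leaves point at document regions.
final class RegionExplanation {
    final String description;
    final float score;
    final String field;  // null for nodes that only combine their details
    final int start;     // character offset of the matched region
    final int end;       // exclusive end offset
    final List<RegionExplanation> details;

    private RegionExplanation(String description, float score, String field,
                              int start, int end, List<RegionExplanation> details) {
        this.description = description;
        this.score = score;
        this.field = field;
        this.start = start;
        this.end = end;
        this.details = details;
    }

    /** A leaf: this part of the query matched text[start, end) in `field`. */
    static RegionExplanation match(String description, float score,
                                   String field, int start, int end) {
        return new RegionExplanation(description, score, field, start, end,
                                     new ArrayList<>());
    }

    /** An inner node, e.g. a boolean query summing its matching clauses. */
    static RegionExplanation combine(String description, float score,
                                     RegionExplanation... details) {
        return new RegionExplanation(description, score, null, -1, -1,
                                     Arrays.asList(details));
    }

    /** All matched regions below this node, in depth-first order. */
    List<RegionExplanation> regions() {
        List<RegionExplanation> out = new ArrayList<>();
        collect(this, out);
        return out;
    }

    private static void collect(RegionExplanation node, List<RegionExplanation> out) {
        if (node.field != null) out.add(node);
        for (RegionExplanation d : node.details) collect(d, out);
    }

    /** Wraps each region of `field` in <b> tags; assumes regions do not overlap. */
    String markUp(String field, String text) {
        List<RegionExplanation> hits = new ArrayList<>();
        for (RegionExplanation r : regions()) {
            if (field.equals(r.field)) hits.add(r);
        }
        // Work right to left so earlier offsets stay valid after each insert.
        hits.sort((a, b) -> Integer.compare(b.start, a.start));
        StringBuilder out = new StringBuilder(text);
        for (RegionExplanation r : hits) {
            out.insert(r.end, "</b>").insert(r.start, "<b>");
        }
        return out.toString();
    }
}
```

With something like this, a query that failed to match (the "Doug Cutting" AND lucene clause, say) would simply contribute no regions, and the highlighter would not need to re-implement any query logic itself.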
> There's a post over at SearchEngineWatch theorizing about how Google
> produces summaries.
>
> http://forums.searchenginewatch.com/showthread.php?threadid=5448
>
> Lucene's current highlighter doesn't easily support multi-fields, nor
> does it take phrasal matching into account. It might be useful to
> have a highlighter API that takes a Document and summarizes all of its
> fields, incorporating their boosts in fragment scores. Thoughts?
>
> Doug
Re: multi-field highlighting
Posted by Martin Haye <m1...@snyder-haye.com>.
Hi Mark,
Thanks very much for your comments.
> It looks like all queries on the site have to be a span of some
> kind (ie all search terms must appear in the document). Is your
> highlighting code applicable to other modes of querying?
In a sense you're correct, in that the system relies on all leaf queries to be span queries. Doug's original span code was already set up to report the positional information of hits, while a normal TermQuery for instance doesn't even iterate the positions. I also added a couple new span queries to fill out the set: SpanRangeQuery and SpanWildcardQuery.
For complex queries within one field, they're joined with the normal span operators -- SpanOrQuery, SpanNearQuery, SpanNotQuery, etc. To form a single query across multiple fields, it's the usual (non-span-oriented) BooleanQuery.
So to answer your question, the system isn't applicable to queries that don't use spans. But since we want to highlight based on positional data, that seemed only right.
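
To illustrate why positional data matters here, a small sketch in plain Java of how a span can be derived from term positions. This is not the XTF or Lucene span code. The names SpanSketch, Span, positions and near are hypothetical, and the slop rule used (at most `slop` tokens between the two terms, first term before second) is a simplification that may differ from SpanNearQuery's exact semantics.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of span matching over token positions; not Lucene code.
final class SpanSketch {

    /** A matched extent of tokens: [start, end). */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Positions at which `term` occurs, which is what a leaf span query reports. */
    static List<Integer> positions(String[] tokens, String term) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equalsIgnoreCase(term)) out.add(i);
        }
        return out;
    }

    /**
     * Ordered "near" over two terms: emits a span for every pair where the
     * second term follows the first with at most `slop` tokens in between.
     */
    static List<Span> near(List<Integer> first, List<Integer> second, int slop) {
        List<Span> spans = new ArrayList<>();
        for (int p1 : first) {
            for (int p2 : second) {
                if (p2 > p1 && p2 - p1 - 1 <= slop) {
                    spans.add(new Span(p1, p2 + 1));
                }
            }
        }
        return spans;
    }
}
```

Each resulting span gives the highlighter both the extent of the hit (start to end) and, through the leaf positions, the individual terms inside it. A plain term match that never iterates positions cannot supply either.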
Someone mentioned a while back modifying the query parser to (optionally?) produce span queries. This might be another good reason to do so... or are you suggesting something else?
--Martin
Re: multi-field highlighting
Posted by markharw00d <ma...@yahoo.co.uk>.
Hi Martin, welcome to the group.
>>You can see it in action here:
Very nice work! I like the forward/backward links between hits.
>Unfortunately, it involves significant additions to the Lucene core. In essence it relies on an amped-up span system that is capable of scoring the spans, as well as recording which spans matched for each document field.
>
>
It looks like all queries on the site have to be a span of some kind (ie
all search terms must appear in the document). Is your highlighting code
applicable to other modes of querying?
Cheers,
Mark
Re: multi-field highlighting
Posted by Martin Haye <m1...@snyder-haye.com>.
As part of my work on XTF for the California Digital Library, I've written such a highlighter. You can see it in action here:
http://texts.cdlib.org/escholarship/
It supports multi-field highlighting, and ranks the matches within a document field. It highlights the extent of the actual hits, as well as the terms within a hit (click on a text hit to see this highlighting). I think that's what Doug means by "phrasal" matching.
Unfortunately, it involves significant additions to the Lucene core. In essence it relies on an amped-up span system that is capable of scoring the spans, as well as recording which spans matched for each document field.
This is the second rev of the code, and was designed to be contributed back into Lucene. It's already Apache-licensed, and pretty well documented. I also tried to ensure zero speed impact for queries that don't need span recording. Here's the project page: http://sourceforge.net/projects/xtf
A few weeks ago I joined the Lucene dev mailing list, and I've been trying to get the lay of the land before I suggest changes to the Lucene core. Okay, that's only partly true. Actually, I've never contributed to a project like this before, and have been trying to work up the courage.
The code is based on 1.4.3; if people are interested, I'll work on a patch to the current svn trunk. I'll also have to port our test suite over to JUnit.
--Martin
On Fri, 06 May 2005 12:04:25 -0700, Doug Cutting wrote:
> There's a post over at SearchEngineWatch theorizing about how
> Google produces summaries.
>
> http://forums.searchenginewatch.com/showthread.php?threadid=5448
>
> Lucene's current highlighter doesn't easily support multi-fields,
> nor does it take phrasal matching into account. It might be useful
> to have a highlighter API that takes a Document and summarizes all
> of its fields, incorporating their boosts in fragment scores.
> Thoughts?
>
> Doug