Posted to dev@lucene.apache.org by Doug Cutting <cu...@apache.org> on 2005/05/06 21:04:25 UTC

multi-field highlighting

There's a post over at SearchEngineWatch theorizing about how Google 
produces summaries.

http://forums.searchenginewatch.com/showthread.php?threadid=5448

Lucene's current highlighter doesn't easily support multiple fields, nor 
does it take phrasal matching into account.  It might be useful to have 
a highlighter API that takes a Document and summarizes all of its fields, 
incorporating their boosts in fragment scores.  Thoughts?

Doug
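
To make the proposal above concrete, here is a minimal sketch in plain Java. It is not Lucene code and not anything posted to the list: MultiFieldSummarizer, Fragment, the fixed-size word windows and the "hits times boost" score are all assumptions chosen for illustration. It only shows the shape of an API that takes every field of a document and folds the field boost into each fragment's score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: MultiFieldSummarizer and Fragment are not Lucene classes.
class MultiFieldSummarizer {

    /** A candidate snippet from one field with its boost-weighted score. */
    static final class Fragment {
        final String field;
        final String text;
        final double score;

        Fragment(String field, String text, double score) {
            this.field = field;
            this.text = text;
            this.score = score;
        }
    }

    /**
     * Cuts each field into windows of windowSize words (must be > 0), scores a
     * window as (number of query-term hits) * (field boost, default 1.0), and
     * returns the best maxFragments windows. queryTerms must be lower-case.
     */
    static List<Fragment> summarize(Map<String, String> fields,
                                    Map<String, Double> boosts,
                                    Set<String> queryTerms,
                                    int windowSize,
                                    int maxFragments) {
        List<Fragment> candidates = new ArrayList<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            double boost = boosts.getOrDefault(e.getKey(), 1.0);
            String[] words = e.getValue().split("\\s+");
            for (int start = 0; start < words.length; start += windowSize) {
                int end = Math.min(start + windowSize, words.length);
                int hits = 0;
                for (int i = start; i < end; i++) {
                    // Crude normalization: lower-case and drop punctuation.
                    String w = words[i].toLowerCase().replaceAll("[^a-z0-9]", "");
                    if (queryTerms.contains(w)) hits++;
                }
                if (hits > 0) {
                    String text = String.join(" ", Arrays.copyOfRange(words, start, end));
                    candidates.add(new Fragment(e.getKey(), text, hits * boost));
                }
            }
        }
        // Highest score first; the sort is stable, so ties keep field order.
        candidates.sort(Comparator.comparingDouble((Fragment f) -> f.score).reversed());
        return new ArrayList<>(candidates.subList(0, Math.min(maxFragments, candidates.size())));
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "Lucene in Action");
        fields.put("body", "This book covers search. Lucene is a search library.");
        Map<String, Double> boosts = new LinkedHashMap<>();
        boosts.put("title", 2.0);
        Set<String> terms = new HashSet<>(Arrays.asList("lucene"));
        for (Fragment f : summarize(fields, boosts, terms, 4, 2)) {
            System.out.println(f.field + " (" + f.score + "): " + f.text);
        }
    }
}
```

A real implementation would presumably reuse the analyzer and the query's own scoring rather than counting raw term hits; the sketch ignores phrases entirely.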


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: multi-field highlighting

Posted by markharw00d <ma...@yahoo.co.uk>.

Doug Cutting wrote:

> Shouldn't the search code already take care of that?  

No, the search may return documents that happen to contain "Doug 
Cutting" and Google. The current highlighter implementation uses all 
query terms (ignoring any AND/OR operators) and looks for matches. 
Ideally "Doug Cutting" shouldn't be highlighted in the document "Doug 
Cutting loves google" when I searched for ("Doug Cutting" AND lucene) OR 
google.

This is a nice-to-have and I suspect this is not an issue people feel 
strongly about. We could continue to ignore the complexities of 
representing the results of such boolean logic - most queries don't use 
it anyway.

> The query should thus be compared to each potential highlight 
> fragment.  This evaluation is different than the whole-document 
> evaluation performed by search.  If no fragments match the entire 
> query, then fragments should be selected which, considered together, 
> match the entire query.

Is this based on the approach (which I think you suggested previously) of 
chopping the doc into fragment-sized docs held in a RAM directory and then 
querying it to get the best fragments? I think it would prove difficult to 
identify the combination of fragments that ultimately satisfied a query 
which contained complex boolean logic.

My original idea for an approach was to let the queries initially 
generate a "heat map" which scored every token in the document. Any 
boolean query which failed to be satisfied completely (e.g. the Doug AND 
lucene example) would not generate a score for its tokens. Phrase 
queries would only score the token occurrences in the document where all 
tokens were grouped.
The highlighter would then use the heat map to pick the best "runs" of 
tokens.
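
The heat map idea can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not highlighter code: HeatMapSketch, Node, the flat score of 1.0 per matching token and the fixed-size "run" window are all invented here. Each query node returns one score per token; an AND node that is not fully satisfied contributes nothing, and a phrase node scores only positions where its words are adjacent.

```java
import java.util.Arrays;

// Hypothetical sketch of the "heat map" idea; none of these types are Lucene classes.
class HeatMapSketch {

    /** A query node that assigns a score to every token position in the document. */
    interface Node {
        double[] heat(String[] tokens);
    }

    /** Scores every occurrence of a single term. */
    static Node term(String t) {
        return tokens -> {
            double[] h = new double[tokens.length];
            for (int i = 0; i < tokens.length; i++) {
                if (tokens[i].equalsIgnoreCase(t)) h[i] = 1.0;
            }
            return h;
        };
    }

    /** Scores only positions where all words appear consecutively (no slop). */
    static Node phrase(String... words) {
        return tokens -> {
            double[] h = new double[tokens.length];
            for (int i = 0; i + words.length <= tokens.length; i++) {
                boolean match = true;
                for (int j = 0; j < words.length && match; j++) {
                    match = tokens[i + j].equalsIgnoreCase(words[j]);
                }
                if (match) {
                    for (int j = 0; j < words.length; j++) h[i + j] = 1.0;
                }
            }
            return h;
        };
    }

    /** Sums child heat, but contributes nothing unless every child matched somewhere. */
    static Node and(Node... children) {
        return tokens -> {
            double[] total = new double[tokens.length];
            for (Node c : children) {
                double[] h = c.heat(tokens);
                if (Arrays.stream(h).sum() == 0.0) return new double[tokens.length];
                for (int i = 0; i < h.length; i++) total[i] += h[i];
            }
            return total;
        };
    }

    /** Sums the heat of whichever children matched. */
    static Node or(Node... children) {
        return tokens -> {
            double[] total = new double[tokens.length];
            for (Node c : children) {
                double[] h = c.heat(tokens);
                for (int i = 0; i < h.length; i++) total[i] += h[i];
            }
            return total;
        };
    }

    /** Returns the start index of the window of `size` tokens with the highest total heat. */
    static int bestRun(double[] heat, int size) {
        int best = 0;
        double bestSum = -1.0;
        for (int start = 0; start + size <= heat.length; start++) {
            double sum = 0.0;
            for (int i = start; i < start + size; i++) sum += heat[i];
            if (sum > bestSum) {
                bestSum = sum;
                best = start;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // ("Doug Cutting" AND lucene) OR google
        Node query = or(and(phrase("Doug", "Cutting"), term("lucene")), term("google"));
        String[] doc = "Doug Cutting loves google".split(" ");
        System.out.println(Arrays.toString(query.heat(doc)));
    }
}
```

For the document "Doug Cutting loves google" only the last token receives heat, because the AND branch fails without "lucene"; that is the behaviour asked for earlier in this message.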

Re: multi-field highlighting

Posted by Doug Cutting <cu...@apache.org>.

markharw00d wrote:
> Before we leap into adding code into the highlighter though I think it's 
> worth considering what we are trying to fix here in a more general sense.
> As a basic principle I think highlighting should attempt to show the 
> user what the search engine saw as important in the document.
> With that principle in mind I should really make sure that if I search for:
> ("Doug Cutting" AND lucene) OR google
> 
> I shouldn't highlight  "Doug Cutting" in a matching document that has 
> google but not lucene.

Shouldn't the search code already take care of that?  That said, for a 
document that contains both "Doug Cutting loves Lucene" and "Doug 
Cutting loves Google", ideally a highlighter should prefer "Doug Cutting 
loves Google".  The query should thus be compared to each potential 
highlight fragment.  This evaluation is different than the 
whole-document evaluation performed by search.  If no fragments match 
the entire query, then fragments should be selected which, considered 
together, match the entire query.

Doug
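
A rough sketch of the per-fragment evaluation described above, in plain Java. The names (FragmentSelector, select) are hypothetical, and the query is reduced to a Predicate over lower-cased text, which is an assumption made to keep the example small. The sketch returns the first single fragment that satisfies the whole query and otherwise falls back to the first pair of fragments that satisfies it together. It does not rank between several fragments that each match on their own, so it does not capture any preference among them.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch; FragmentSelector is not part of Lucene.
class FragmentSelector {

    /**
     * Returns the fewest fragments (at most two in this sketch) whose combined
     * text satisfies the query, or an empty list if no single fragment or pair does.
     */
    static List<String> select(List<String> fragments, Predicate<String> query) {
        // First choice: one fragment that matches the entire query by itself.
        for (String f : fragments) {
            if (query.test(f.toLowerCase())) {
                return Collections.singletonList(f);
            }
        }
        // Fallback: two fragments that match the query when considered together.
        // Joining with a newline keeps phrases from matching across the boundary.
        for (int i = 0; i < fragments.size(); i++) {
            for (int j = i + 1; j < fragments.size(); j++) {
                String combined = (fragments.get(i) + "\n" + fragments.get(j)).toLowerCase();
                if (query.test(combined)) {
                    return Arrays.asList(fragments.get(i), fragments.get(j));
                }
            }
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        // lucene AND highlighter, expressed as a plain substring test.
        Predicate<String> query = t -> t.contains("lucene") && t.contains("highlighter");
        List<String> fragments = Arrays.asList(
            "Lucene is a search library",
            "The highlighter marks hits",
            "Unrelated text");
        System.out.println(select(fragments, query));
    }
}
```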

Re: multi-field highlighting

Posted by markharw00d <ma...@yahoo.co.uk>.

Phrase highlighting (and spans) would certainly be useful, as would 
multi-field.

Before we leap into adding code into the highlighter though I think it's 
worth considering what we are trying to fix here in a more general sense.
As a basic principle I think highlighting should attempt to show the 
user what the search engine saw as important in the document.
With that principle in mind I should really make sure that if I search for:
("Doug Cutting" AND lucene) OR google

I shouldn't highlight  "Doug Cutting" in a matching document that has 
google but not lucene.

If we are going to try to be true to representing the query logic in our 
display, we end up having to re-implement a lot of the query logic in 
the highlighter, e.g. taking account of slop factors.
We could avoid over-complicating the highlighter in this way if the 
different queries could provide information of use in highlighting: a 
variant of the "explain" function that would describe not only the 
scoring but also the sections of the document to which these scores relate.

Does this approach sound feasible?




> There's a post over at SearchEngineWatch theorizing about how Google 
> produces summaries.
>
> http://forums.searchenginewatch.com/showthread.php?threadid=5448
>
> Lucene's current highlighter doesn't easily support multi-fields, nor 
> does it take phrasal matching into account.  It might be useful to 
> have a highlighter API that takes a Document and summarizes all of its 
> fields, incorporating their boosts in fragment scores.  Thoughts?
>
> Doug

Re: multi-field highlighting

Posted by Martin Haye <m1...@snyder-haye.com>.

Hi Mark,

Thanks very much for your comments.

> It looks like all queries on the site have to be a span of some
> kind (ie all search terms must appear in the document). Is your
> highlighting code applicable to other modes of querying?

In a sense you're correct, in that the system relies on all leaf queries to be span queries. Doug's original span code was already set up to report the positional information of hits, while a normal TermQuery for instance doesn't even iterate the positions. I also added a couple new span queries to fill out the set: SpanRangeQuery and SpanWildcardQuery.

Within one field, complex queries are joined with the normal span operators -- SpanOrQuery, SpanNearQuery, SpanNotQuery, etc. To form a single query across multiple fields, it's the usual (non-span-oriented) BooleanQuery.

So to answer your question, the system isn't applicable to queries that don't use spans. But since we want to highlight based on positional data, that seemed only right.
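
To illustrate why positional data matters for highlighting, here is a small plain-Java sketch of ordered proximity matching that records where each hit starts and ends. It is written in the spirit of a span "near" match but is not Lucene's span implementation and not the XTF code; SpanSketch and near are hypothetical names, and slop is taken here to mean the number of tokens allowed between the two terms.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; this is not how Lucene's span queries are implemented.
class SpanSketch {

    /**
     * Returns [start, end) token ranges where term `a` is followed by term `b`
     * with at most `slop` other tokens in between. A highlighter can mark the
     * whole range as the hit and the two end tokens as the matching terms.
     */
    static List<int[]> near(String[] tokens, String a, String b, int slop) {
        List<int[]> spans = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equalsIgnoreCase(a)) continue;
            // Look ahead only as far as the slop allows.
            for (int j = i + 1; j < tokens.length && j - i - 1 <= slop; j++) {
                if (tokens[j].equalsIgnoreCase(b)) {
                    spans.add(new int[] {i, j + 1});
                    break; // keep the shortest span starting at i
                }
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String[] tokens = "span queries report the positions of span hits".split(" ");
        for (int[] s : near(tokens, "span", "hits", 0)) {
            System.out.println("hit covers tokens " + s[0] + " to " + (s[1] - 1));
        }
    }
}
```

A plain term match that only answers "does this document contain the term" has no such ranges to hand to a highlighter.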

Someone mentioned a while back modifying the query parser to (optionally?) produce span queries. This might be another good reason to do so... or are you suggesting something else?

--Martin

Re: multi-field highlighting

Posted by markharw00d <ma...@yahoo.co.uk>.

Hi Martin, welcome to the group.

 >>You can see it in action here:
Very nice work! I like the forward/backward links between hits.

>Unfortunately, it involves significant additions to the Lucene core. In essence it relies on an amped-up span system that is capable of scoring the spans, as well as recording which spans matched for each document field.
It looks like all queries on the site have to be a span of some kind (ie 
all search terms must appear in the document). Is your highlighting code 
applicable to other modes of querying?

Cheers,
Mark

Re: multi-field highlighting

Posted by Martin Haye <m1...@snyder-haye.com>.

As part of my work on XTF for the California Digital Library, I've written such a highlighter. You can see it in action here:

	http://texts.cdlib.org/escholarship/

It supports multi-field highlighting, and ranks the matches within a document field. It highlights the extent of the actual hits, as well as the terms within a hit (click on a text hit to see this highlighting). I think that's what Doug means by "phrasal" matching.

Unfortunately, it involves significant additions to the Lucene core. In essence it relies on an amped-up span system that is capable of scoring the spans, as well as recording which spans matched for each document field.

This is the second rev of the code, and was designed to be contributed back to Lucene. It's already Apache-licensed, and pretty well documented. I also tried to ensure zero speed impact for queries that don't need span recording. Here's the project page: http://sourceforge.net/projects/xtf

A few weeks ago I joined the Lucene dev mailing list, and I've been trying to get the lay of the land before I suggest changes to the Lucene core. Okay, that's only partly true. Actually, I've never contributed to a project like this before, and have been trying to work up the courage.

The code is based on 1.4.3; if people are interested, I'll work on a patch to the current svn trunk. I'll also have to port our test suite over to JUnit.

--Martin

On Fri, 06 May 2005 12:04:25 -0700, Doug Cutting wrote:
> There's a post over at SearchEngineWatch theorizing about how
> Google produces summaries.
>
> http://forums.searchenginewatch.com/showthread.php?threadid=5448
>
> Lucene's current highlighter doesn't easily support multi-fields,
> nor does it take phrasal matching into account.  It might be useful
> to have a highlighter API that takes a Document and summarizes all
> of its fields, incorporating their boosts in fragment scores.
> Thoughts?
>
> Doug