Posted to dev@lucene.apache.org by mike_大雄 <42...@qq.com> on 2014/02/17 05:10:53 UTC
Only highlight terms that caused a search hit/match

Hello, 

I have recently been given a requirement to improve document highlighting
within our system. Unfortunately, the current functionality makes more of a
best guess at which terms to highlight rather than highlighting the terms that
actually produced the match. A couple of examples of the issues we found:

A nested boolean clause that ANDs a term that doesn’t exist with a term that
does still highlights the existing term, even though its clause did not match:
    Text: a b c
    Logical Query: a OR (b AND z)
    Result: *a* *b* c
    Expected: *a* b c

A nested span query doesn’t maintain the proper positions and offsets:
    Text: y z x y z a
    Logical Query: (“x y z”, a) span near 10
    Result: *y* *z* *x* *y* *z* *a*
    Expected: y z *x* *y* *z* *a*
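
To make the span example concrete, here is a minimal sketch in plain Java of what position-aware matching would have to do. It is not Lucene code: the class and method names are hypothetical, and the "gap" used for the slop check is a simplification of how Lucene measures span distance. Only the phrase occurrence that has the term within the slop contributes positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of position-aware span matching. Hypothetical names; not Lucene code. */
public class SpanHighlightSketch {

    /** Start positions at which all words of the phrase occur consecutively. */
    static List<Integer> phraseStarts(List<String> tokens, List<String> phrase) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i + phrase.size() <= tokens.size(); i++) {
            if (tokens.subList(i, i + phrase.size()).equals(phrase)) {
                starts.add(i);
            }
        }
        return starts;
    }

    /** Positions to highlight for near(phrase, term, slop), in either order. */
    static Set<Integer> nearMatchPositions(List<String> tokens, List<String> phrase,
                                           String term, int slop) {
        Set<Integer> matched = new TreeSet<>();
        for (int start : phraseStarts(tokens, phrase)) {
            int end = start + phrase.size() - 1; // last position of this phrase occurrence
            for (int p = 0; p < tokens.size(); p++) {
                if (!tokens.get(p).equals(term)) {
                    continue;
                }
                // Assumed distance measure: tokens strictly between phrase and term.
                // A term position inside the phrase yields a negative gap and is skipped.
                int gap = p > end ? p - end - 1 : start - p - 1;
                if (gap >= 0 && gap <= slop) {
                    for (int i = start; i <= end; i++) {
                        matched.add(i);
                    }
                    matched.add(p);
                }
            }
        }
        return matched;
    }

    /** Wraps the tokens at the given positions in asterisks. */
    static String render(List<String> tokens, Set<Integer> positions) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            String t = tokens.get(i);
            sb.append(positions.contains(i) ? "*" + t + "*" : t);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("y", "z", "x", "y", "z", "a");
        Set<Integer> hits = nearMatchPositions(tokens, Arrays.asList("x", "y", "z"), "a", 10);
        System.out.println(render(tokens, hits)); // prints: y z *x* *y* *z* *a*
    }
}
```

The leading "y z" stays unhighlighted because those positions are not part of any "x y z" occurrence, which is exactly the information a term-only map loses.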

I am currently using the Highlighter with a QueryScorer and a
SimpleSpanFragmenter. Looking through the code, it appears that the
entire query structure is dropped in the WeightedSpanTermExtractor, which just
grabs every positive TermQuery and flattens them all into a simple Map
that is then used to highlight all of those terms. I believe this
oversimplification of term extraction is the crux of the issue and needs to be
modified in order to produce more “exact” highlights.
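
As a rough illustration of why flattening produces the first result above, here is a toy model in plain Java. It is not the WeightedSpanTermExtractor code, only my understanding of its effect: the Node class and the helper names are hypothetical, and the query tree is reduced to TERM/AND/OR nodes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of flattened term extraction. Hypothetical names; not Lucene code. */
public class FlattenedHighlightSketch {

    /** A query node: either a leaf term or a boolean AND/OR over child nodes. */
    static final class Node {
        final String op;           // "TERM", "AND" or "OR"
        final String term;         // only set for TERM nodes
        final List<Node> children; // empty for TERM nodes

        Node(String op, String term, Node... children) {
            this.op = op;
            this.term = term;
            this.children = Arrays.asList(children);
        }
    }

    static Node term(String t) { return new Node("TERM", t); }
    static Node and(Node... c) { return new Node("AND", null, c); }
    static Node or(Node... c) { return new Node("OR", null, c); }

    /** Collects every leaf term and discards the boolean structure around it. */
    static void flatten(Node q, Set<String> out) {
        if (q.op.equals("TERM")) {
            out.add(q.term);
        }
        for (Node child : q.children) {
            flatten(child, out);
        }
    }

    /** Highlights every token that appears anywhere in the flattened term set. */
    static String highlight(List<String> tokens, Set<String> terms) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            String t = tokens.get(i);
            sb.append(terms.contains(t) ? "*" + t + "*" : t);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Node query = or(term("a"), and(term("b"), term("z")));
        Set<String> terms = new HashSet<>();
        flatten(query, terms); // {a, b, z}: the AND around b and z is gone
        System.out.println(highlight(Arrays.asList("a", "b", "c"), terms)); // prints: *a* *b* c
    }
}
```

Once "b" is in the set there is no way to tell that it only counts when "z" is also present, so it gets highlighted.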

I was brainstorming with a colleague, and we thought we could perhaps spin up
a MemoryIndex to index that one document and then perform a depth-first
search of all the queries within the overall Lucene query graph. We would
query the MemoryIndex for each leaf query and then walk back up the tree,
pruning branches that don’t result in a search hit, which leaves a map of the
query terms that actually matched. This approach seems pretty painful but
will hopefully produce better matches. I would like to hear what the experts
on the mailing list have to say about this approach. Is there a better way to
retrieve the query terms and positions that produced the match? Or perhaps
there is a different Highlighter implementation that we should use? Note that
our user queries are extremely complex, with a lot of nested queries of
various types.
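
The pruning idea can be sketched in plain Java as follows. This is only a toy model under stated assumptions: a Set of the document's terms stands in for the MemoryIndex lookup, the query tree is reduced to hypothetical TERM/AND/OR nodes, and negative clauses, spans and positions are not handled.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model of depth-first pruning of a query tree. Hypothetical names; not Lucene code. */
public class PrunedHighlightSketch {

    /** A query node: either a leaf term or a boolean AND/OR over child nodes. */
    static final class Node {
        final String op;           // "TERM", "AND" or "OR"
        final String term;         // only set for TERM nodes
        final List<Node> children; // empty for TERM nodes

        Node(String op, String term, Node... children) {
            this.op = op;
            this.term = term;
            this.children = Arrays.asList(children);
        }
    }

    static Node term(String t) { return new Node("TERM", t); }
    static Node and(Node... c) { return new Node("AND", null, c); }
    static Node or(Node... c) { return new Node("OR", null, c); }

    /**
     * Depth-first walk: returns the terms that contributed to a match,
     * or null when this node (and so its whole branch) does not match.
     */
    static Set<String> matchedTerms(Node q, Set<String> docTerms) {
        Set<String> hits = new LinkedHashSet<>();
        if (q.op.equals("TERM")) {
            // Stand-in for running the leaf query against the single-document index.
            if (!docTerms.contains(q.term)) {
                return null;
            }
            hits.add(q.term);
            return hits;
        }
        boolean isAnd = q.op.equals("AND");
        boolean anyChildMatched = false;
        for (Node child : q.children) {
            Set<String> sub = matchedTerms(child, docTerms);
            if (sub == null) {
                if (isAnd) {
                    return null; // one failed clause prunes the whole AND branch
                }
                continue;        // OR: drop only the failed clause
            }
            anyChildMatched = true;
            hits.addAll(sub);
        }
        return anyChildMatched ? hits : null;
    }

    /** Highlights every token that is in the set of matched terms. */
    static String highlight(List<String> tokens, Set<String> terms) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            String t = tokens.get(i);
            sb.append(terms.contains(t) ? "*" + t + "*" : t);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Node query = or(term("a"), and(term("b"), term("z")));
        List<String> tokens = Arrays.asList("a", "b", "c");
        Set<String> matched = matchedTerms(query, new HashSet<>(tokens));
        Set<String> toMark = matched == null ? Collections.<String>emptySet() : matched;
        System.out.println(highlight(tokens, toMark)); // prints: *a* b c
    }
}
```

Returning the matched terms from each node, rather than a plain boolean, is what lets the walk back up the tree carry only the terms from branches that survived.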

Thanks, 





--
View this message in context: http://lucene.472066.n3.nabble.com/Only-highlight-terms-that-caused-a-search-hit-match-tp4117692.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
