Posted to java-user@lucene.apache.org by Ilya Zavorin <iz...@caci.com> on 2011/12/01 17:49:09 UTC

Design qs: search for multiple terms in document collection

I am trying to make some high-level (and not so high-level) design decisions for an app that checks a collection of documents against a set of terms/queries. Basically, I need to perform a triage of sorts: find only those docs in the collection that contain at least one term from the term list. For those docs, I also need to find where in the document each occurrence is, since I then need to collect a small amount of surrounding text for a more detailed analysis.
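To make the requirement concrete, here is a minimal plain-Java sketch of the triage described above. It deliberately uses no Lucene classes: the names TermTriage, Hit, findHits and the window parameter are all hypothetical, and matching is a naive case-insensitive substring scan rather than anything an analyzer or index would do.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: naive term triage with context windows.
final class TermTriage {

    /** One occurrence of a term: where it starts and the text around it. */
    static final class Hit {
        final String term;
        final int start;       // character offset of the match in the document
        final String context;  // the match plus up to `window` chars on each side

        Hit(String term, int start, String context) {
            this.term = term;
            this.start = start;
            this.context = context;
        }
    }

    /**
     * Scans one document for every term and returns all occurrences.
     * An empty result means the document fails the triage.
     */
    static List<Hit> findHits(String text, List<String> terms, int window) {
        List<Hit> hits = new ArrayList<>();
        for (String term : terms) {
            if (term.isEmpty()) {
                continue; // an empty term would match everywhere
            }
            int i = 0;
            while (i + term.length() <= text.length()) {
                // Case-insensitive comparison at offset i; offsets stay valid
                // because the original text is never transformed.
                if (text.regionMatches(true, i, term, 0, term.length())) {
                    int from = Math.max(0, i - window);
                    int to = Math.min(text.length(), i + term.length() + window);
                    hits.add(new Hit(term, i, text.substring(from, to)));
                    i += term.length(); // skip past this match
                } else {
                    i++;
                }
            }
        }
        return hits;
    }
}
```

Calling TermTriage.findHits(docText, terms, 40) on each document would give both the triage decision (empty list or not) and the surrounding text in a single pass. What I am asking is how best to get the equivalent out of Lucene.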

Clearly, I will need to index the document collection using Lucene's indexing classes. This is pretty straightforward.

Then I will need to use the highlighting classes. In some sample code I found online, a query is first searched for and hits are returned. Then doc IDs are extracted for the hits and the query is highlighted. Some questions:

Q1: Does Lucene perform essentially the same searching operation twice, first to find hits, then to highlight? If so, does this mean that if I expect most of the docs in my collection to contain at least one of the search terms, it might be faster for me to skip searching and simply go over all docs, applying highlighting? Then for those docs where no hits occurred I would simply get an empty list of relevant fragments. 

Q2: Is the same scoring mechanism used during search and during highlighting? That is, can I be sure that if I get a hit during search, the corresponding document indeed contains my query, which will then be found during highlighting?

Q3: Are there any mechanisms in Lucene that would facilitate merging of highlighting results for two different queries against a single document? 
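To illustrate what I mean by merging in Q3: if each query yields a list of character ranges in the same document, I would like overlapping or adjacent ranges combined so that the surrounding text is collected only once. A plain-Java sketch of that idea follows. It is not a Lucene API; SpanMerger, Span and merge are hypothetical names, and ranges are assumed to be [start, end) character offsets.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Lucene: merges fragment ranges from two queries.
final class SpanMerger {

    /** A half-open character range [start, end) within one document. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Returns the union of both lists with overlapping or touching ranges fused. */
    static List<Span> merge(List<Span> a, List<Span> b) {
        List<Span> all = new ArrayList<>(a);
        all.addAll(b);
        all.sort(Comparator.comparingInt((Span s) -> s.start));

        List<Span> merged = new ArrayList<>();
        for (Span s : all) {
            int lastIdx = merged.size() - 1;
            if (lastIdx >= 0 && s.start <= merged.get(lastIdx).end) {
                // Overlaps or touches the previous range: extend it.
                Span last = merged.get(lastIdx);
                merged.set(lastIdx, new Span(last.start, Math.max(last.end, s.end)));
            } else {
                merged.add(s);
            }
        }
        return merged;
    }
}
```

For example, ranges [0, 10) and [20, 30) from one query merged with [5, 15) and [40, 50) from another would give [0, 15), [20, 30) and [40, 50).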

Q4: I did some small tests of highlighting and noticed that some of the fragments returned for a query contained highlighted text that was quite far from the original query. For instance, I was looking for a 3-word term and it highlighted a sequence of only 2 of these 3 words. How can I control how close highlighted fragments should be to the original query?



Thanks much,

Ilya Zavorin


