You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Dan Funk <fu...@BATTELLE.ORG> on 2005/08/17 21:29:07 UTC

to filter or not to filter

Currently I'm working with a single index where content is indexed by 
it's original printed page. I have to show the total number of matching 
documents, so I end up running through all the hits and taking an order 
of magnitude hit on performance as I calculate the number of unique 
documents.  It's stupid for many many reasons.

To correct all this, I've decided to create two (maybe three) indexes 
for the same set of documents: in the first index there is a one to one 
relationship between the original document and the Lucene Document 
object.  The other index is a paragraph index, where each lucene 
document represents a single paragraph.   I may even throw in a third 
index where each lucene document represents a logical section/chapter.

When I'm building the search results page I'll have to execute a fair 
number of queries. The first query will execute on the Document-Index, 
then for each of the 10 to 2o results I'm displaying at the time, I'll 
execute another query to find the best paragraph and or section.

Is this a reasonable solution to the problem? 

Thanks for the advice.
Dan


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: to filter or not to filter

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Aug 17, 2005, at 3:29 PM, Dan Funk wrote:
> Currently I'm working with a single index where content is indexed  
> by it's original printed page. I have to show the total number of  
> matching documents, so I end up running through all the hits and  
> taking an order of magnitude hit on performance as I calculate the  
> number of unique documents.  It's stupid for many many reasons.
>
> To correct all this, I've decided to create two (maybe three)  
> indexes for the same set of documents: in the first index there is  
> a one to one relationship between the original document and the  
> Lucene Document object.  The other index is a paragraph index,  
> where each lucene document represents a single paragraph.   I may  
> even throw in a third index where each lucene document represents a  
> logical section/chapter.
>
> When I'm building the search results page I'll have to execute a  
> fair number of queries. The first query will execute on the  
> Document-Index, then for each of the 10 to 2o results I'm  
> displaying at the time, I'll execute another query to find the best  
> paragraph and or section.
>
> Is this a reasonable solution to the problem?
> Thanks for the advice.

Just one design alternative - a Lucene index does not have to be  
homogenous in terms of the fields for a document.  So you could index  
all those various granularities into a single index with an  
additional field per document indicating whether it is a document,  
paragraph, or section/chapter.

     Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org