Posted to user@nutch.apache.org by Dennis Kubes <nu...@dragonflymc.com> on 2006/10/24 16:32:34 UTC

Re: Plugin HitCollector

Our problem is that we need to count hits for sub-categories, and there 
are over 550,000 categories.  I am assuming that one pre-populated 
BitSet per category is not practical at that scale?  Is there a good 
way to do this kind of per-category counting?
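
To put a rough number on that assumption, here is a back-of-the-envelope 
sketch.  The class name and the index size of 1,000,000 documents are 
hypothetical, chosen only for illustration; the only figure taken from 
our setup is the category count.

```java
// Rough memory estimate for keeping one bitset per category.
// The document count passed in is an assumed, illustrative figure.
final class BitSetMemoryEstimate {

    /** Bytes needed to hold one bit per document for every category. */
    static long bytesForCategoryBitSets(long numCategories, long numDocs) {
        long bytesPerBitSet = (numDocs + 7) / 8; // one bit per doc, rounded up to whole bytes
        return numCategories * bytesPerBitSet;
    }
}
```

With 550,000 categories and an assumed 1,000,000 documents, each bitset 
is 125,000 bytes, so the total comes to 68,750,000,000 bytes (about 
68.75 GB) before any object overhead.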

Dennis

Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> We are running into the same issue.  Remember that a hit gives you 
>> only the doc id; getting the hit details does another read.  So 
>> looping through the hits to access every document does one read per 
>> document.  For a small number of hits that is no big deal, but the 
>> more hits you access, the more time it takes.  In our situation 
>> limiting the query doesn't work: we need to know information about 
>> the hit itself (i.e. a certain field, so we can do a count based on 
>> that field).  We implemented it using HitCollector modifications in 
>> Lucene.  This works but is not ideal in terms of speed, so we are 
>> looking at modifying the IndexReader itself so that when it gets the 
>> Hits it also gets our field.  Understand, though, that doing 
>> something like this changes core Lucene functionality.  I am not 
>> necessarily recommending doing it this way; we just couldn't find 
>> another way.
>
> Well, it all depends on what kind of details you need to get from each 
> hit. Have you tried using FieldCache instead? Or pre-populated BitSets 
> which you would then intersect with the result BitSet to get counts of 
> matching docs?
>
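
For reference, the two suggestions above can be sketched in plain Java 
using only java.util.BitSet.  This is a minimal illustration, not Lucene 
code: the class and method names are hypothetical, and the docToCategory 
array stands in for whatever a FieldCache-style lookup would load once 
per index (one category ordinal per doc id), which is an assumption 
about how the category field is stored.

```java
import java.util.BitSet;

// Hypothetical helper; names are illustrative and not part of Lucene or Nutch.
final class CategoryCounts {

    /** Counts docs present in both the result set and one category's pre-populated set. */
    static int countByIntersection(BitSet results, BitSet categoryDocs) {
        BitSet both = (BitSet) results.clone(); // and() mutates its receiver, so work on a copy
        both.and(categoryDocs);
        return both.cardinality();
    }

    /**
     * Counts hits per category in a single pass over the result set.
     * docToCategory[docId] holds the category ordinal of that document
     * (assumed to be loaded once, in the spirit of a field cache).
     */
    static int[] countByOrdinal(BitSet results, int[] docToCategory, int numCategories) {
        int[] counts = new int[numCategories];
        for (int doc = results.nextSetBit(0); doc >= 0; doc = results.nextSetBit(doc + 1)) {
            counts[docToCategory[doc]]++; // array lookup only, no per-document read
        }
        return counts;
    }
}
```

The intersection variant needs one bitset per category, so it suits a 
small number of categories.  The ordinal variant needs a single int per 
document plus one counter per category, which may be the better fit when 
the category count is in the hundreds of thousands.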