Posted to java-user@lucene.apache.org by Tim Johnson <ti...@saic.com> on 2005/07/27 22:32:36 UTC

hit count within categories

I'm working on a problem where I need to search over 160 million
documents.  I know Lucene can do this no sweat; my problem is that these
documents are grouped into more than 500 categories.  I need to get a
count of the documents that match a given query, within each category.
There is no need to score the documents or even to access them;
I just need the count.

Currently I'm using an index per category so I can access the total
number of hits quickly.  I've tried to use a custom HitCollector object
and one large index to achieve the same thing but found that it was 3 to
4 times slower than iterating over 500 individual indexes.

Searches are sometimes taking more than 60 sec to run and can return
counts in the millions.

So my overall question is: can this be done?  Any suggestions would be
helpful.

Thanks

Tim


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: hit count within categories

Posted by Ray Tsang <sa...@gmail.com>.
I also had a similar problem.  It was essentially a 'group by'-like
requirement.  I tried both get(fieldName) and getTermFreqVector(...);
for me, get(fieldName) on a page of results (say, 10 results
per page) seemed faster than getTermFreqVector().

ray

On 7/29/05, mark harwood <ma...@yahoo.co.uk> wrote:
> > Is there a faster way to access the total hits
> > count??
> 
> The solution I outlined could be adapted to work
> across multiple indexes - you'd just have to aggregate
> the totals.
> 
> If going from all category terms to matching doc ids
> is slow you could do it the other way going from
> matching doc ids to terms.
> 
> You can feasibly do this by:
> a) IndexReader.document(hitDocId).get("category")
> or
> b) IndexReader.getTermFreqVector(hitDocId,"category").getTerms()
> 
> Unfortunately a) reads ALL fields for a doc off the
> disk and is probably very slow. b) would be quicker
> but would require you to index with TermFreqVector
> support.
> I'm not sure if b) would be faster than the term to
> docids approach I originally suggested - you'd have to
> try it and see how it performs on your data.
> 
> Cheers,
> Mark


RE: hit count within categories

Posted by mark harwood <ma...@yahoo.co.uk>.
> Is there a faster way to access the total hits
> count??

The solution I outlined could be adapted to work
across multiple indexes - you'd just have to aggregate
the totals.
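
As a rough illustration of the aggregation step, here is a minimal sketch
in plain Java. It does not use the Lucene API: it assumes each index has
already produced its own category-to-count map, and the names
AggregateCountsSketch and aggregate are made up for this example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.ArrayList;
import java.util.Map;

public class AggregateCountsSketch {

    /**
     * Sums per-category hit counts coming from several indexes.
     * Each map in perIndexCounts is assumed to hold the counts
     * produced by searching one index (hypothetical input shape).
     */
    public static Map<String, Long> aggregate(List<Map<String, Long>> perIndexCounts) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> counts : perIndexCounts) {
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                // Add this index's count to the running total for the category.
                totals.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Long> indexA = new HashMap<>();
        indexA.put("sports", 10L);
        indexA.put("news", 5L);

        Map<String, Long> indexB = new HashMap<>();
        indexB.put("sports", 7L);
        indexB.put("weather", 2L);

        List<Map<String, Long>> all = new ArrayList<>();
        all.add(indexA);
        all.add(indexB);

        Map<String, Long> totals = aggregate(all);
        System.out.println("sports=" + totals.get("sports")); // prints sports=17
    }
}
```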

If going from all category terms to matching doc ids
is slow you could do it the other way going from
matching doc ids to terms.

You can feasibly do this by:
a) IndexReader.document(hitDocId).get("category")
or
b) IndexReader.getTermFreqVector(hitDocId,"category").getTerms()

Unfortunately a) reads ALL fields for a doc off the
disk and is probably very slow. b) would be quicker
but would require you to index with TermFreqVector
support.
I'm not sure if b) would be faster than the term to
docids approach I originally suggested - you'd have to
try it and see how it performs on your data.
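
To make the doc-id-to-category direction concrete, here is a minimal
sketch in plain Java. It is not the Lucene API and not the code from my
earlier posting. It assumes the category of every document has already
been loaded into an int array indexed by doc id, and that the query's
matching doc ids are available as a BitSet; CategoryCountSketch,
countByCategory and categoryOfDoc are hypothetical names.

```java
import java.util.Arrays;
import java.util.BitSet;

public class CategoryCountSketch {

    /**
     * Counts matching documents per category.
     *
     * matchingDocs  - bit i is set if doc id i matched the query (assumed input)
     * categoryOfDoc - categoryOfDoc[i] is the category ordinal of doc id i,
     *                 assumed to be precomputed once and kept in memory
     * numCategories - number of distinct category ordinals
     */
    public static int[] countByCategory(BitSet matchingDocs, int[] categoryOfDoc, int numCategories) {
        int[] counts = new int[numCategories];
        // Walk only the set bits, i.e. only the docs that matched.
        for (int doc = matchingDocs.nextSetBit(0); doc >= 0; doc = matchingDocs.nextSetBit(doc + 1)) {
            counts[categoryOfDoc[doc]]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Six docs spread over three categories (ordinals 0, 1, 2).
        int[] categoryOfDoc = {0, 1, 1, 2, 0, 1};

        // Pretend docs 1, 2, 4 and 5 matched the query.
        BitSet hits = new BitSet();
        hits.set(1);
        hits.set(2);
        hits.set(4);
        hits.set(5);

        int[] counts = countByCategory(hits, categoryOfDoc, 3);
        System.out.println(Arrays.toString(counts)); // prints [1, 3, 0]
    }
}
```

The cost of this loop is proportional to the number of hits rather than
the number of categories, so whether it wins depends on how many docs a
typical query matches; the array itself costs one int per document.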

Cheers,
Mark



		


RE: hit count within categories

Posted by Tim Johnson <ti...@saic.com>.
Thanks Mark

I've looked at your posting and it's not the answer to my problem.  In
testing one large index vs. several small indexes, I've found that for
high-frequency terms, the small individual indexes perform better by a
factor of 2 to 3.  I know this is contrary to what is recommended,
but in my case I need a total hit count, not a ranked list with the
first 100 docs cached.

It appears that the overhead of resolving terms to doc ids while
getting a total count is what slows things down for large indexes
(25G+).

Is there a faster way to access the total hits count??

-----Original Message-----
From: java-user-return-15600-timothy.w.johnson=saic.com@lucene.apache.org
[mailto:java-user-return-15600-timothy.w.johnson=saic.com@lucene.apache.org]
On Behalf Of markharw00d
Sent: Wednesday, July 27, 2005 4:42 PM
To: java-user@lucene.apache.org
Subject: Re: hit count within categories

I posted the code I use to do this (based on a single index) here:

http://marc.theaimsgroup.com/?l=lucene-dev&m=111044178212335&w=2

Cheers
Mark


	
	
		


Re: hit count within categories

Posted by markharw00d <ma...@yahoo.co.uk>.
I posted the code I use to do this (based on a single index) here:

http://marc.theaimsgroup.com/?l=lucene-dev&m=111044178212335&w=2

Cheers
Mark


	
	
		