You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Mike Austin <ma...@gmail.com> on 2006/01/25 19:57:48 UTC

how to select top categories.

I have ~5 million documents that are in categories and subcategories. Let us
say that my query is for search terms in one top-level category and it
returns a large amount of documents and I want to list the top 5
sub-categories by highest count total. I know I can't go one by one counting
through the results because it is too slow.

I really don't care about the counts, I just want the sub-categories with
the most documents in them.

Options that I was thinking about:
1) Do queries for each sub-category using the results of the first initial
query and use the hits count to select the sub-categories to display, but I
might have thousands of sub-categories and it would be too slow..

2) Just select some sub-categories for each top-level category to always
show, but this is bad because that sub-category might not have items found
in it after the initial query.

3) Give-Up, but this is bad also..

Thanks for your help.

Re: how to select top categories.

Posted by Paul Elschot <pa...@xs4all.nl>.

On Wednesday 25 January 2006 22:24, Chris Hostetter wrote:
> 
> : for this site, but would you cash all manufacturers and intersect all with
> : the initial query in one page load? Seems like that would be alot.
> 
> Yep it is a lot, but if you've got the RAM, it's not that time intensive.
> At CNET, depending on what page you are looking at, i'm doing
> anywhere from 100-1000 intersections.
> 
> : So your saying that in a single page load I might be able to do one 
intitial
> : query, and intersect thousands of bitsets in under a second with and an
> : index of around 5 million documents, assuming that the server/pc is decent
> 
> I can't predict exactly what your performance will be based on your
> index size -- and i certainly can't make any promises about it being under
> a second total time ... but you should try it, i think it's very feasible.
> 
> : speed and enough memory? I still would have to cache thousands of 625k
> : bits.. Could I do this with files instead of RAM maybe?
> 
> i played with serializing BitSets to disk, it can be done .. but reading
> the cached files has a definite added cost.  if you really don't have the
> ram to store alll of the bitsets in memory then you're going to want to at
> least use an in memory cache for some fixed number of commonly used
> categories ... wether that cache falls back to reding from disk on miss,
> or falls back to executing the category specific filter/query again
> depends on your goals.
> 

These more compact filters might be useful here:
http://issues.apache.org/jira/browse/LUCENE-328

They would probably take about two bytes per filtered document in this
case, both in RAM and on disk. Perhaps one won't even need the disk,
but adding some code to read/write such filters from a stream is easy.

I'd still prefer to give a score to each category based on the best
(say 100) document scores for the initial query, even though that would
need some more coding to keep track of the category scores.

These compact filters should work reasonably well as long as the
categories don't contain too many documents. For larger categories
these filters have the disadvantage that they still need to access every
document number for doing an intersection with the category, but BitSets
have this problem too.

Regards,
Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: how to select top categories.

Posted by Chris Hostetter <ho...@fucit.org>.

: for this site, but would you cash all manufacturers and intersect all with
: the initial query in one page load? Seems like that would be alot.

Yep it is a lot, but if you've got the RAM, it's not that time intensive.
At CNET, depending on what page you are looking at, i'm doing
anywhere from 100-1000 intersections.

: So your saying that in a single page load I might be able to do one intitial
: query, and intersect thousands of bitsets in under a second with and an
: index of around 5 million documents, assuming that the server/pc is decent

I can't predict exactly what your performance will be based on your
index size -- and i certainly can't make any promises about it being under
a second total time ... but you should try it, i think it's very feasible.

: speed and enough memory? I still would have to cache thousands of 625k
: bits.. Could I do this with files instead of RAM maybe?

i played with serializing BitSets to disk, it can be done .. but reading
the cached files has a definite added cost. if you really don't have the
ram to store alll of the bitsets in memory then you're going to want to at
least use an in memory cache for some fixed number of commonly used
categories ... wether that cache falls back to reding from disk on miss,
or falls back to executing the category specific filter/query again
depends on your goals.

you should also keep in mind that if you goal is only to suggest *good*
matching categories and not neccessarily to provide a list of *all*
matching categories or hte *top* matching categories sorted by hits, then
you don't need to cache/intersect all of hte BitSets ... you can try the
biggest ones first, and ocne you've got "enough" "good" matches you can
stop and return what you have so far the the user.

quantifying "enough" and "good" depends (again) on your goals.

: On 1/25/06, Chris Hostetter <ho...@fucit.org> wrote:
: >
: >
: > You will likely find this thread interesting...
: >
: >
: > http://www.nabble.com/Announcement%3A-Lucene-powering-CNET.com-Product-Category-Listings-t266441.html
: >
: > : 1) Do queries for each sub-category using the results of the first
: > initial
: > : query and use the hits count to select the sub-categories to display,
: > but I
: > : might have thousands of sub-categories and it would be too slow..
: >
: > The key is not to repeat the query for every sub-cat with an added clause,
: > it's to do the query once using a HitCollector that generates a BitSet of
: > all matching results, and then intersect that with BitSets returned by
: > Filter's (or other HitCollectors) that you've used to get a list of *all*
: > results in each sub-cat.
: >
: > why is this faster? you ask .. because only the initial query changes on
: > each user search -- the set of all documents in a sub-cat doesn't change
: > untill new documents are added or deleted, so they can be cached (either
: > manually, or using CachingWrappingFIlter) ... doing a few thousand BitSet
: > intersections doesn't take as much time as you think.
: >
: >
: >
: >
: > -Hoss
: >
: >
: > ---------------------------------------------------------------------
: > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
: > For additional commands, e-mail: java-user-help@lucene.apache.org
: >
: >
:

-Hoss

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: how to select top categories.

Posted by Mike Austin <ma...@gmail.com>.

Chris.. thanks for you quick response.

:doing a few thousand BitSet intersections doesn't take as much time as you
think
Even if the BitSet is around 4-5 million? and I would have to quickly go
through about a thousand of these?

I guess I would have to decide what sub-cats to cache the bitsets for. But a
bitset of 5 million documents would be around 625k each (right?).. I would
need lots of RAM to do this for several thousand sub-cats. I read your link
and it would be the same for the mfgrs... How would you know what
manufacurers to display? I'm not sure how many mfgrs are in each category
for this site, but would you cash all manufacturers and intersect all with
the initial query in one page load? Seems like that would be alot.

So your saying that in a single page load I might be able to do one intitial
query, and intersect thousands of bitsets in under a second with and an
index of around 5 million documents, assuming that the server/pc is decent
speed and enough memory? I still would have to cache thousands of 625k
bits.. Could I do this with files instead of RAM maybe?

Thanks,
Mike

On 1/25/06, Chris Hostetter <ho...@fucit.org> wrote:
>
>
> You will likely find this thread interesting...
>
>
> http://www.nabble.com/Announcement%3A-Lucene-powering-CNET.com-Product-Category-Listings-t266441.html
>
> : 1) Do queries for each sub-category using the results of the first
> initial
> : query and use the hits count to select the sub-categories to display,
> but I
> : might have thousands of sub-categories and it would be too slow..
>
> The key is not to repeat the query for every sub-cat with an added clause,
> it's to do the query once using a HitCollector that generates a BitSet of
> all matching results, and then intersect that with BitSets returned by
> Filter's (or other HitCollectors) that you've used to get a list of *all*
> results in each sub-cat.
>
> why is this faster? you ask .. because only the initial query changes on
> each user search -- the set of all documents in a sub-cat doesn't change
> untill new documents are added or deleted, so they can be cached (either
> manually, or using CachingWrappingFIlter) ... doing a few thousand BitSet
> intersections doesn't take as much time as you think.
>
>
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: how to select top categories.

Posted by Chris Hostetter <ho...@fucit.org>.

You will likely find this thread interesting...

http://www.nabble.com/Announcement%3A-Lucene-powering-CNET.com-Product-Category-Listings-t266441.html

: 1) Do queries for each sub-category using the results of the first initial
: query and use the hits count to select the sub-categories to display, but I
: might have thousands of sub-categories and it would be too slow..

The key is not to repeat the query for every sub-cat with an added clause,
it's to do the query once using a HitCollector that generates a BitSet of
all matching results, and then intersect that with BitSets returned by
Filter's (or other HitCollectors) that you've used to get a list of *all*
results in each sub-cat.

why is this faster? you ask .. because only the initial query changes on
each user search -- the set of all documents in a sub-cat doesn't change
untill new documents are added or deleted, so they can be cached (either
manually, or using CachingWrappingFIlter) ... doing a few thousand BitSet
intersections doesn't take as much time as you think.




-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org