You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Dima May <di...@gmail.com> on 2007/03/18 06:55:32 UTC

categorizing results

 <ja...@lucene.apache.org>

I have a Lucene related questions/problem.

My search results can potentially get very large 200,000+. I want to
categorize my results. So for example if I have an indexed field "type" that
has such things as CDs, books, videos, power drills, or anything else in the
world, I would want to display how many results are found for each unique
"type". So I may get 200 CDs and 3000 books and so on.

The brute-force way of doing that is traversing the returned
documents/results and counting each unique type. Unfortunately this solution
is too slow. I am having trouble figuring out a faster approach and was
hoping someone can suggest one.

Thank you for your time.

Dima

Re: categorizing results

Posted by Dima May <di...@gmail.com>.

Nicolas,

Thank you for the reply! The problem that my categories are not static they
are generated at runtime, they are added and removed all the time. For that
reason I have no way to pre-generate the filters.

Dima

On 3/18/07, Nicolas Lalevée <ni...@anyware-tech.com> wrote:
>
> Le dimanche 18 mars 2007 06:55, Dima May a écrit:
> >  <ja...@lucene.apache.org>
> >
> > I have a Lucene related questions/problem.
> >
> > My search results can potentially get very large 200,000+. I want to
> > categorize my results. So for example if I have an indexed field "type"
> > that has such things as CDs, books, videos, power drills, or anything
> else
> > in the world, I would want to display how many results are found for
> each
> > unique "type". So I may get 200 CDs and 3000 books and so on.
> >
> > The brute-force way of doing that is traversing the returned
> > documents/results and counting each unique type. Unfortunately this
> > solution is too slow. I am having trouble figuring out a faster approach
> > and was hoping someone can suggest one.
>
> The solution I use is using filters. So before doing the search, you have
> to
> already compute a pool a filters, one filter for each category.
> Then use the search(Query, HitCollector), and use a collector which will
> dispatch in different queues the results, depending of which filter
> matched.
> The collect function of the collector should look like that :
>
>         public void collect(int doc, float score) {
>             total++;
>             for (int i = 0; i < filters.length; i++) {
>                 if (filters[i].get(doc)) {
>                     DocScoreCategory docScoreCategory = new
> DocScoreCategory(doc, score, i);
>                     queue[i].insert(docScoreCategory);
>                     subTotals[i]++;
>                 }
>             }
>         }
>
> It is nearly as fast as a normal search but has the background that some
> filters have to be computed, and it can take some time. In our application
> updates are done not so often, so we can generate them in background. With
> an
> index permanently updated this will be very time consuming.
>
> Nicolas
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: categorizing results

Posted by Nicolas Lalevée <ni...@anyware-tech.com>.

Le dimanche 18 mars 2007 06:55, Dima May a écrit :
>  <ja...@lucene.apache.org>
>
> I have a Lucene related questions/problem.
>
> My search results can potentially get very large 200,000+. I want to
> categorize my results. So for example if I have an indexed field "type"
> that has such things as CDs, books, videos, power drills, or anything else
> in the world, I would want to display how many results are found for each
> unique "type". So I may get 200 CDs and 3000 books and so on.
>
> The brute-force way of doing that is traversing the returned
> documents/results and counting each unique type. Unfortunately this
> solution is too slow. I am having trouble figuring out a faster approach
> and was hoping someone can suggest one.

The solution I use is using filters. So before doing the search, you have to 
already compute a pool a filters, one filter for each category.
Then use the search(Query, HitCollector), and use a collector which will 
dispatch in different queues the results, depending of which filter matched. 
The collect function of the collector should look like that :

        public void collect(int doc, float score) {
            total++;
            for (int i = 0; i < filters.length; i++) {
                if (filters[i].get(doc)) {
                    DocScoreCategory docScoreCategory = new 
DocScoreCategory(doc, score, i);
                    queue[i].insert(docScoreCategory);
                    subTotals[i]++;
                }
            }
        }

It is nearly as fast as a normal search but has the background that some 
filters have to be computed, and it can take some time. In our application 
updates are done not so often, so we can generate them in background. With an 
index permanently updated this will be very time consuming.

Nicolas

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: categorizing results

Posted by Erick Erickson <er...@gmail.com>.

You might also want to search the mail archive for
"faceted search" and/or Categories. This topic has been discussed
under that heading, I believe...

Erick

On 3/18/07, Dima May <di...@gmail.com> wrote:
>
> <ja...@lucene.apache.org>
>
> I have a Lucene related questions/problem.
>
> My search results can potentially get very large 200,000+. I want to
> categorize my results. So for example if I have an indexed field "type"
> that
> has such things as CDs, books, videos, power drills, or anything else in
> the
> world, I would want to display how many results are found for each unique
> "type". So I may get 200 CDs and 3000 books and so on.
>
> The brute-force way of doing that is traversing the returned
> documents/results and counting each unique type. Unfortunately this
> solution
> is too slow. I am having trouble figuring out a faster approach and was
> hoping someone can suggest one.
>
> Thank you for your time.
>
> Dima
>