Posted to java-user@lucene.apache.org by Pablo Gomes Ludermir <go...@gmail.com> on 2005/04/24 16:05:26 UTC

categorized search

Hi all,

I have indexed a field that describes the "category" of the document.
Thus, I want to know how many categories have a specific term. Could
someone help me to get this with good performance?

Regards,
Pablo
-- 
Pablo Gomes Ludermir
gomesp@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: categorized search

Posted by Chris Hostetter <ho...@fucit.org>.
well ... once you have the list of all "category" names that are in docs
which match your original query, you can either redo the original query
with "and category:XXXX" to get the counts, or you can pre-compute (and
save) a BitSet for each category in your index (easy to build using a
HitCollector or a Filter), and find the cardinality of the intersection of
each of those BitSets with a BitSet from your search (again: using a
HitCollector on your original query)
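
A minimal sketch of the BitSet-intersection idea, using only java.util so
it runs without Lucene. The per-category BitSets and the query BitSet are
hand-built stand-ins for what a HitCollector or a Filter would give you;
the names CategoryBitSetCounts, countPerCategory and bits, and the doc ids,
are made up for illustration.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

class CategoryBitSetCounts {

    // For each category, count the docs that are set in both the
    // category's BitSet and the BitSet of docs matching the query.
    static Map<String, Integer> countPerCategory(Map<String, BitSet> categoryBits,
                                                 BitSet queryBits) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> e : categoryBits.entrySet()) {
            // Clone first: BitSet.and() modifies the receiver, and the
            // pre-computed per-category BitSets are meant to be reused.
            BitSet overlap = (BitSet) e.getValue().clone();
            overlap.and(queryBits);
            counts.put(e.getKey(), overlap.cardinality());
        }
        return counts;
    }

    // Hypothetical helper: build a BitSet from a list of doc ids.
    static BitSet bits(int... docIds) {
        BitSet b = new BitSet();
        for (int id : docIds) {
            b.set(id);
        }
        return b;
    }

    public static void main(String[] args) {
        // Assumed data: which docs belong to which category.
        Map<String, BitSet> categoryBits = new LinkedHashMap<>();
        categoryBits.put("search engines", bits(1, 2, 3, 5));
        categoryBits.put("open source", bits(4, 6, 7));

        // Assumed data: docs that matched the original query.
        BitSet queryBits = bits(1, 2, 3, 4, 7);

        // prints {search engines=3, open source=2}
        System.out.println(countPerCategory(categoryBits, queryBits));
    }
}
```

The per-category BitSets only need rebuilding when the index changes, so
each search costs one clone-and-intersect per category.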

for the record: this is not a trivial task.  i've described the bare basics
of the issue ... but there's a lot of processing going on to get these
kinds of numbers.

if you search the list for "category" and "count" you'll find this has
come up at least one other time in the last few months.


: Date: Thu, 5 May 2005 20:37:19 +0200
: From: Pablo Gomes Ludermir <go...@gmail.com>
: Reply-To: java-user@lucene.apache.org,
:      Pablo Gomes Ludermir <go...@gmail.com>
: To: java-user@lucene.apache.org
: Subject: Re: categorized search
:
: Chris,
:
: That was partially what I needed. You got it right: I needed the
: number of categories in which a particular term appears (and it
: works).
: But I would also like to know in how many documents in each category
: that term appears.
:
: For instance: title:lucene appears in the categories "search engines"
: and "open source software"; it appears in documents 1, 2 and 3
: in the category "search engines" and in documents 4 and 7 in the
: category "open source". I could not get it to work yet (maybe because
: of my lack of experience with Lucene).
: Could someone give me a hand?
: Thanks
: Pablo
:



-Hoss




Re: categorized search

Posted by Pablo Gomes Ludermir <go...@gmail.com>.
Chris,

That was partially what I needed. You got it right: I needed the
number of categories in which a particular term appears (and it
works).
But I would also like to know in how many documents in each category
that term appears.

For instance: title:lucene appears in the categories "search engines"
and "open source software"; it appears in documents 1, 2 and 3
in the category "search engines" and in documents 4 and 7 in the
category "open source". I could not get it to work yet (maybe because
of my lack of experience with Lucene).
Could someone give me a hand?
Thanks
Pablo



-- 
Pablo Gomes Ludermir
gomesp@gmail.com



Re: categorized search

Posted by Chris Hostetter <ho...@fucit.org>.
: >I have indexed a field that describes the "category" of the document.
: >Thus, I want to know how many categories have a specific term. Could
: >someone help me to get this with good performance?

I think I'm reading this question differently than Chuck, so I'll toss out
something totally different...

as I understand it, you've indexed a bunch of documents, with a variety of
fields, one of which is "category" (for example, maybe you are indexing
news articles, that each have a "title", "description", "url", and
"category").  Now you have a term like "title:lucene" (or
"description:pope") and you want to know the number of unique terms in the
category field that exist in articles that contain your input term.

If that's what you're looking for, then you can probably achieve this by:
  1) make a TermQuery for your input term (ie: "title:lucene")
  2) put that TermQuery in a QueryFilter, and call bits(reader)
  3) call FieldCache.DEFAULT.getStrings(reader,"category")
  4) loop over the true bits in the BitSet from #2, and for each one, add
     the corresponding entry from the String[] in #3 to a Set.

when you're all done, the Set will be the list of categories, and the size
of that Set is the number (i think) you wanted.
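
The last step of that list (walking the set bits and collecting category
values into a Set) can be sketched roughly as below. This is plain
java.util code, not tested against Lucene: the BitSet stands in for the
result of bits(reader), the String[] stands in for the array from
FieldCache, and the class name, method name and sample data are all
invented for the example. The null check is a defensive assumption for
docs that have no category value.

```java
import java.util.BitSet;
import java.util.Set;
import java.util.TreeSet;

class DistinctCategories {

    // Collect the distinct category values of every doc whose bit is set.
    // categoryByDoc is indexed by doc id, one category string per doc.
    static Set<String> categoriesOfMatches(BitSet matches, String[] categoryByDoc) {
        Set<String> categories = new TreeSet<>();
        // nextSetBit returns -1 when there are no more set bits.
        for (int doc = matches.nextSetBit(0); doc >= 0; doc = matches.nextSetBit(doc + 1)) {
            if (doc < categoryByDoc.length && categoryByDoc[doc] != null) {
                categories.add(categoryByDoc[doc]);
            }
        }
        return categories;
    }

    public static void main(String[] args) {
        // Assumed data: category of docs 0..7.
        String[] categoryByDoc = {
            "news", "search engines", "search engines", "search engines",
            "open source", "news", "news", "open source"
        };

        // Assumed data: docs 1, 2, 3, 4 and 7 contain the input term.
        BitSet matches = new BitSet();
        for (int doc : new int[] {1, 2, 3, 4, 7}) {
            matches.set(doc);
        }

        Set<String> categories = categoriesOfMatches(matches, categoryByDoc);
        // prints [open source, search engines] -> 2
        System.out.println(categories + " -> " + categories.size());
    }
}
```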


(DISCLAIMER: I've never actually used FieldCache, i'm just giving you my
advice based on reading the javadocs)

-Hoss




Re: categorized search

Posted by Chuck Williams <ch...@allthingslocal.com>.
Pablo Gomes Ludermir wrote:

>Hi all,
>
>I have indexed a field that describes the "category" of the document.
>Thus, I want to know how many categories have a specific term. Could
>someone help me to get this with good performance?
>  
>
If you want a complete count of all documents in the index for each 
category term, you should keep these counts in your own data structure 
as you index.  If you want to determine a global count for a specific 
category term on demand, I'd suggest using IndexReader.termDocs().  If 
you want to determine counts relative to the results of a particular 
search query, then you need to either iterate the results of the query 
and count occurrences of the category term(s), or perform queries that 
AND the additional category term of interest (the former will be much 
faster if you are computing this for many category terms simultaneously).
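
As a rough illustration of the "iterate the results and count" option,
here is a sketch in plain Java with no Lucene calls. It assumes you
already have the matching doc ids and some per-document lookup of the
category value; the int[] and String[] below are made-up stand-ins for
those, and CategoryTally / tally are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

class CategoryTally {

    // One pass over the hits: look up each hit's category and bump its
    // counter. This produces counts for all categories at once.
    static Map<String, Integer> tally(int[] hitDocIds, String[] categoryByDoc) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int doc : hitDocIds) {
            String category = categoryByDoc[doc];
            if (category != null) { // skip docs with no category value
                counts.merge(category, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed data: category of docs 0..7.
        String[] categoryByDoc = {
            "news", "search engines", "search engines", "search engines",
            "open source", "news", "news", "open source"
        };

        // Assumed data: doc ids returned by the search.
        int[] hits = {1, 2, 3, 4, 7};

        // prints {open source=2, search engines=3}
        System.out.println(tally(hits, categoryByDoc));
    }
}
```

The alternative, one extra AND query per category, repeats the search
once for every category, which is why the single pass tends to win when
there are many categories.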

Chuck

